Lead Data Engineer (Databricks)

Sandton, Johannesburg, South Africa
Full-Time
Hybrid

Job Description:

Position: Lead Data Engineer

Contract Type: Fixed term / Contract

Contract Duration: 25 May 2026 to December 2026

Work Model: Hybrid (2-3 days a week)

Work Location: Sandton, Johannesburg, South Africa (Hybrid / Office-based as required)

Role Overview

We are seeking a Lead / Senior Data Engineer to design, build, and operate modern Databricks and Lakehouse data platforms that support advanced analytics, AI, and Generative AI use cases.

This is a senior individual contributor role, operating within product-aligned, cross-functional squads. The successful candidate will deliver high-quality, governed, scalable data assets consumed by analytics platforms, machine learning models, and Generative AI solutions, including LLM- and agent-based systems.

Key Responsibilities

1. Databricks & Data Platform Engineering

Design, build, and operate data solutions using Databricks, including:

Delta Lake
Databricks Jobs and Workflows
Unity Catalog
Notebooks and shared libraries
Develop scalable, reliable Lakehouse architectures supporting analytics and AI workloads.

2. Data Enablement & Consumption

Enable data consumption for:

Generative AI use cases (e.g. Retrieval-Augmented Generation, AI services, agent workflows)
Analytics and reporting platforms
Downstream operational and business systems
Support feature-style and curated data access patterns required by AI and GenAI workloads.

3. Generative AI Data Enablement

Build and maintain data pipelines that feed Generative AI applications, including:

Curated knowledge and reference datasets
Structured and semi-structured data sources
Metadata, lineage, and traceability for AI consumption
Enable common GenAI data patterns such as:
Retrieval-Augmented Generation (RAG)
Contextual and prompt data preparation
Model input, output, and feedback data flows

4. Engineering Standards & Best Practices

Develop production-grade data pipelines using:

Python
SQL
Apache Spark
Implement automated testing, CI/CD, and deployment practices for data workloads.
Ensure data solutions are:
Observable
Resilient
Performant
Cost-efficient
Continuously improve data quality, reliability, and operational stability.

5. Collaboration & Ways of Working

Act as a senior engineer within a cross-functional product squad.
Collaborate closely with:
Product Owners
AI / Machine Learning Engineers
Analytics teams
Platform and security teams
Provide engineering input into design discussions and delivery decisions.
Support peer reviews and contribute to shared engineering standards.
Provide mentorship and technical guidance, including contributing to the development of AI Engineers.

6. Risk, Governance & Run

Ensure all data solutions comply with enterprise security, risk, and governance standards.
Support the operational stability of data pipelines used by analytics and AI workloads.
Participate in incident resolution and root cause analysis.
Maintain appropriate technical documentation and runbooks.

Required Background & Experience:

10–15 years of industry experience in data engineering or related fields.
5+ years' experience operating as a Senior or Lead Data Engineer.

Mandatory Technical Skills (with minimum experience):

Databricks (hands-on): 2+ years
Enterprise data lake / lakehouse architecture: 5+ years
Python: 5+ years
SQL: 5+ years
Apache Spark: 5+ years
Production-grade data platforms: 3+ years
Enterprise or regulated environments: 5+ years

Mandatory Skills Summary:

Databricks
Data lake and lakehouse architecture
Python
SQL
Apache Spark
Production-grade data platforms
Enterprise or regulated environments

Desirable / Beneficial Skills:

Experience enabling AI, ML, or Generative AI use cases from a data engineering perspective

Familiarity with:

RAG data patterns
Feature-style or AI-serving datasets
Vector-based or embedding-ready data workflows
Experience working in Agile, product-aligned squads
Exposure to cloud-native data platforms such as AWS or Azure

Desired Skills Summary:

AI, ML, or Generative AI
RAG data patterns
Feature-style or AI-serving datasets
Vector or embedding-ready data workflows
Cloud-native data platforms (AWS or Azure)