What you’ll do
At Doctolib, we're on a mission to transform healthcare through the power of AI. As a Senior Data Engineer, you'll play a key role in building and optimizing the data foundations within the AI Team to deliver safe, scalable, and impactful models.
You will join a dedicated team working on data infrastructure for LLM-, VLM-, and RAG-based systems, powering our new AI Medical Companion.
Your work will ensure that our engineers and data scientists can train, evaluate, and deploy AI models efficiently on high-quality, well-structured, and compliant data.
Your responsibilities include but are not limited to:
- Ensure high standards of data quality for AI model inputs.
- Design, build, and maintain scalable data pipelines on Google Cloud Platform (GCP) for AI and machine learning use cases.
- Implement data ingestion and transformation frameworks that power retrieval systems and training datasets for LLMs and multimodal models.
- Architect and manage NoSQL and vector databases to store and retrieve embeddings, documents, and model inputs efficiently.
- Collaborate with ML and platform teams to define data schemas, partitioning strategies, and governance rules that ensure privacy, scalability, and reliability.
- Integrate unstructured and structured data sources (text, speech, image, documents, metadata) into unified data models ready for AI consumption.
- Optimize the performance and cost of data pipelines using GCP-native services (BigQuery, Dataflow, Pub/Sub, Cloud Storage, Vertex AI).
- Contribute to data quality and lineage frameworks, ensuring AI models are trained on validated, auditable, and compliant datasets.
- Continuously evaluate and improve our data stack to accelerate AI experimentation and deployment.
Who you are
You could be our next teammate if you have:
- A Master's or Ph.D. degree in Computer Science, Data Engineering, or a related field.
- 5+ years of experience in Data Engineering, ideally supporting AI or ML workloads.
- Strong experience with the GCP data ecosystem.
- Proficiency in Python and SQL, with experience in data pipeline orchestration (e.g., Airflow, Dagster, Cloud Composer).
- A deep understanding of NoSQL systems (e.g., MongoDB) and vector databases (e.g., FAISS, Vector Search).
- Experience designing data architectures for RAG, embeddings, or model training pipelines.
- Knowledge of data governance, security, and compliance for sensitive or regulated data.
- Familiarity with W&B, MLflow, Braintrust, or DVC for experiment tracking and dataset versioning (extract snapshots, change tracking, reproducibility).
- Familiarity with containerized environments (Docker, Kubernetes) and CI/CD for data workflows.
- A collaborative mindset and a passion for building the data foundations of next-generation AI systems.
What we offer
- Free health insurance for you and your children
- Parent Care Program: receive one additional month of leave on top of the legal parental leave
- Free mental health and coaching services through our partner Moka.care
- For caregivers and workers with disabilities: a package including remote policy adaptations, extra days off, and psychological support
- Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
- A Works Council subsidy to refund part of your sports club membership or creative classes
- Up to 14 days of RTT
- Lunch vouchers with a Swile card
The interview process
- HR Screen
- Technical Deep Dive
- System Design
- Behavioral Interview
- Reference check and criminal records check
- Offer!