Building a Cloud Native Data Platform Like a Product

Building a Cloud Native Data Platform isn’t just about tools. With today’s cloud services, you can quickly spin up a working datalake: you just need storage (Amazon S3), a technical data catalog (Glue Data Catalog), and a query engine (Athena). But while building a datalake is relatively easy, industrialising data processing jobs (extraction, transformation) is another story.

We tackled this challenge by transforming our datalake into a full-fledged Data Platform. In other words, we treat it as a software product, designed with a modular architecture and managed with product management best practices, to support our client’s growing portfolio of data use cases.

This thinking led us to three main pillars, which we’ll explore in this article:

The Data Processing as a Microservice pattern
Building a Cloud Native Data Platform Framework
Managing the platform with software product practices

1. Data Processing as a Microservice

The Limits of a Monolithic Approach

One model for industrialising data jobs is to centralise everything into a single processing block, made up of:

Compute (e.g., EMR Serverless running Spark)
An IAM role defining permissions
The code and dependencies, packaged in a Docker image

This architecture creates issues quickly:

Cost traceability: All jobs run on the same infrastructure, making it hard to split costs by domain or business use case.
Noisy neighbours: A single resource-hungry job can impact all others in production.
Evolvability: Upgrading infrastructure (say, the Spark version or code dependencies) affects every job at once, risking a global outage if something breaks.

The Microservice Approach

Instead, the Data Processing as a Microservice pattern creates one independent block (compute + IAM + code) per job. Each job becomes its own “autonomous” microservice.

The benefits are clear:

Costs are traceable by domain or business use case.
No more noisy neighbours: one job crashing doesn’t take others down.
Independent upgrades: infrastructure can evolve job by job, reducing global risk.

And the cost? No higher than with the monolithic approach. With serverless compute, resources incur costs only while they are active, so occasionally running 50 serverless clusters costs no more than running the same workloads on a single one.
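The cost argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes billing that is purely proportional to usage; the rate and job durations are made-up numbers for illustration, not actual AWS pricing.

```python
# Hypothetical figures for illustration only; not actual AWS pricing.
RATE_PER_COMPUTE_HOUR = 0.50  # assumed flat price per compute-hour


def serverless_cost(job_hours: list[float], rate: float = RATE_PER_COMPUTE_HOUR) -> float:
    """Cost of a set of jobs when billing is purely proportional to usage."""
    return sum(job_hours) * rate


# 50 jobs, each using 2 compute-hours per day.
jobs = [2.0] * 50

# Monolith: all jobs share a single serverless application.
monolith_cost = serverless_cost(jobs)

# Microservices: one serverless application per job; idle applications cost nothing.
microservice_cost = sum(serverless_cost([hours]) for hours in jobs)

# Side benefit of the microservice layout: the cost is already split per job.
cost_per_job = {f"job_{i}": serverless_cost([hours]) for i, hours in enumerate(jobs)}
```

Under this assumption both layouts cost the same in total, and only the per-job layout gives the breakdown for free.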

2. Building a Cloud Native Data Platform Framework

The microservice pattern has a downside: each new job often means rewriting Terraform, Airflow, and AWS config—creating code duplication.

Our solution was to factor this logic out into a Data Platform Framework, built on three key components:

Airflow DAG Generator
Like many teams, we use Airflow for orchestration. Writing DAGs was repetitive, so we automated it. A Python library now generates DAGs from YAML config. Data Engineers describe what they need, and the framework builds the DAG.
Pipeline Factory
A Terraform module deploys the full block (infra + IAM role + Docker image) from the same YAML config. This removes IaC duplication and speeds up new pipeline creation.
Datalake SDK
A Python SDK gives engineers a toolbox of standard utilities, for example:
- Automatic secret retrieval (AWS Secrets Manager), for instance to access source credentials easily in the extraction code
- Simplified DataFrame writes to the lake: the Data Engineer only has to return the DataFrame, and the Datalake SDK handles the rest
- SQL-as-code execution, à la dbt: the Data Engineer provides a file containing an SQL query, and the Datalake SDK executes the query and writes the result to an output table in the datalake

With this framework, engineers focus on business logic while the platform handles infra and boilerplate code.
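To illustrate the SQL-as-code utility mentioned above, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for the real query engine, and the function name `run_sql_file`, the table names, and the file name are all hypothetical; the actual Datalake SDK interface is not shown in this article.

```python
import sqlite3
import tempfile
from pathlib import Path


def run_sql_file(conn: sqlite3.Connection, sql_path: Path, output_table: str) -> int:
    """Run the SELECT stored in sql_path and materialise its result as output_table."""
    query = sql_path.read_text().strip().rstrip(";")
    # The output table is rebuilt on every run (assumption: full refresh).
    conn.execute(f'DROP TABLE IF EXISTS "{output_table}"')
    conn.execute(f'CREATE TABLE "{output_table}" AS {query}')
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{output_table}"').fetchone()[0]


# Stand-in for a source table in the datalake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("FR", 10.0), ("FR", 5.0), ("DE", 7.0)],
)

# The Data Engineer only provides the SQL file.
with tempfile.TemporaryDirectory() as tmp:
    sql_file = Path(tmp) / "revenue_by_country.sql"
    sql_file.write_text(
        "SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country;"
    )
    row_count = run_sql_file(conn, sql_file, "revenue_by_country")

result = dict(conn.execute("SELECT country, revenue FROM revenue_by_country"))
```

The point of the design is that the engineer's deliverable is the `.sql` file alone; where and how the result is written stays a platform concern.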

Example: Pipeline Configuration

To make this more tangible, consider a data pipeline configuration: a YAML file interpreted by both the Airflow DAG Generator and the Pipeline Factory.

The pipeline configuration first defines the DAG settings, such as the schedule interval (which sets the time and frequency of execution) and the option to specify a Slack tag to be notified when the DAG fails.
It then defines the task configuration—for instance, a CustomLambdaSensor, a custom Operator we developed that encapsulates a Sensor within an AWS Lambda function.
This task configuration also includes infrastructure details (in this case, an ECS cluster). The setup depends on the type of infrastructure used: EMR Serverless for big data jobs or ECS for smaller ones. Since data volumes vary greatly depending on the source, we implemented a hybrid compute system—giving Data Engineers the option to run distributed processing with Spark or non-distributed processing with Pandas to optimise costs.
The configuration also specifies the tables read and written by each task. This enables the Pipeline Factory to generate the corresponding IAM roles with least privilege on data and to build the Airflow DAG (for example, Task A writes to Table 1, Task B reads from Table 1, so Task B depends on Task A).
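The dependency inference described above can be sketched in a few lines. The configuration is shown as the Python dict a YAML file could be parsed into; every key name (`schedule_interval`, `slack_tag`, `infrastructure`, `reads`, `writes`) and every task and table name is an assumption made for illustration, not the framework's actual schema.

```python
# Hypothetical pipeline configuration, as it might look once the YAML is parsed.
pipeline_config = {
    "dag": {"schedule_interval": "0 6 * * *", "slack_tag": "@data-team"},
    "tasks": {
        "extract_orders": {
            "infrastructure": "ecs",
            "reads": [],
            "writes": ["raw.orders"],
        },
        "aggregate_orders": {
            "infrastructure": "emr_serverless",
            "reads": ["raw.orders"],
            "writes": ["analytics.daily_orders"],
        },
    },
}


def infer_dependencies(tasks: dict) -> dict[str, set[str]]:
    """Map each task to the upstream tasks that write a table it reads."""
    writers: dict[str, set[str]] = {}
    for name, task in tasks.items():
        for table in task["writes"]:
            writers.setdefault(table, set()).add(name)
    return {
        name: {w for table in task["reads"] for w in writers.get(table, set())} - {name}
        for name, task in tasks.items()
    }


def table_permissions(task: dict) -> dict[str, list[str]]:
    """Least-privilege view of a task: only the tables it declares."""
    return {"read": sorted(task["reads"]), "write": sorted(task["writes"])}


dependencies = infer_dependencies(pipeline_config["tasks"])
permissions = table_permissions(pipeline_config["tasks"]["aggregate_orders"])
```

Because the same declaration drives both the DAG edges and the permissions, a task cannot read a table it has not declared, and its ordering in the DAG never has to be written by hand.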

Example: Job Code

To make this more concrete, consider a data processing job that uses the Datalake SDK.

The Datalake SDK is bundled as a dependency in every processing job, giving Data Engineers access to a ready-made toolbox of data features.
One key feature of the framework is writing data to the datalake. The Data Engineer simply returns the DataFrame they want to write, and the Datalake SDK handles the upload. This abstraction layer lets the platform team add functionality under the hood without impacting engineers. For example, we added anomaly detection on ingestion volumes, which checks whether the ingested volume is abnormal compared to previous runs and raises an alert if necessary. This feature was deployed without affecting users: they didn’t need to make any changes to their pipelines to benefit from it.
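The "return the data, let the SDK write it" idea, together with the volume check, could look roughly like the sketch below. Everything here is hypothetical: the decorator name `datalake_job`, the in-memory stores standing in for the datalake and the run history, the 50% tolerance, and the use of a list of rows in place of a DataFrame (anything supporting `len()` would work the same way).

```python
from functools import wraps
from statistics import mean

# Hypothetical in-memory stand-ins for the datalake, the run history and alerting.
WRITTEN: dict[str, list] = {}
VOLUME_HISTORY: dict[str, list[int]] = {}
ALERTS: list[str] = []


def is_volume_abnormal(row_count: int, history: list[int], tolerance: float = 0.5) -> bool:
    """Flag a run whose volume deviates from the historical mean by more than tolerance."""
    if not history:
        return False  # nothing to compare against on the first run
    baseline = mean(history)
    return abs(row_count - baseline) > tolerance * baseline


def datalake_job(output_table: str):
    """Decorator: the job returns its data; the wrapper writes it and checks volumes."""
    def decorator(job):
        @wraps(job)
        def wrapper(*args, **kwargs):
            data = job(*args, **kwargs)
            history = VOLUME_HISTORY.setdefault(output_table, [])
            if is_volume_abnormal(len(data), history):
                ALERTS.append(f"{output_table}: abnormal volume ({len(data)} rows)")
            WRITTEN[output_table] = data  # a real SDK would upload to the lake here
            history.append(len(data))
            return data
        return wrapper
    return decorator


@datalake_job(output_table="raw.orders")
def extract_orders(n_rows: int) -> list[dict]:
    # Business logic only: no write code in the job itself.
    return [{"order_id": i} for i in range(n_rows)]


extract_orders(100)
extract_orders(110)
extract_orders(10)  # far below the baseline of previous runs, so an alert is recorded
```

Since the check lives in the wrapper, adding it changes nothing in `extract_orders`, which mirrors how the feature reached users without any pipeline change.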

The Datalake SDK has a dual implementation: one part for Spark and another for Pandas, depending on the infrastructure chosen by the Data Engineer. Since the interface contract is nearly identical between the two, it’s relatively easy for engineers to start a job on ECS and later migrate it to EMR Serverless if they realise the data volume is larger than expected.

Benefits

Abstraction of complexity: less Terraform/Airflow code, easier hiring, faster pipeline development, shorter time-to-prod for use cases
Consistency: shared code ensures uniform practices
Centralised upgrades: framework changes (e.g., switching to Iceberg tables) apply everywhere automatically

3. Managing the Platform Like a Software Product

Building such a framework means building a software product. Its sustainability depends on product management best practices:

Automated Testing

The framework must be automatically tested with each update to ensure that every release can be deployed confidently (giving reasonable assurance that it works). To achieve this, we created a test data pipeline that exercises all the features of the Data Platform. This test pipeline is redeployed and executed with every framework change, and if all tests pass, the release can go out.

Versioned Releases

Releases are designed to ensure the stability of existing production pipelines despite changes to the Data Platform.

Imagine if all pipelines used the latest version of the framework. Every new platform feature would automatically propagate to all production pipelines. And any bug that slipped through functional tests could break all jobs at once.

To prevent this, the framework publishes versioned releases (e.g., v1.2.3) and allows pipelines to pin the version they use. This way, a new release doesn’t impact existing pipelines. Data Engineers can upgrade when they have the bandwidth, in order to benefit from the latest framework features.

Now, imagine some pipelines are still using an older version and a bug is discovered. The Data Platform team can release a fix in a new version, but engineers may not have the bandwidth to upgrade immediately, and the upgrade may require a significant migration effort depending on the breaking changes involved. To help users in this situation, we provide operational maintenance (MCO, from the French “maintien en condition opérationnelle”): we publish fixes on demand for older versions without forcing pipeline upgrades. At the same time, to manage the platform team’s workload, we limit the lifespan of old versions by decommissioning outdated releases.
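The pinning and decommissioning logic can be sketched as follows. This assumes versions follow a `vMAJOR.MINOR.PATCH` scheme in which a patch release carries only fixes, and a made-up policy where only the two most recent minor lines are still maintained; the article does not specify the actual policy, and all function names are hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
        return cls(major, minor, patch)


def supported_lines(released: list[str], keep: int = 2) -> list[tuple[int, int]]:
    """Hypothetical policy: only the `keep` most recent minor lines receive fixes."""
    lines = sorted({Version.parse(v)[:2] for v in released}, reverse=True)
    return lines[:keep]


def is_supported(pinned: str, released: list[str]) -> bool:
    """True if the pinned version belongs to a minor line that is still maintained."""
    return Version.parse(pinned)[:2] in supported_lines(released)


def latest_fix_for(pinned: str, released: list[str]) -> str:
    """Newest patch on the same minor line as the pinned (already released) version."""
    line = Version.parse(pinned)[:2]
    newest = max(Version.parse(v) for v in released if Version.parse(v)[:2] == line)
    return f"v{newest.major}.{newest.minor}.{newest.patch}"


released = ["v1.1.0", "v1.2.3", "v1.2.4", "v1.3.0"]

still_maintained = is_supported("v1.2.3", released)
decommissioned = not is_supported("v1.1.0", released)
fix_version = latest_fix_for("v1.2.3", released)
```

A pipeline pinned to `v1.2.3` can pick up the fix in `v1.2.4` without moving to `v1.3.0`, while a pipeline pinned to `v1.1.0` is told that its line is no longer maintained.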

Stable Interface Contracts

When implementing an abstraction layer, the way users interact with the underlying logic is called an interface contract. If the Data Platform team changes this interface (for example, by renaming a parameter in a Datalake SDK function), Data Engineers must update their code before they can upgrade. When this happens too often, it causes frustration and may lead engineers to delay updates, resulting in the gradual “ageing” of pipelines in production.

Therefore, it is essential to build a robust and stable interface contract that minimises breaking changes and facilitates version upgrades for users.
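One common way to soften a parameter rename is a deprecation shim that accepts both names for a transition period. The sketch below is an illustration of that general technique, not the Datalake SDK's real API: `write_table`, `table` and `table_name` are hypothetical names.

```python
import warnings
from typing import Optional


def write_table(data, table_name: Optional[str] = None, *, table: Optional[str] = None) -> str:
    """Hypothetical SDK function whose parameter `table` was renamed to `table_name`."""
    if table is not None:
        # The old name still works, but callers are told to migrate.
        warnings.warn(
            "'table' is deprecated, use 'table_name' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if table_name is not None:
            raise TypeError("pass either 'table_name' or 'table', not both")
        table_name = table
    if table_name is None:
        raise TypeError("missing required argument: 'table_name'")
    return table_name  # a real implementation would write `data` to this table


# Old call sites keep working (with a warning); new ones use the new name.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = write_table([1, 2], table="raw.orders")

new_style = write_table([1, 2], table_name="raw.orders")
```

The old parameter can then be removed in a later major release, giving engineers a window in which upgrading the framework does not require touching their code.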

Thoughtful Feature Selection

This best practice involves balancing two well-known software principles: DRY versus YAGNI.

Part of the goal in building the Data Platform Framework was to share code to avoid redundancy (DRY – Don’t Repeat Yourself). Once you start sharing, it can be tempting to abstract everything, “just in case” it might be useful. But sometimes you realise that what you’ve abstracted is only used in a single pipeline (or worse, not at all), making the added functionality unnecessary (YAGNI – You Aren’t Gonna Need It).

Adding a feature to a framework increases its complexity, which makes the framework:

less intuitive (for both users and maintainers)
harder to maintain (larger codebase)
harder to extend (each new feature becomes more difficult to add)

In other words, sometimes declining to implement a requested feature (or implementing it differently than requested) can be the best choice to ensure the long-term sustainability of the framework.

Conclusion: Should You Build Your Own Data Platform?

This may seem like an odd question, considering the title of this article—but it’s critical. Sooner or later, someone will ask: “Why not just move to a managed solution?”

We see two main considerations:

Data Engineering vs YAML Engineering: Platforms abstract infra complexity, shifting engineers toward configuration and business logic. Depending on team culture, this is a blessing or a frustration. We adapted by evolving job descriptions as the framework matured.
Build vs Buy: In 2020, building was often necessary. Market solutions (Databricks, Snowflake, dbt) weren’t as mature. Today, the choice is harder. Building offers control over roadmap, stack, and costs (but requires ongoing investment). Managed platforms reduce run costs but tie you to a vendor’s roadmap and pricing.

The only way to make the right call is through KPIs. They are specific to each context, but here are some examples:

Time-to-prod evolution for use cases
Percentage of rollbacks due to platform bugs
Ratio of platform team time spent on build vs run

In the end, building gives freedom, buying gives simplicity. There’s no universal answer, but tracking the right metrics will tell you whether to double down on your custom platform or migrate to managed.