Introducing Omni Infrastructure with 3 open-source Terraform modules

Ever since we launched Omni CDI, our customer data infrastructure solution, a couple of months ago, our goal has been to help small and medium-sized businesses build their customer data infrastructure in their own cloud, avoiding costly external SaaS options and giving them full control over their data.

Omni CDI’s first product, Omni Analytics, which launched a month ago, is a Dockerized platform for data collection and enrichment with a simple interface. It captures unstructured or semi-structured events and processes them into enriched, structured events through filtering and aggregation, making them ready for activation (performance or warehousing).

Today, we’re launching another layer of Omni CDI: the infrastructure itself, including its key warehousing component. This brings us closer to our ultimate goal of providing full portability for customer data infrastructure.

Building infrastructure with code

To enable rapid deployment of the infrastructure layer, we use Terraform.

Here’s why we value Terraform:

  • Consistency: Defining infrastructure as code keeps it consistent across environments and eliminates manual errors.
  • Version Control: We can version infrastructure like code, enabling change tracking, collaboration, and rollback.
  • Automation: We reduce the time and effort needed for provisioning and scaling.
  • Scalability: We can easily scale infrastructure based on demand, optimizing resource use and costs.

Omni Infrastructure within the broader scope of Omni CDI

As outlined in our lessons learned on building customer data infrastructures, we see warehousing as a step in the data activation layer, where enriched events are transformed into business value.

Our warehousing infrastructure is based on the open-source Snowplow framework. We offer two types of Terraform modules: end-to-end pipelines, which handle the full warehousing lifecycle, including schema validation, and dedicated warehouse components for lighter setups.

The Omni Infrastructure/Warehousing pipeline also powers the real-time version of the Omni Reporting pipeline, delivering rich, granular dashboards that track business growth in real time. Omni Warehousing pipelines are designed to consume events generated by Omni Analytics.

Introducing the Omni Infrastructure/Warehousing Terraform modules

End-to-end warehousing pipelines

The warehousing pipeline follows three key steps. First, collectors capture the event. The event is then validated against schemas designed to meet reporting requirements, such as business metrics. Once validated, it is ingested into the warehouse in real time for reporting or other uses. Schema validation is the most important step, because it keeps poor-quality data out of the warehouse.
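The capture, validate and ingest flow described above can be sketched in a few lines of Python. This is an illustration only, not how the Snowplow applications are implemented: the schema format, the `Warehouse` class and the `validate` and `ingest` functions are all hypothetical names, and the warehouse is an in-memory stand-in.

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration: field name -> required Python type.
PURCHASE_SCHEMA = {"event_id": str, "user_id": str, "amount": float}


@dataclass
class Warehouse:
    """In-memory stand-in for the warehouse plus a store for rejected events."""
    good: list = field(default_factory=list)
    bad: list = field(default_factory=list)


def validate(event: dict, schema: dict) -> list:
    """Return a list of validation errors; an empty list means the event is valid."""
    errors = []
    for name, expected_type in schema.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}")
    return errors


def ingest(event: dict, schema: dict, warehouse: Warehouse) -> bool:
    """Validate a captured event and load it only if it matches the schema."""
    errors = validate(event, schema)
    if errors:
        # Invalid events are kept aside with their errors instead of being loaded.
        warehouse.bad.append({"event": event, "errors": errors})
        return False
    warehouse.good.append(event)
    return True
```

In this sketch, an event with a missing `user_id` or a string-typed `amount` never reaches `warehouse.good`, which is the point of validating before loading: reports only ever read rows that match the agreed schema.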

Terraform AWS Elasticsearch pipeline 

A Terraform module that deploys a pipeline to load Snowplow data into Elasticsearch using the Snowplow Open Source artefacts. This module builds the Collector application, the Enrich application and the Elasticsearch Loader.

See how to install this module in the Infrastructure section of the Omni CDI docs.

Terraform AWS Databricks pipeline 

A Terraform module that deploys a pipeline to load Snowplow data into Databricks using the Snowplow Open Source artefacts. This module builds the Collector application, the Enrich application and the Databricks Loader.

See how to install this module in the Infrastructure section of the Omni CDI docs.

Warehousing components 

Our second approach provides modules for individual components. If you already have a pipeline set up, this offers additional options for scalability or enhanced reporting.

Terraform Elasticsearch cluster

This module creates a simple, single-node Elasticsearch cluster on AWS.

See how to install this module in the Infrastructure section of the Omni CDI docs.

What’s next for Omni Infrastructure

Omni Infrastructure is a key development within Omni CDI. We are working to expand support for more databases, with a focus on columnar storage for large analytical queries. Future enhancements include CI/CD pipeline integrations and expanded cloud support, including Google Cloud and AWS.

Photo attribution

As usual, the featured image of the article is a photograph that corresponds with the article’s topic. This time, the shoutout goes to Giorgio Trovato via Unsplash.