Almost every company, whether they realize it or not, has some form of customer data infrastructure, even if it’s just a Facebook Pixel or a GA4 tracker. These tools represent the basic components of a modular system designed to process customer data, which is how we define a customer data infrastructure.
The capabilities and ROI of such infrastructure are a separate question, though. A basic setup using only GA4 or Tag Manager might track conversions, but it won’t enable the marketing team to engage customers across different channels and devices or personalize communication throughout the customer journey.
When we build customer data infrastructure, we aim to achieve the following business goals for our clients:
- Unified customer view: Create a single view of the customer that can be activated across all channels and devices.
- Real-time data capture: Ensure data is captured in real-time with minimal user experience impact, ideally processed server-side.
- Real-time data activation: Enable data activation across various downstream systems, such as advertising tags, throughout the customer journey.
- Access to real-time metrics: Provide real-time metrics based on the warehousing layer, integrated with reverse ETL and AI/ML solutions.
Achieving these goals requires much more than a simple tag manager and GA4. Delivering demonstrable ROI from these systems is why we’ve developed custom-built data platforms and infrastructures, such as:
- Segment CDP with Metabase and Omni Analytics
- Snowplow CDP with GTM Server, Metabase, and Omni Analytics on AWS
- GA4 CDP with GTM Server, Metabase, and Omni Analytics
- Omni CDI with Metabase
In this article, we explain what customer data infrastructure is, how it relates to customer data platforms, and the key layers you need to consider when planning your customer data infrastructure to maximize ROI from it.
Customer data infrastructure is not limited to a customer data platform
One of the main questions we are asked is about the difference and relationship between customer data platforms (CDPs) and customer data infrastructures (CDIs).
The primary difference is that a customer data platform is a closed system designed to capture and activate customer data. Typical customer data platforms include Segment or Rudderstack. When working with customer data platforms, we faced several challenges that hindered our ability to maximize ROI from data for business growth. We discussed these issues in detail in a recent opinion piece.
In contrast, customer data infrastructure is typically an open-ended collection of independent components that are configured and integrated to meet specific business goals. Customer data infrastructure can include a customer data platform, as demonstrated by our CDI solutions that utilize tools like Segment, Metabase, and Omni Analytics, all in one customer data infrastructure, to provide better ROI.
Although customer data infrastructure is often architecturally more extensive than a customer data platform, a customer data platform is not necessarily more limited functionally. It can contain all the same functional layers, but those layers cannot be broken down into smaller pieces for individual customization or adaptation, or doing so would be very costly. With customer data infrastructure, you can customize and adapt components individually, and this ability to adapt and customize is what defines customer data infrastructure in the first place.
At Datomni, we build customer data infrastructures that sometimes utilize customer data platforms, and we do it to maximize ROI from data for our clients.
The layers of the customer data infrastructure
To ensure maximum ROI, customer data infrastructure should include the following elements:
- Tracking strategy
- Data collection/capture
- Data enrichment
- Identity resolution and enrichment
- Privacy and consent management
- Data activation
- Attribution
- Infrastructure
- Warehousing
- Data integration/ETL and Reverse ETL
- Data modeling
- Business intelligence
Let’s look at the business purpose of each layer and the typical tools used in it.
Tracking strategy and plan
The tracking strategy serves as a documentation and guide for other parts of the customer data infrastructure.
The core of a tracking strategy is identifying key data sources covered by the customer data infrastructure and the methods for capturing events from them. These typically include a marketing website, a backend system such as a SaaS platform, and third-party data sources like payment webhooks.
A tracking strategy should document the following:
- What events to track: Specify the events to be tracked, including their payloads, triggers, and how they are captured in code, whether client-side or server-side.
- Which properties to capture: Define the properties collected with each event, such as marketing parameters, additional identifiers, or descriptors. Properties can also include identity details.
- Why track specific events and how to use data from them: Provide reasons (business goals, that is) for tracking specific events. Assigning priorities to events and treating each set of events as pipelines that should come with their own transformations is crucial. We’ve discussed this at length in our article about the lessons learned in implementing customer data infrastructures.
- How to warehouse specific events: Specify how each event will be stored and how the corresponding data will be used in developing business metrics.
If your customer data infrastructure includes a third-party customer data platform such as Segment, then that platform will almost always have its own requirements and tracking plan specifications that you need to follow to maximize the value from the customer data platform or infrastructure.
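To make this concrete, a single entry in a tracking plan might look like the sketch below, expressed here as a Python dict. The event name, properties, and warehouse destination are illustrative assumptions rather than a prescribed schema.

```python
# A minimal, illustrative tracking-plan entry expressed as a Python dict.
# Event name, properties, and destinations are hypothetical examples,
# not a prescribed schema.
TRACKING_PLAN = {
    "order_completed": {
        "trigger": "server-side, after the payment webhook confirms the charge",
        "capture": "server-side",
        "properties": {
            "order_id": "string, internal order identifier",
            "revenue": "float, order value in USD",
            "currency": "string, ISO 4217 code",
            "utm_source": "string, last-touch marketing parameter",
        },
        "identity": ["user_id", "anonymous_id"],
        "business_goal": "measure revenue per channel and feed ad bidding",
        "priority": "critical",
        "warehouse": {
            "dataset": "events",
            "table": "order_completed",
        },
    },
}
```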
Data collection and capture
The data collection layer involves tracking individual events generated by the data sources, and structuring them into payloads outlined in the tracking strategy.
In the data collection layer, data is captured in real-time by the CDP/CDI system. While different systems may approach this differently, events are typically collected either client-side, through tracking libraries loaded in browsers, or server-side, by sending events to vendor-provided endpoints. Deciding what to track client-side or server-side is a separate topic, but for mission-critical events, it is generally advisable to track server-side rather than client-side.
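As a rough illustration of server-side capture, the sketch below posts a structured event payload from backend code to a collector endpoint. The endpoint URL and payload fields are assumptions for illustration, not any specific vendor’s API.

```python
import json
import time
import urllib.request

# Hypothetical collector endpoint; real vendors (Segment, Snowplow, GTM Server)
# each define their own URL and payload schema.
COLLECTOR_URL = "https://collector.example.com/events"

def track_server_side(event_name: str, user_id: str, properties: dict) -> None:
    """Send a single event to the collector from backend code."""
    payload = {
        "event": event_name,
        "user_id": user_id,
        "properties": properties,
        "sent_at": time.time(),
    }
    request = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()  # raises on network errors; body is ignored

# Example: fire a mission-critical event from the backend, not the browser.
track_server_side("order_completed", "user_123", {"revenue": 49.0, "currency": "USD"})
```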
Here are some notable solutions in this area that we have experience with:
- Amplitude: A low-code platform that provides deep behavioral reports, allowing non-technical team members to access insights easily.
- Heap: A low-code analytics software that tracks user activities and supports both qualitative and quantitative data, enabling marketers to optimize customer journeys.
- Mixpanel: A self-serve analytics platform that captures events and uses them to run real-time data reports, understand key user behaviors, and measure metrics.
- RudderStack and Segment: These CDPs come with their own data collection layers, capturing events from multiple systems, both client-side and server-side. They offer ready-made integrations and support capturing both first-party and third-party data.
- Snowplow: An open-source data collector with its Behavioral Data Platform, Snowplow captures rich, first-party customer data from multiple sources and validates it against specific schemas.
- Omni Analytics: Our own data collection platform, Omni Analytics, captures rich events, formats and structures them, and prepares them for dispatch to various data consumers and activation layers. It is a Dockerized platform that can be easily deployed in your own cloud environment and comes with numerous event source integrations.
Data QA and enrichment
Once your raw events are captured, only a small subset will be ready for use in the activation layers. Most will be of limited use in their raw form.
Data enrichment is the process that makes this data ready for use, particularly for developing a deeper understanding of your target audiences and building better experiences.
Data enrichment runs in customer data infrastructures, though it may not always be implemented as a separate tool or module. Instead, it can be a feature within other tools. We emphasize data enrichment as a separate component here because of its critical role in maintaining high data quality.
Enrichment can act on data as it moves through the data collection layer or on data that has already been collected and unified. This process enriches customer profiles with additional third-party data points.
This approach is particularly important with the phase-out of cookies, which requires business focus on acquiring first-party data. Data enrichment enhances first-party data with third-party data sources, resulting in richer customer profiles, including demographic data, interests, verified contact information, and behavioral patterns.
For example, in our Pipedrive implementation package, we perform both real-time and post-processing enrichment. Real-time enrichment processes contact information such as phone numbers, email addresses, and consent details to add data points like email and phone quality before it reaches the Pipedrive account. On top of that, we also utilize Pipedrive’s Smart Data feature for in-place enrichment, adding additional data points to maximize data quality.
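To make the idea concrete, here is a minimal sketch of real-time enrichment that adds an email-quality flag to a contact event before it is forwarded downstream. The quality heuristic and field names are simplified assumptions, not the logic of any particular vendor.

```python
import re

# Simplified, illustrative email-quality heuristic; a production setup would
# typically call a verification service instead.
DISPOSABLE_DOMAINS = {"mailinator.com", "tempmail.com"}
EMAIL_RE = re.compile(r"^[^@\s]+@([^@\s]+\.[^@\s]+)$")

def enrich_contact_event(event: dict) -> dict:
    """Attach an email_quality data point to a contact event in-flight."""
    enriched = dict(event)
    match = EMAIL_RE.match(event.get("email", ""))
    if not match:
        enriched["email_quality"] = "invalid"
    elif match.group(1).lower() in DISPOSABLE_DOMAINS:
        enriched["email_quality"] = "disposable"
    else:
        enriched["email_quality"] = "ok"
    return enriched

# Example: the enriched event is what continues downstream, e.g. to a CRM.
print(enrich_contact_event({"email": "jane@mailinator.com", "consent": True}))
```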
There are several systems for data enrichment:
- Snowplow: Implements real-time data enrichment via their Enrichment API, adding additional data points as they pass through the system.
- Omni Analytics: Our CDP platform, part of the composable, Dockerized Omni CDI, allows configuring each event pipeline to perform sequential enrichments, resulting in very rich datasets.
- Similarweb: Enriches customer profiles with data and insights about website traffic, engagement metrics, referral sources, and audience demographics.
- ZoomInfo: Focused on B2B customers, ZoomInfo enriches lead and contact data with first-party and third-party B2B data.
- SafeGraph: Specializes in location-based enrichments, providing global data sets that add location data points.
- Numerator: Offers first-party customer purchase and behavioral data from over 1 million U.S. households for a better understanding of customers.
- Affinity Solutions: Provides data about consumer spending with transaction data from more than 140 million credit and debit cards.
Identity management (resolution, enrichment, and onboarding)
Identity resolution is the process of associating multiple touchpoints of the same customer with a persistent, unified identity.
This process helps marketers reach customers on multiple devices and helps optimize advertising for lifetime value, rather than just the purchase value of the current session or the duration of cookies. Having a permanent identity linked to a person enables a complete understanding of their customer journey, which is not only valuable in itself but can also inspire campaigns for every stage of the funnel.
A key component of identity resolution is identity enrichment. This involves enhancing the standard events with additional data points related to a specific user identity. The implementation is heavily dependent on the use case and should be well-planned in the tracking strategy. It generally involves either descriptive or aggregate information about the identity performing the event. For example, enriching the flow with the total number of purchases made to date and their total value can improve ad bidding strategies.
The final component of identity resolution is identity onboarding, which is typically a feature of standalone identity platforms and is more specific to advertising technology rather than martech. Onboarding involves ingesting identity data into a third-party advertising platform and determining who should see an ad campaign. This process is a special case of data activation, which we will discuss later. The key benefit is the personalization it provides to marketers.
Identity resolution can be integrated into a standard CDP platform as a coupled layer or as an external system that manages building and maintaining identities while syncing them with downstream layers.
End-to-end CDPs typically implement identity resolution through dedicated methods in tracking and data collection SDKs, such as Segment’s or RudderStack’s “identify” method, which links specific events to a unified identity.
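The sketch below illustrates the core idea behind an “identify”-style call: once an anonymous visitor logs in or converts, their anonymous ID is linked to a persistent user ID so that earlier and later events can be attributed to the same person. The in-memory mapping is a deliberate simplification of what a CDP or identity service actually maintains.

```python
# Minimal, in-memory illustration of identity resolution. Real systems keep
# this mapping in a database or identity graph, not a dict.
anonymous_to_user: dict[str, str] = {}

def identify(anonymous_id: str, user_id: str) -> None:
    """Link an anonymous device/browser ID to a persistent user identity."""
    anonymous_to_user[anonymous_id] = user_id

def resolve(event: dict) -> dict:
    """Stamp an incoming event with the unified user_id when one is known."""
    resolved = dict(event)
    resolved["user_id"] = anonymous_to_user.get(event["anonymous_id"])
    return resolved

# Example: pre-login pageviews and a later purchase end up under one identity.
identify("anon-42", "user_123")
print(resolve({"anonymous_id": "anon-42", "event": "page_viewed"}))
print(resolve({"anonymous_id": "anon-42", "event": "order_completed"}))
```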
Other systems, like our own Omni CDI, implement identity resolution as dedicated microservices, Dockerized and securely deployed in a client’s cloud to manage and enrich identities. This system is used by default in Omni CDI implementations. Like all components of Omni CDI, it is primarily intended for deployment in private clouds.
Among end-to-end identity platforms that run as external systems, one of the most powerful options is LiveRamp, which we have worked with extensively, as documented in our Adapex customer data infrastructure case study on our blog. LiveRamp helps businesses onboard, identify, connect, unify, control, and activate data across different channels and devices.
Another notable system is FullContact, which assists businesses with identity verification and building tailored customer experiences by unifying data and insights in near real-time.
Merkle’s platform offers another solution in the customer data infrastructure world by assigning a cookieless ID to anonymous visitors and then enriching this ID with person-based data, allowing for downstream activation for targeting and analysis.
Privacy and consent management
Since customer data infrastructure processes the complete lifecycle of customer data and activates it in various intermediate tools, a privacy and consent layer is a must. All events that are processed through the CDI should go through this layer, especially if they are processed downstream and linked to unified identities.
There are two main components of the privacy and consent layer:
- Consent management: Ensures that data capture aligns with user consent and privacy preferences at all times. Several third-party tools, such as Cookiebot, effectively capture and store consents in a central location for further processing.
- CDP-specific consent managers: Platforms like Segment offer consent managers that can be deployed on any website using their service. These managers capture user consent and transmit it downstream to Segment collectors.
Additionally, every customer data infrastructure should implement a privacy layer that provides data clean rooms for secure and compliant data analysis. In these environments, personally identifiable information is anonymized, processed, and stored securely. Solutions such as OneTrust, a risk management platform, help teams manage privacy, risk, data governance, and compliance. Habu, a data clean room provider, enforces high security and privacy standards while enabling collaborative intelligence without moving data.
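A minimal sketch of how a consent gate can sit in front of downstream destinations: each event carries the user’s consent categories, and only destinations covered by that consent receive it. The category names and destination labels below are illustrative assumptions.

```python
# Illustrative consent categories required by each downstream destination.
DESTINATION_CONSENT = {
    "warehouse": "analytics",
    "facebook_ads": "advertising",
    "email_platform": "marketing",
}

def allowed_destinations(event: dict) -> list[str]:
    """Return only the destinations the user's consent covers."""
    granted = set(event.get("consent", []))
    return [dest for dest, category in DESTINATION_CONSENT.items() if category in granted]

# Example: a user who consented to analytics only is kept out of ad platforms.
event = {"event": "page_viewed", "consent": ["analytics"]}
print(allowed_destinations(event))  # ['warehouse']
```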
Data activation
Data activation is the process by which real-time events and enriched data pipelines with unified identifiers are turned into actual business value, through real-time experiences and through marketing and advertising campaigns that target well-defined audiences. This is where the return on investment (ROI) of your data efforts is realized. Without activation, even the most robust customer data infrastructure will not deliver ROI. The data must be consumed and materialized in the data activation layer, turning insights into profit through personalized experiences and effective advertising.
There are multiple approaches to data activation within customer data platforms, which can vary in complexity and functionality:
- Integrated activation: In end-to-end CDP platforms like Segment, data activation is often an integrated component. These platforms typically provide prebuilt “destinations” for data activation. These destinations allow you to plug your data into various tools without extensive coding, but they require that your data collection layer captures all necessary properties and parameters for downstream activation. They also offer little room for customization.
- Custom-built activation: For infrastructures composed of multiple components, such as those using Snowplow or Omni CDI, data activation can be custom-built using platforms like AWS Lambda or GTM Server. These platforms act as containers for implementing tagging systems. They offer flexibility but come with technical constraints, such as execution time limits (e.g., GTM Server’s 5-second limit). A rough sketch of such a handler follows this list.
- Ready-made activation platforms: These specialized tools focus solely on activating real-time data pipelines. Here are a few that we have experience with.
- Hightouch: An activation platform that uses reverse ETL to sync customer data from data warehouses, like Snowflake, into business tools such as CRMs and email systems. This approach ensures high-quality data is used in activation.
- Iterable: Offers a cross-channel communication platform that leverages AI to orchestrate dynamic customer experiences, helping to close the activation gap.
- Census: A data activation platform built on Snowflake, enabling you to activate customer data directly from your data cloud.
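For the custom-built route mentioned above, an activation handler is often just a small function that receives an enriched event and forwards it to one or more destination APIs. The sketch below shows the general shape of such a handler; the destination URL and payload mapping are assumptions, not any specific ad platform’s API.

```python
import json
import urllib.request

# Hypothetical destination endpoint; real activation targets (ad platforms,
# CRMs, email tools) each have their own APIs and authentication.
DESTINATION_URL = "https://ads.example.com/conversions"

def activate(event: dict) -> None:
    """Map an enriched event to the destination's format and send it."""
    if event.get("event") != "order_completed":
        return  # this handler only activates purchase events
    payload = {
        "external_id": event["user_id"],
        "value": event["properties"]["revenue"],
        "currency": event["properties"].get("currency", "USD"),
    }
    request = urllib.request.Request(
        DESTINATION_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request, timeout=5).read()
```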
Attribution
Capturing consumer attention is essential for achieving ROI in marketing. One of the most powerful use cases for any data platform is engaging a specific audience at the right place and time with the optimal message for a given audience, channel, and device. Knowing which touchpoints make that engagement pay off is the job of attribution, a key part of a CDP (customer data platform).
As a concept, attribution is straightforward. It determines the extent to which each touchpoint contributes to a conversion. Customer data infrastructure (CDI) is ideal for attribution systems and platforms because CDI focuses on unified customer identities and their lifelong touchpoints, which is a prerequisite for accurate attribution.
The value of attribution for marketers is also clear. It helps them manage and allocate budgets across different customer segments, channels, ad formats, and messages. Effective attribution should provide estimates of expected returns from investing specific amounts of money into various touchpoints along the customer journey. The ultimate goal of attribution is to create an algorithmically optimized media plan that includes owned, paid, and earned media channels.
From our experience, companies often approach attribution modeling in a custom way, building propensity models or machine learning models based on their data warehouse information. However, there are also third-party tools available for this purpose.
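As an example of the kind of custom attribution logic mentioned above, the sketch below splits conversion credit across a customer’s touchpoints using a simple linear model. The journey data and channel names are made up for illustration; production models are typically far more sophisticated.

```python
from collections import defaultdict

def linear_attribution(journeys: list[dict]) -> dict[str, float]:
    """Split each conversion's value equally across its touchpoints."""
    credit: dict[str, float] = defaultdict(float)
    for journey in journeys:
        touchpoints = journey["touchpoints"]
        share = journey["revenue"] / len(touchpoints)
        for channel in touchpoints:
            credit[channel] += share
    return dict(credit)

# Example: two converted journeys; paid search and email share the credit.
journeys = [
    {"touchpoints": ["paid_search", "email"], "revenue": 100.0},
    {"touchpoints": ["paid_search", "organic", "email"], "revenue": 90.0},
]
print(linear_attribution(journeys))
# {'paid_search': 80.0, 'email': 80.0, 'organic': 30.0}
```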
One notable example is Rockerbox, which helps marketers optimize their ad spend by identifying underperforming channels throughout the customer journey. It features rule-based attribution, multi-touch attribution, halo analysis, geo lift, and more.
Another solution is Comscore, which offers cross-platform optimization of content and advertising.
Infrastructure (only open-source or Dockerized CDIs)
Infrastructure is a critical component of any custom-built customer data infrastructure (CDI). This is not a concern if you are using an end-to-end CDP like Segment, as these systems come with pre-built infrastructure that you don’t have to worry about.
Non-SaaS-based CDPs require you to set up infrastructure yourself, which gives you full control over data processing, typically in your own cloud, and allows you to reserve instances and save costs. The infrastructure layer generally includes several key components, depending on the open-source or custom-made customer data infrastructure you build.
Key components include collectors, which gather events; enrichers, which validate and refine events; and loaders, which transfer data downstream, particularly to the warehousing layer. These can be part of a single engine, such as Omni CDI, or can run on their own servers. When managing infrastructure, you need to prepare for potential traffic spikes, which may require implementing load balancing. For example, the Snowplow-based customer data infrastructure that we offer comes with three load balancers. You also need to ensure maximum security.
Fortunately, you do not have to set up everything from scratch, as there are specific infrastructure tools that simplify the process. One tool that makes infrastructure easier to manage is Docker. For example, in Omni CDI, multiple instances of applications, including data collectors and identity platforms, run as Docker containers and automatically create the entire infrastructure.
Another solution is Terraform, an infrastructure-as-code tool that can create all necessary components. You can deploy Snowplow using Terraform or use our custom-built open-source extensions, downloaded thousands of times, to deploy Snowplow to specific infrastructures.
Warehousing
Warehousing is the process of ingesting all your enriched and validated events into a central data warehouse. It is a key component of any customer data infrastructure, essential for custom data modeling, creating dashboards with multiple metrics, and enabling data activation using reverse ETL tools. Depending on how you construct your customer data infrastructure, there are various approaches to consider.
The first approach involves end-to-end customer data platforms, which manage warehousing as part of the activation layer. These platforms allow you to ingest events into the warehouse at varying frequencies, depending on your pricing plan. Major CDP platforms are compatible with all major warehousing solutions, including BigQuery, Snowflake, PostgreSQL, and others.
Another approach, typically represented by open-source or Dockerized solutions like Omni CDI or Snowplow, treats warehousing as the final step of the processing pipeline, triggered once an event passes all validation and enrichment steps. For example, Snowplow provides both batch and real-time ingestion, with real-time ingestion currently being more popular. It supports real-time inserts into several major warehouses. Warehousing pipelines are often packaged as Terraform pipelines, and we have developed specific warehousing extensions on top of Terraform.
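As a rough illustration of the final loading step, the sketch below streams an enriched event into a BigQuery table with the google-cloud-bigquery client. The project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Placeholder identifier; replace with your own project, dataset, and table.
TABLE_ID = "my-project.events.order_completed"

def load_event(event: dict) -> None:
    """Stream one enriched, validated event into the warehouse."""
    client = bigquery.Client()
    errors = client.insert_rows_json(TABLE_ID, [event])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")

load_event({
    "user_id": "user_123",
    "event": "order_completed",
    "revenue": 49.0,
    "currency": "USD",
})
```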
Another interesting solution in this area is GA4, which offers native integration with BigQuery. We covered this in one of our recent Instagram reels, and we believe it is one of the strongest selling points of GA4-based customer data infrastructure.
Data integration/ETL and Reverse ETL
ETL (Extract, Transform, Load) involves pulling raw data from third-party systems via APIs into a central warehouse. This process includes transforming data for better organization and usability.
Marketers use ETL to gain context for metrics, improve attribution modeling, and refine customer segmentation. Reverse ETL then activates this data in downstream tools.
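To make the reverse ETL step concrete, here is a minimal sketch that reads a modeled audience from the warehouse and pushes each row to a downstream tool’s API. The query, table, and CRM endpoint are illustrative assumptions; dedicated reverse ETL tools handle scheduling, retries, and schema mapping for you.

```python
import json
import urllib.request
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Hypothetical CRM endpoint; real reverse ETL tools (Hightouch, Census)
# manage these syncs and their schedules for you.
CRM_URL = "https://crm.example.com/api/contacts"

def sync_high_value_customers() -> None:
    """Read a modeled audience from the warehouse and push it downstream."""
    client = bigquery.Client()
    query = """
        SELECT user_id, email, lifetime_value
        FROM `my-project.models.high_value_customers`
    """
    for row in client.query(query).result():
        payload = json.dumps({
            "user_id": row["user_id"],
            "email": row["email"],
            "lifetime_value": float(row["lifetime_value"]),
        }).encode("utf-8")
        request = urllib.request.Request(
            CRM_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(request, timeout=5).read()
```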
ETL can be managed by SaaS tools like Segment, which extracts data from source systems such as Intercom, or by third-party platforms like Fivetran, which automates data movement across cloud platforms. Airbyte offers an open-source alternative for data integration.
We are shifting to our Omni CDI’s Omni ETL platform, built with Docker, which addresses data quality issues not handled by other tools. Omni ETL extracts data from platforms like Google Search Console into BigQuery, with customizable sync schemas and deep data cleansing using Python Celery.
Data modeling
After processing and inserting event data into your data warehouse, the next step is to create custom data models. These models generate metrics, clean data, and prepare it for the activation layer. They ensure that customer, marketing, and advertising data is organized and ready for analysis.
We recommend using dbt (data build tool), a framework we use and contribute to with our own packages. dbt applies software engineering best practices like testing and version control to streamline data transformation. It automates dependency management and enhances reliability using SQL or Python.
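dbt is best known for SQL models, but on some adapters (for example Snowflake with Snowpark) it can also run Python models. The sketch below is a minimal illustration of one, assuming the Snowflake adapter; the upstream model name stg_orders and its columns are hypothetical.

```python
# models/customer_lifetime_value.py
# Minimal illustration of a dbt Python model, assuming the Snowflake adapter
# (Snowpark). The upstream model "stg_orders" and its columns are hypothetical.

def model(dbt, session):
    # Reference an upstream staging model and pull it into pandas.
    orders = dbt.ref("stg_orders").to_pandas()

    # Aggregate per-customer metrics used downstream for activation.
    ltv = orders.groupby("user_id", as_index=False).agg(
        total_orders=("order_id", "count"),
        lifetime_value=("revenue", "sum"),
    )
    return ltv
```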
Business Intelligence and reporting
After warehousing, enriching, and ingesting your data, the next step is to implement a reporting layer.
We advise a tailored approach when choosing reporting tools. First, determine how often dashboards need refreshing: daily, in real-time, or on-demand. Second, decide how the dashboards will be used: for ad-hoc data exploration and report generation, or for static displays on company screens. Once you have determined these two variables, you can start evaluating specific tools.
For frequent, on-demand exploration and custom model building, we recommend lightweight tools over large BI platforms like Tableau. Tools such as Google Charts or other charting libraries offer flexibility and simplicity for these needs.
12 steps to a pro customer data infrastructure
In this article we’ve covered what customer data infrastructure is and which layers it consists of. It starts with tracking strategy and data collection, followed by data enrichment and identity resolution. Privacy and consent management ensure compliance, while data activation and attribution optimize marketing efforts. Infrastructure supports the system, warehousing stores the data, and ETL processes integrate and prepare it. Finally, data modeling and business intelligence tools provide actionable insights and reporting.
Photo attribution
As usual, the featured image of the article is a photograph that corresponds with the article’s topic. This time, the shoutout goes to Rodion Kutsaiev via Unsplash.