Customer data platforms: practical, hard-earned lessons

Over the past two years, we’ve implemented customer data platforms in various forms, including Snowplow with Metabase, GA4-based CDPs, Omni CDI, and Segment-based CDPs.

We’ve worked on CDPs for companies processing tens of millions of events, such as Adapex, as well as for scaling brands like the UK’s Vyde, which applied a customer data platform to a very specific business problem.

We’ve also developed several in-house tools on top of these platforms, such as the Segment WordPress Plugin, Terraform components for infrastructure (downloaded over 1000 times), and metric models.

Since the initial fascination with customer data platforms (CDPs) has already evolved into informed optimism, I’m eager to share my seasoned insights on what truly drives ROI in CDP projects. In this article, I will also address some of the pain points we’ve encountered and explain steps to alleviate them.

I write this from the context of mid-sized growth brands, our typical clients. They are ambitious, budget-conscious, agile, and prefer full control over their data and pipelines rather than relying on the latest SaaS solutions, even if they’re really shiny and are promoted with engaging ads. They have small (or no) analytics teams, value quick execution, and are keen to understand the technical details.

Challenges in real-life CDP implementations

The choice between vendor lock-in and massive infrastructure

When selecting a CDP, I often felt caught between two realities: one involving significant monthly costs, typically with SaaS-based CDPs, and the other requiring substantial team and infrastructure resources, typical for extensive open-source CDP platforms, with little middle ground to play with, especially for smaller companies. On top of that, shifting between these realities is rarely quick or easy.

This made me realise that it’s essential to understand that CDP platforms integrate deeply into businesses. Choosing a CDP means not only adopting its technology into your codebase but also embedding it into company processes and possibly even culture. This commitment results in high switching costs—financially and psychologically—as sentiment towards a CDP platform can change. I’ve observed this shift due to recurring data transfer costs linked to user numbers, the inability to customize certain features, or difficulty proving business value from using a CDP. There will be vendor lock-in, so choose the vendor you want to be locked in with, knowing that a transition to a new CDP platform will take around 2-3 months, possibly less with an experienced team. A few key assumptions and lessons will make that choice easier.

First, future user counts and data sources are inherently uncertain, so your CDP should accommodate spikes and a changing morphology of the marketing and analytics stack from which you feed data into your CDP for downstream activation or warehousing.

CDP tracking SDKs look ready-made, but they will need significant adaptation to work fully with your destinations and backends. While the initial scripts can usually be deployed quickly, the deeper integration work, such as identifying unique users or maintaining user profiles, typically cuts deep into your backend, system APIs, or even your database.
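
As a minimal sketch of what that deeper integration looks like, assuming a Segment-style analytics.js snippet is already on the page: the /api/session/user endpoint and its response shape are hypothetical, and the point is that the SDK can only resolve identities once your backend hands it a durable user ID.

```typescript
// Minimal sketch: the identify call is only as good as the backend that
// supplies the durable user ID behind it.
declare const analytics: {
  identify: (userId: string, traits: Record<string, unknown>) => void;
};

type SessionUser = { userId: string; email: string; plan: string };

async function onLoginSuccess(): Promise<void> {
  // Assumed backend endpoint returning the canonical user record.
  const res = await fetch('/api/session/user');
  const user: SessionUser = await res.json();

  // Ties the anonymous device ID collected so far to the backend's user ID,
  // so profiles can be merged across devices and sessions.
  analytics.identify(user.userId, { email: user.email, plan: user.plan });
}
```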

Your team will treat the CDP as part of their codebase. The long-term sentiment towards a CDP in your company depends on how strong the urge to build things is within your team. This urge is fueled by the need to control operations, integrate deeper monitoring into the CDP pipelines, build internal know-how, and understand what’s going on in general. If that urge is strong, your team will feel confined by a black-box CDP SaaS. We felt it from the moment we implemented our first CDP. Once we had the resources to experiment and started using platforms like Snowplow, other tools began to seem like toys. It taught us the complexities of load-balancing traffic, choosing and scaling warehouses for large event volumes, and the importance of custom contexts and event validation schemas.
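
To make that last point concrete, here is a minimal sketch of a self-describing event with a custom context using Snowplow’s browser tracker; the collector URL and Iglu schema URIs are placeholders for schemas you would define and register yourself.

```typescript
import { newTracker, trackSelfDescribingEvent } from '@snowplow/browser-tracker';

// Placeholder collector endpoint and app ID.
newTracker('sp1', 'https://collector.example.com', { appId: 'shop-frontend' });

// A self-describing event plus a custom context, each validated against a
// JSON Schema registered in your own Iglu repository (URIs are placeholders).
trackSelfDescribingEvent({
  event: {
    schema: 'iglu:com.example/checkout_started/jsonschema/1-0-0',
    data: { cartValue: 149.0, currency: 'GBP' },
  },
  context: [
    {
      schema: 'iglu:com.example/user_segment/jsonschema/1-0-0',
      data: { tier: 'returning', ltvBand: 'high' },
    },
  ],
});
```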

A team’s skills must match the complexity of the CDP platform you plan to implement. In particular, operating an open-source CDP in your own cloud grants control over your entire infrastructure and data, but this control comes at the cost of allocating developer resources to infrastructure maintenance and operations. These platforms have steep learning curves and span a broad range of cloud resources, demanding a significant time investment. Just because you can easily deploy some of these solutions using Terraform (such as one of our modules linked above) doesn’t mean they will be just as easy to run and maintain; understanding their internals is crucial. Once set up, you’ll want to customize them to fit your needs (it’s open source, after all), which is a significant challenge and also requires a dedicated team. If you have that team, you will be happy, but what if you don’t?

Among the companies we’ve worked with, some matched perfectly with one of the two realities because it checked all the boxes for them, while others kept looking for better solutions. Wanting to understand why, we found a number of common denominators. These companies are data-mature enough to consider a private customer data platform, typically having tested several CDPs in the past, but they lack the extensive analytics teams needed to run some of the more advanced open-source CDPs, and some of the well-known SaaS-based CDPs will likely bore and annoy them with their limitations. They naturally prefer running operations exclusively on their own cloud, maintaining complete data control, having autonomy over their systems, customizing systems to fit exact needs, and learning from their own mistakes. They want their infrastructure to be easily portable, because portability allows for experimentation with different stacks. By the way, it was these characteristics, and the pieces missing to serve them, that initially prompted us to develop Omni CDI, our customer data infrastructure designed for deployments in private clouds, which aims to meet these requirements.

Trying to force the concept of source and destination onto a reality that doesn’t fit it

From what I’ve seen, successfully using CDPs and event pipelines is akin to how oil companies approach oil extraction. They first explore and drill to identify quality resources, followed by a detailed refinement process tailored to the specific raw material. Each resource undergoes rigorous treatment to remove impurities and ensure only high-quality products are distributed. They vigilantly monitor this entire process. This is not a typical approach for CDPs these days.

In a typical CDP implementation these days, raw events are assumed to be useful and flow directly from sources to destinations in one direction. Advanced implementations may use event filters or mappings, and plenty of schema validation, but they still essentially pass along raw data as it arrives from the sources. This approach is flawed.

CDPs are enclosed systems and subject to the laws of entropy: if raw data flows freely through the system, chaos follows, and unless steps are taken, that chaos will increase over time, causing data quality downstream to deteriorate. This deterioration undermines the original value-add of CDPs: accurate, real-time activation and the ability to interact with customers across channels.

An effective CDP implementation must recognize that there is a hierarchy of events, and that hierarchy is determined by the business goals they serve. Some events, if incorrect, could derail the entire company, while others have little business value and are merely informative. At different levels of this hierarchy, the processing must differ. The implementation and platform should support multiple pipelines with vastly different morphologies depending on the event “rank”, all while running in real-time without resorting to solutions like Reverse ETL.

For instance, payment or signup events need fraud detection, email validation, and LTV estimation before being fed to any destination, while web visit or button click events only need basic filtering and schema validation. Some events must be fully anonymized, while others need not be. Some need consent enrichment. Third-party events often require complete restructuring, cleaning, and filtering to be useful, as they are generated by external companies unaware of your specific business needs for the event at hand.
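
As a rough illustration of rank-aware routing, here is a hedged sketch; the event names, ranks, and processing steps are examples, not a prescribed taxonomy.

```typescript
// Illustrative only: ranks and processing steps are examples.
type Rank = 'critical' | 'standard' | 'informational';

interface PipelineSpec {
  rank: Rank;
  steps: string[]; // processing stages applied before any destination sees the event
}

const pipelineByEvent: Record<string, PipelineSpec> = {
  payment_completed: {
    rank: 'critical',
    steps: ['schema_validation', 'fraud_check', 'email_validation', 'ltv_estimate', 'consent_enrichment'],
  },
  signup_completed: {
    rank: 'critical',
    steps: ['schema_validation', 'email_validation', 'consent_enrichment'],
  },
  page_viewed: {
    rank: 'informational',
    steps: ['schema_validation', 'bot_filter'],
  },
};

function resolvePipeline(eventName: string): PipelineSpec {
  // Unknown events drop to the lowest rank rather than blocking the stream.
  return pipelineByEvent[eventName] ?? { rank: 'informational', steps: ['schema_validation'] };
}
```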

We managed to alleviate some of these challenges by adapting certain CDPs to this variety of events and pipelines: categorizing events into prioritized groups, routing them to dedicated sources (sinks) or collectors, and treating them uniquely at both the collection and ingestion stages using tools such as custom source or destination functions or the enrichment API. Still, it was always cumbersome and code-intensive, or frankly quite ugly. We also attempted to move these enrichment and validation processes upstream of the CDP’s ingestion stage, but this required hardcoding a lot of data processing logic into the client system, which compounded the already significant development overhead of CDP projects. Making it universal and portable was challenging, which motivated us to explore our own approaches to this problem.

Here comes another shameless plug for Omni CDI, our customer data infrastructure designed for deployments in private clouds. At the core of our own developments lies the belief that raw and unprocessed events cannot inherently be trusted and are noise that may or may not carry some signal. The purpose of a CDP is not to process everything but primarily to process signal and remove noise with each consecutive step. The path to achieving this goal involves reducing event dimensionality and producing rich, well-structured payloads at the outset of the CDP process. Maybe you actually don’t need all these 120 properties on every single event.
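
As a sketch of what reducing dimensionality can mean in practice, the allow-list below keeps only the properties a destination or model actually consumes; the event names and property lists are hypothetical.

```typescript
// Hypothetical allow-list: only the properties a destination or model actually
// uses survive; the remaining raw properties are treated as noise.
const allowedProps: Record<string, string[]> = {
  order_completed: ['order_id', 'value', 'currency', 'coupon'],
  page_viewed: ['path', 'referrer'],
};

function reduceEvent(name: string, raw: Record<string, unknown>): Record<string, unknown> {
  const keep = allowedProps[name] ?? [];
  // Drop every property that is not explicitly allowed for this event.
  return Object.fromEntries(Object.entries(raw).filter(([key]) => keep.includes(key)));
}
```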

I believe that the ideal model for a CDP is a flexible container accommodating various processing pipelines tailored to the diversity of events or event groups, rather than merely a source-to-destination powerhouse.

Analytics iterations are too slow, leading to low or negative ROI

This will, of course, depend on the size of your team and how well you cooperate, but it’s not just that—hear me out.

When done quickly, accurate analytics have a multiplier effect and create a strong compounding loop. All things being equal, more effective and targeted analytics lead to better outcomes per unit of resources invested. Whether it’s achieving cheaper conversions through high-quality ad-matching data fed to the Meta Conversion API from your online or offline conversions, or generating higher repurchase values by running a real-time SMS campaign for lost visitors, it’s about making more of the resources you have.

For analytics to be done quickly, there must be a laser focus on implementing a few very specific events and pipelines known to unlock value or efficiency—not trying to catch a goldfish with a massive net, collecting all the garbage along the way. Then these top-ranking events must be implemented using a proper process.

We don’t see analytics being done fast, and we had some challenges with this ourselves. We realised that CDP implementation needs a more efficient process too. By design, the currently most renowned CDP platforms implement analytics in phases. It starts with creating tracking plans, optionally configuring them in data validation systems such as custom schema managers or event protocol validators, and then moves through event implementation, destination configuration, and infrastructure development. This is the minimum scope. Each step requires coordination involving at least two or three people, often more. Miscommunication or misunderstanding at any step propagates across multiple phases, and once derailed, a project is challenging to get back on track. Providing ROI quickly in this situation is simply very difficult. The solution is to break the silos and have the analytics pipeline implemented end to end by a maximum of three people: one dev-oriented, one analytics- and infra-oriented, and possibly a marketer to make sense of all this data for growth.
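
One lightweight way to keep such a small team aligned is to treat the tracking plan as code that the developer, the analytics/infra person, and the marketer review in the same pull request. The sketch below is hypothetical; the event names, owners, and destinations are only examples.

```typescript
// Hypothetical tracking-plan-as-code: one artefact the whole team reviews together.
interface PlannedEvent {
  name: string;
  priority: 1 | 2 | 3;   // 1 = directly tied to a revenue use case
  owner: string;         // who answers for this event end to end
  requiredProps: string[];
  destinations: string[];
}

const trackingPlan: PlannedEvent[] = [
  {
    name: 'order_completed',
    priority: 1,
    owner: 'dev',
    requiredProps: ['order_id', 'value', 'currency', 'email_hash'],
    destinations: ['meta_capi', 'warehouse'],
  },
  {
    name: 'page_viewed',
    priority: 3,
    owner: 'analytics',
    requiredProps: ['path'],
    destinations: ['warehouse'],
  },
];
```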

What we’ve found is that, in their current state, platforms don’t prompt prioritization of data, pipelines, and events; instead, they push tracking everything that moves. On the surface they may say otherwise, but their business models promote tracking as much as possible. The inability to pinpoint the singular, or top few, extremely specific business cases for analytics optimization leads to a lack of momentum in these projects. These priorities must be determined with surgical precision, not lukewarm guesses.

Again, I’m writing this from the perspective of a small growth brand—our typical client. In other cases, the reception of this process may differ, especially with a large analytics team on board.

When we set out to create some of our CDP tools, our goal was clear: could we speed up CDP projects and make them configurable and runnable by smaller teams, reducing dependencies and increasing individual responsibility? Our aim was also to lower costs to facilitate broader adoption of real-time analytics among companies.

Warehousing tied to raw data and often not real-time

We don’t entirely agree with the way current state-of-the-art CDP systems approach warehousing.

Every major CDP, whether open-source or SaaS, includes warehousing. In open-source tools, warehousing the data was typically the original motivation for creating the platform, and the pipeline architecture was primarily designed to meet warehouse needs. These legacy decisions carry over into the real-time activation world we live in now, and you can feel it when you dig deep into these systems (although this is changing for some platforms at the moment). For these platforms, the warehouse is paramount and clean warehouse data is a top priority: data must undergo rigorous validation before ingestion, while activation of that data is treated as secondary.

The alternative view in other CDPs is that warehouses function merely as destinations linked to any data source—whether batch data sources or real-time data streams. In this approach, a CDP warehouses everything you feed it without much consideration. Sure, with a high-priced plan, you can add some filters or event validations, but no small brand can afford to pay thousands per month for a CDP. This approach is also problematic and can easily disrupt or degrade data models. How many times have we seen a situation where a developer updates a tracking schema without proper communication with data modelers or warehouse administrators, inadvertently altering the warehouse schema and affecting downstream metrics?

Neither approach is ideal in my opinion—warehousing is neither a standard destination nor the ultimate purpose of a CDP, which, as we see it, exists primarily to activate data in real-time.

We assert that warehousing is a crucial part of the data activation process, but distinct from standard destinations (also a part of the activation). Proper warehousing begins with clean and enriched data in the activation layer (rather than raw data captured by collectors) and requires specialized treatment, including ensuring all events are validated before being warehoused. This approach allows for rapid modifications in data collection for downstream destinations, while minimizing the risk of inadvertently disrupting the warehouse’s data integrity or downstream metrics with changed payloads.

Another frustrating limitation we encounter is the batched data sync offered by many CDP platforms, which restricts syncing to a few times a day (especially in lower pricing plans). Whether small or large, businesses deserve real-time data access for timely decision-making in 2024, which is only achievable with real-time warehousing and streaming inserts.
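
To make the streaming-inserts point tangible, here is a hedged sketch of a validate-then-warehouse step using the BigQuery Node.js client; the dataset and table names are placeholders, and isValid() stands in for whatever schema validation your activation layer already runs.

```typescript
import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

interface CleanEvent {
  event_name: string;
  user_id: string;
  occurred_at: string; // ISO timestamp
  properties: string;  // JSON-encoded, already reduced and enriched
}

// Placeholder validation; in practice this would be your schema check.
function isValid(event: CleanEvent): boolean {
  return Boolean(event.event_name && event.user_id && event.occurred_at);
}

export async function warehouseEvent(event: CleanEvent): Promise<void> {
  if (!isValid(event)) return; // invalid events never reach the warehouse

  // Streaming insert: the row is queryable within seconds instead of waiting
  // for the next batched sync window.
  await bigquery.dataset('analytics').table('events').insert([event]);
}
```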

Downstream activation (data destinations) needs data points that are not tracked automatically

SaaS CDP platforms make it very easy to connect a data flow to certain destinations, often with just a click of a button. However, this ease of configuration does not automatically ensure that your data flow includes all the necessary event attributes required for the destination to function properly.

For example, our clients are often very surprised when they start using the Meta Conversion API via a SaaS CDP. This usually comes to light when configuring the ingestion of offline events for full-funnel optimization within the Meta API ecosystem, such as backend or third-party event flows. Their payloads do not natively carry a number of very important advertising identifiers that the Meta API uses to match users of its platforms with conversions. Although these events can technically be plugged into and ingested by the Meta Conversion API very easily in most SaaS CDP systems, they first need significant enrichment, structuring, and cleaning before they can be used effectively and deliver good matching rates. Without this enrichment, businesses are deprived of the ability to optimize their conversion journeys for the bottom parts of their funnels, and this adds to the confusion about the CDP system.
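
To show the kind of enrichment we mean, here is a simplified sketch that prepares an offline order for the Meta Conversion API: hashing the email and phone, and carrying the fbp and fbc identifiers through. The input shape is hypothetical; the user_data field names follow Meta’s published parameter spec.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical offline order shape; only fields relevant to matching are shown.
interface OfflineOrder {
  userId: string;
  email?: string;
  phone?: string;
  fbp?: string; // Meta browser ID, captured client-side and stored with the user
  fbc?: string; // Meta click ID, captured client-side and stored with the user
  value: number;
  currency: string;
}

// Meta expects email and phone hashed with SHA-256 after normalization.
const sha256 = (value: string) =>
  createHash('sha256').update(value.trim().toLowerCase()).digest('hex');

// Builds the user_data block of a Conversion API payload from an offline order.
function toMetaUserData(order: OfflineOrder) {
  return {
    em: order.email ? [sha256(order.email)] : undefined,
    ph: order.phone ? [sha256(order.phone.replace(/\D/g, ''))] : undefined,
    fbp: order.fbp, // sent unhashed
    fbc: order.fbc, // sent unhashed
    external_id: [sha256(order.userId)],
  };
}
```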

This implementation challenge is caused by two things.

The first is fundamentally an architectural issue and design choice that haunts a number of CDP systems and maps back to what we’ve already mentioned: event flows are assumed to be useful by default and thought to be ready to plug into destinations right away. As we’ve already established, this is a false premise, especially when it comes to offline and third-party event flows. The data processing architecture must allow event flows to undergo a multi-step cleaning, structuring, and enrichment process before they have any chance of being fed to destinations. You can achieve a simplified version of this using the custom enrichment functions and APIs that CDP platforms provide, but the viability of that approach depends on the task at hand and is generally code-intensive.

The second issue is that the tracking SDKs of various CDP platforms only carry general, not destination-specific, data points. Destination-specific data points must typically be captured using custom scripts that extract data from session cookies or other storage and append these data points to the event payload sent to the CDP collector. For this, you need to do thorough homework at the tracking plan stage and make sure the event payloads you capture carry the rich data your destinations expect.
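
As an example of such a script, the browser sketch below reads Meta’s _fbp and _fbc cookies and attaches them to an outgoing payload; the track call in the comment is illustrative rather than tied to any particular SDK.

```typescript
// Reads a cookie by name from document.cookie; returns undefined if absent.
function readCookie(name: string): string | undefined {
  const match = document.cookie.match(new RegExp(`(?:^|; )${name}=([^;]*)`));
  return match ? decodeURIComponent(match[1]) : undefined;
}

// Appends Meta's advertising identifiers to an event payload before it is
// sent to the CDP collector, so they are available at activation time.
function withAdIdentifiers<T extends Record<string, unknown>>(payload: T) {
  return {
    ...payload,
    fbp: readCookie('_fbp'),
    fbc: readCookie('_fbc'),
  };
}

// Usage with a generic track call (SDK name is illustrative):
// analytics.track('order_completed', withAdIdentifiers({ order_id: '123', value: 49.0 }));
```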

Make your CDP robust with our lessons learned

This article shares our insights and pain points regarding CDP implementations, especially for mid-sized growth brands that prefer autonomy and are mindful of the costs associated with running a CDP. We’ve identified key challenges, such as the high cost of vendor lock-in, the need for significant infrastructure, and the complexities of adapting CDP tracking SDKs.

Effective CDP implementations must recognize that there’s a hierarchy to events and ensure clean, enriched data for real-time activation. We believe the ideal CDP model is a flexible container accommodating diverse processing pipelines tailored to specific groups of events by their importance and level in the hierarchy. CDP implementations can only provide a significant positive ROI if used within fast analytics iterations serving marketing teams.

By addressing these challenges and focusing on proper processes around analytics, even small teams can achieve CDP nirvana, activate customers at any stage of their journey, build amazing experiences, and make full use of their data.

Disclaimers

In this article, I do not present a comprehensive product recommendation, comparison or market review, but rather a collection of subjective experiences. These are generalizations based on a statistically small and somewhat biased sample. Depending on your specific business type and context, these lessons may apply to a lesser extent or not at all. Always do your own research. All brand names mentioned belong to their respective owners. We are not affiliated with these brands except as partners in some cases.

Photo attribution

As usual, the featured image of the article is a piece of abstract photography that corresponds with the article’s topic. This time, the shoutout goes to Pawel Czerwinski via Unsplash.