Snowplow run cost: how high?

The Snowplow infrastructure, or pipeline, is a key component of our Snowplow-based customer data infrastructure. This infrastructure includes the pipeline itself, along with the GTM Server or AWS Lambda in the activation layer and Metabase in the reporting layer. We’ve deployed it numerous times, including in one of our latest projects for Adapex, a US-based ‘Inc. 5000’ company. They use Datomni’s Snowplow-based CDI to capture first-party event and identity data at scale, with LiveRamp in the activation layer. One of the questions our clients ask most often is how much they will pay for their Snowplow pipeline once it is in production.

If you’ve been following our blog, you know we like to write in-depth articles that can be quite long. We aim to provide full context so you can make informed decisions and build awareness. Instead of just offering quick tips, we focus on delivering well-seasoned insights that come from our experience.

This article, however, is different. We’re skipping the nitty-gritty details and background information about Snowplow to answer your specific question: How much is Snowplow going to cost me? Let’s get right into it.

Unit costs for the resources needed to run Snowplow

In this outline, we’re assuming a real-time Snowplow pipeline running on AWS with Postgres for warehousing. All recommendations assume you fit our ideal company profile for implementing Snowplow, meaning you have around 10-15 million events per month from multiple data sources with real-time reporting and activation needs. We believe this is the sweet spot where Snowplow really starts to make sense.

Iglu

Iglu helps validate events going through your Snowplow pipeline against a set of predefined schemas.

Iglu server

The Iglu server handles queries for schema definitions stored within the schema registry, providing access to Iglu schemas upon request.

Recommended resource: 1 x Auto-scaling group (capacity 1) – t3.micro, gp2 10 GB storage.

Costs depend on the number of EC2 instances, instance type, and amount of storage. Approximately $30 per month.

Iglu load balancer

The Iglu load balancer is one of the two load balancers in the Snowplow pipeline; it balances and routes traffic to the Iglu Server.

Recommended resource: 1 x Application Load Balancer

The cost of an Application Load Balancer (ALB) is based on the number of Load Balancer Capacity Units (LCUs) used per hour, including partial hours. LCUs are measured across four dimensions: new connections, active connections, processed bytes, and rule evaluations; you are billed on whichever dimension is highest, and the per-LCU price depends on the region of deployment.

We estimate the minimum cost to be approximately $20, but it can increase significantly with higher event volumes. ALB is charged at $0.008 per LCU hour. For more details, visit AWS ALB Pricing.
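
To see how the LCU math plays out, here is a minimal sketch in Python using the $0.008 per LCU-hour quoted above and the $0.0225 per ALB-hour base rate used in the case study below; the assumed average LCU usage is illustrative, not measured.

```python
# Rough monthly cost of one ALB: fixed hourly charge plus LCU-hours.
# Unit prices are the ones quoted in this article; the average LCU usage
# is an assumption you should replace with your own traffic profile.

HOURS_PER_MONTH = 730
ALB_HOURLY_RATE = 0.0225   # USD per ALB-hour
LCU_HOURLY_RATE = 0.008    # USD per LCU-hour

def alb_monthly_cost(avg_lcus: float) -> float:
    """Estimate the monthly cost of one ALB given its average LCU usage."""
    fixed = ALB_HOURLY_RATE * HOURS_PER_MONTH
    variable = avg_lcus * LCU_HOURLY_RATE * HOURS_PER_MONTH
    return fixed + variable

# Example: an ALB averaging 0.5 LCUs lands at roughly $19-20 per month,
# in line with the ~$20 minimum estimated above.
print(f"{alb_monthly_cost(0.5):.2f} USD")
```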

Iglu database

Recommended resource: 1 x RDS Database (db.t3.micro, 10 GB storage).

Costs depend on the instance type and storage used. For example, a db.t3.micro, which we recommend here, with 10 GB of storage, would cost approximately $15 per month.

Pipeline

The Snowplow pipeline processes events throughout their (rather short) lifecycle: from capture in the collector, through validation and enrichment, to loading into the warehouses.

Pipeline load balancer

This application load balancer directs incoming HTTP(S) traffic to the Collector instances. (See above for details on this resource).

Recommended resource: 1 x Application Load Balancer.

Pipeline: collector, enricher, loaders

The Snowplow pipeline needs the following EC2 resources:

  • Collector: EC2 instance which receives raw Snowplow events over HTTP(S), serializes them, and writes them to Kinesis.
  • Enricher: EC2 instance that reads raw Snowplow events, validates and enriches them, then writes the enriched events to another stream.
  • RDS Enriched Loader: EC2 instance for loading enriched events into RDS. 
  • RDS Bad Loader: EC2 instance for loading bad events into RDS. 
  • S3 Bad Loader: EC2 instance which reads from the bad event streams and writes to the bad event S3 folders.
  • S3 Raw Loader: EC2 instance for reading and writing raw events into S3.
  • S3 Enriched Loader: EC2 instance for reading and writing enriched events into S3.

Recommended resources:

  • 7 x Auto-scaling group (capacity 1) –  t3.micro, gp2 10 GB storage.
  • 1 x S3 bucket. The cost depends on the amount of data stored and the storage class used. For one bucket, the cost is $0.023 per GB; PUT and GET request costs depend on the number of requests made (see the sketch below).
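
As a ballpark for this part of the stack, here is a small Python sketch of the monthly cost of the seven-instance fleet plus S3, using the on-demand t3.micro rate and the $0.023 per GB S3 price quoted in this article. The S3 PUT request price and the per-million-event S3 usage are assumptions borrowed from the case study further down.

```python
# Ballpark monthly cost of the 7-instance collector/enricher/loader fleet
# plus S3, using prices quoted in this article. The S3 PUT price and the
# per-million-event S3 usage are assumptions for illustration only.

HOURS_PER_MONTH = 730
T3_MICRO_HOURLY = 0.0104        # USD, on-demand
S3_STORAGE_PER_GB = 0.023       # USD per GB-month
S3_PUT_PER_1000 = 0.005         # USD per 1,000 PUT requests (assumed)

def ec2_fleet_cost(instances: int = 7) -> float:
    """Monthly cost of the pipeline fleet (the Iglu instance is billed separately above)."""
    return instances * T3_MICRO_HOURLY * HOURS_PER_MONTH

def s3_cost(millions_of_events: float) -> float:
    # Assumption (see the case study below): ~3,500 PUTs and ~0.5 GB per 1M events.
    puts = 3_500 * millions_of_events
    gb_stored = 0.5 * millions_of_events
    return gb_stored * S3_STORAGE_PER_GB + (puts / 1_000) * S3_PUT_PER_1000

print(f"EC2 fleet: {ec2_fleet_cost():.2f} USD/month")    # ~53 USD
print(f"S3 at 15M events: {s3_cost(15):.2f} USD/month")  # well under 1 USD
```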

Storage

The storage layer consists of the main warehouse, which captures your processed events, and DynamoDB tables used for checkpointing, i.e. tracking which data has been consumed from the Kinesis streams.

Recommended resources:

  • 1 x RDS Database (db.t3.medium + 500 GB).

Using a db.t3.medium Postgres instance with 500 GB of storage would cost approximately $360 per month.

  • 7 x DynamoDB table (provisioned read capacity 1 + write capacity 1)
    • 1 x DynamoDB read capacity auto-scaling (min 1, max 50)
    • 6 x DynamoDB read capacity auto-scaling (min 1, max 10)
    • 7 x DynamoDB write capacity auto-scaling (min 1, max 50)

The cost of each DynamoDB table depends on the provisioned read/write capacity. One hour of provisioned RCU costs around $0.00013, while one hour of provisioned WCU costs $0.00065. In general, it may be difficult to estimate since provisioned capacity scales up and down depending on need, ranging from 1 to 50. However, the case study provided below should give you an idea.
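
Using the hourly RCU/WCU prices above, the short Python sketch below shows how the baseline of 1 RCU and 1 WCU per table adds up over a month; the scaled example is purely illustrative of what auto-scaling can do to the bill.

```python
# Monthly cost of provisioned DynamoDB capacity for the checkpoint tables,
# using the hourly RCU/WCU prices quoted above.

HOURS_PER_MONTH = 730
RCU_HOURLY = 0.00013   # USD per provisioned RCU-hour
WCU_HOURLY = 0.00065   # USD per provisioned WCU-hour

def dynamodb_monthly_cost(tables: int = 7, rcus: float = 1, wcus: float = 1) -> float:
    """Cost of `tables` tables, each with `rcus`/`wcus` provisioned on average."""
    return tables * HOURS_PER_MONTH * (rcus * RCU_HOURLY + wcus * WCU_HOURLY)

print(f"Baseline (1 RCU + 1 WCU per table): {dynamodb_monthly_cost():.2f} USD")   # ~3.99 USD
print(f"Scaled (avg 5 RCU + 5 WCU per table): {dynamodb_monthly_cost(rcus=5, wcus=5):.2f} USD")
```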

Streams

We need four streams to run the Snowplow pipeline:

  • Raw event stream: Captures raw collector payloads.
  • Enriched event stream: Captures validated and enriched events.
  • Bad 1 stream: Captures events that fail during processing by the Collector, Enrich, or Loaders.
  • Bad 2 stream: Captures events that fail during the transfer from Bad 1 stream to the S3 bad event folder.

Recommended resource: 4 x Kinesis stream (1 shard, 24h retention).

The cost of a Kinesis stream depends on the number of shards, the retention period, and the region. 

For the four single-shard streams in a US region, shard-hours alone come to roughly $44 per month, with PUT payload units adding only a few cents per million events (see the case study below).
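
Here is a quick Python sketch of that number, using the $0.015 per shard-hour and the PUT payload unit price from the case study; storage within the default 24-hour retention window is included in the shard-hour price.

```python
# Monthly Kinesis cost for the four single-shard streams, using the
# per-shard-hour and PUT payload unit prices from the case study below.

HOURS_PER_MONTH = 730
SHARD_HOURLY = 0.015           # USD per shard-hour (24h retention included)
PUT_UNIT_PRICE = 0.000000014   # USD per 25 KB PUT payload unit

def kinesis_monthly_cost(shards: int = 4, millions_of_put_units: float = 2) -> float:
    shard_cost = shards * SHARD_HOURLY * HOURS_PER_MONTH
    put_cost = millions_of_put_units * 1_000_000 * PUT_UNIT_PRICE
    return shard_cost + put_cost

# 4 shards and ~2M PUT payload units (1M raw + 1M enriched per 1M events):
print(f"{kinesis_monthly_cost():.2f} USD")   # ~43.83 USD
```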

Case study: the cost of running Snowplow on AWS with Postgres

As you’ve seen on our Snowplow-based customer data infrastructure page, it’s designed for relatively high-traffic businesses. We’ve calculated that Snowplow really starts to make sense at around 10-15 million events per month. Below is a case study based on this pricing, assuming a configuration sized for around 15 million events per month; the estimates themselves, however, are given per 1 million processed events.

Cost per 1 million events

  • 2 load balancers (pipeline, Iglu) x 0.0225 USD per hour x 730 hours in a month = 32.86 USD
  • S3: Erring a bit on the high side, 1 million events translate to roughly 3,500 PUT requests and 0.5 GB of data = 0.03 USD
  • 45 LCUs x 0.008 LCU price per hour x 2 (iglu server + collector) = 3.6 USD
  • 8 t3.micro EC2 instances x 0.0104 USD On-demand hourly cost x 730 hours in a month = 60.74 USD
  • Kinesis:
    • 2,920.00 shard hours (4 shards) x 0.015 USD = 43.80 USD
    • 2 million PUT Payload Units (1 million for raw and 1 million for enriched) x 0.000000014 USD = 0.03 USD
  • DynamoDB:
    • The first 25 GB of data stored per month is free; 0.25 USD per GB-month thereafter.
    • Provisioned capacity:
      • 7 tables x 1 provisioned WCU x 730 hours in a month x 0.00065 USD = 3.32 USD
      • 7 tables x 1 provisioned RCU x 730 hours in a month x 0.00013 USD = 0.66 USD
  • RDS (iglu server) = 28.58 USD
  • RDS (pipeline) = 220.85 USD

Total estimate = ~395 USD per month.

Cost estimates will vary depending on the actual volume of events. Increases are expected in the following areas (see the estimator sketch after this list):

  • S3 storage (multiply the per-million estimate above by your monthly event volume in millions)
  • ALB LCUs (same scaling)
  • Kinesis PUT Payload Units (same scaling)
  • DynamoDB provisioned capacity (depending on auto-scaling, this can range from 4 to 40 USD, but the baseline of 1 WCU/RCU per table should hold until roughly 15 million events per month)
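
To make this easier to adapt to your own volume, below is a minimal end-to-end estimator in Python that sums the fixed line items from the case study and scales the volume-dependent ones per million events; all the figures are the same rough assumptions used above, not a quote.

```python
# Rough monthly cost estimator for this AWS + Postgres Snowplow setup.
# Fixed line items are taken from the case study above; volume-dependent
# items scale with monthly event volume (in millions of events).

def snowplow_monthly_cost(millions_of_events: float) -> float:
    fixed = (
        32.86     # 2 ALBs, hourly charge
        + 60.74   # 8 x t3.micro EC2 instances
        + 43.80   # 4 Kinesis shards
        + 28.58   # RDS (Iglu)
        + 220.85  # RDS (pipeline warehouse)
        + 3.32    # DynamoDB WCUs (baseline, holds up to ~15M events/month)
        + 0.66    # DynamoDB RCUs (baseline)
    )
    per_million = (
        0.03      # S3 storage and PUT requests
        + 0.03    # Kinesis PUT payload units (raw + enriched)
        + 3.6     # ALB LCUs (Iglu + collector)
    )
    return fixed + per_million * millions_of_events

print(f"{snowplow_monthly_cost(1):.0f} USD")    # ~394 USD, matching the ~395 estimate above
print(f"{snowplow_monthly_cost(15):.0f} USD")   # ~446 USD at 15M events/month
```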

Closing remarks

Everyone needs to determine for themselves whether Snowplow is expensive, based on the business processes it will serve. In our opinion, if you handle at least 10-15 million events per month and have a well-planned customer data infrastructure with targeted real-time event pipelines, Snowplow is cost-effective. Our article on lessons learned from building customer data infrastructures discusses the importance of planning your CDP in detail.

At Datomni, to ensure high ROI from running on Snowplow, we’ve integrated the Snowplow pipeline into a comprehensive customer data infrastructure that offers more than just data processing. Our custom-built CDI can replace multiple SaaS tools for data processing and server-side tagging while providing full access to raw data and running everything in your private cloud.

For smaller companies with lower event volumes, we recommend more nimble solutions such as our Omni CDI, Segment, or GA4-based CDPs.

Photo attribution

As usual, the featured image of the article is a photograph that corresponds with the article’s topic. This time, the shoutout goes to Adrien Converse via Unsplash.