Snowplow Analytics

Process Snowplow Data Using Databricks on AWS

Complete Terraform module to deploy a production-ready Snowplow pipeline with Databricks as your processing engine. Includes Collector, Enrichment, and automated data loading into your Databricks workspace.

Open Source (Apache 2.0) · Complete End-to-End · AWS Optimized

Pipeline Components

  • Collector Application (Kinesis)
  • Enrichment Process
  • Databricks Loader
  • S3 Pipeline Storage
  • Automatic Error Recovery
  • Kinesis Stream Management

Pre-configured Infrastructure

Includes VPC configuration, load balancers, EC2 instances, and all necessary AWS services for a production-ready Snowplow pipeline.

Snowplow Pipeline Challenges

Common obstacles when setting up Snowplow with Databricks processing

Complex Pipeline

Standing up Snowplow collectors and the enrichment process means wiring together multiple streaming components, each with its own configuration

Impact: Weeks of pipeline architecture planning

Solution: Pre-configured Snowplow components

Data Integration

Loading Snowplow's enriched events into Databricks requires knowledge of both platforms' data formats and loading workflows

Impact: Data loading gaps and inconsistencies

Solution: Automated Databricks loader setup

Stream Processing

Managing Kinesis streams and throughput

Impact: Data processing delays and bottlenecks

Solution: Optimized stream configuration

Error Handling

Lost events and failed enrichments

Impact: Missing or corrupted analytics data

Solution: Built-in error recovery flows

What's Included

Production-ready Snowplow pipeline with Databricks integration

Complete Pipeline

End-to-end Snowplow pipeline with Databricks integration.

  • Collector Application
  • Enrichment Process
  • Databricks Loader
  • S3 Pipeline Storage

AWS Infrastructure

Production-ready AWS components with optimal configuration.

  • Kinesis Streams
  • Load Balancer (ALB)
  • EC2 Instances
  • VPC Integration

Built-in Security

Enterprise-grade security and access controls.

  • IAM Roles & Policies
  • Private Subnets
  • Security Groups
  • SSL/TLS Support

Data Processing

Optimized data flow with error handling and monitoring.

  • Real-time Processing
  • Error Recovery
  • Bad Event Handling
  • Kinesis Throughput Control

Technical Architecture

Core components of the Snowplow Databricks pipeline infrastructure

Pipeline Components

  • Snowplow Scala Stream Collector
  • Enrichment Process (Stream)
  • Kinesis Data Streams
  • S3 Raw/Enriched Buckets
  • Databricks Loader
  • Bad Events Storage
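
To make the data flow concrete, here is a minimal sketch of the kind of Kinesis streams and S3 storage the module manages for you. The resource names, shard counts, and bucket name below are hypothetical examples, not the module's actual internals:

# Illustrative only – the module creates and wires these resources for you
resource "aws_kinesis_stream" "raw" {
  name             = "snowplow-raw-stream"      # hypothetical name
  shard_count      = 1
  retention_period = 24
}

resource "aws_kinesis_stream" "enriched" {
  name             = "snowplow-enriched-stream" # hypothetical name
  shard_count      = 1
  retention_period = 24
}

resource "aws_s3_bucket" "bad_events" {
  bucket = "snowplow-bad-events-example"        # hypothetical bucket name
}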

AWS Resources

  • Application Load Balancer
  • Auto Scaling Groups
  • EC2 Instances
  • CloudWatch Monitoring
  • VPC with Private Subnets
  • Route53 DNS (Optional)

Security Controls

  • IAM Service Roles
  • Security Group Rules
  • KMS Encryption Keys
  • SSL Certificate
  • Private Subnet Isolation
  • Network ACLs
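
As an illustration of the security model, these controls map to standard AWS resources like the ones sketched below. The role name, security group rules, and CIDR range are hypothetical examples rather than the module's exact definitions:

# Illustrative only – the module defines its own roles and rules
resource "aws_iam_role" "collector" {
  name = "snowplow-collector-role"              # hypothetical name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_security_group" "collector" {
  name   = "snowplow-collector-sg"              # hypothetical name
  vpc_id = "vpc-0123456789abcdef0"              # placeholder VPC ID

  ingress {
    description = "HTTPS from the load balancer"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]               # hypothetical VPC CIDR
  }
}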

Installation Guide

Deploy your Snowplow pipeline with Databricks integration in minutes

1. Add Module

module "snowplow_pipeline" {
  source = "github.com/Datomni/terraform-aws-snowplow-databricks-pipeline"
  
  aws_region         = var.aws_region
  environment        = var.environment
  vpc_id             = var.vpc_id
  private_subnet_ids = var.private_subnet_ids
  
  # Snowplow Collector
  collector_min_size = 1
  collector_max_size = 2
  
  # Enrichment
  enrichment_min_size = 1
  enrichment_max_size = 2
}

Add this configuration to your main.tf file
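
For repeatable deployments, you can pin the module source to a specific tag or commit using Terraform's ref query string. The tag below is a placeholder; use one that exists in the repository:

module "snowplow_pipeline" {
  # Same inputs as above, with the source pinned to a release tag
  # instead of the default branch ("v1.0.0" is a placeholder tag)
  source = "github.com/Datomni/terraform-aws-snowplow-databricks-pipeline?ref=v1.0.0"

  # ...
}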

2. Configure Variables

variable "aws_region" {
  type    = string
  default = "us-west-2"
}

variable "environment" {
  type    = string
  default = "production"
}

variable "vpc_id" {
  type = string
  description = "VPC ID where the pipeline will be deployed"
}

variable "private_subnet_ids" {
  type = list(string)
  description = "List of private subnet IDs for deployment"
}

Declare these input variables in your variables.tf file
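
A terraform.tfvars file is a convenient place to supply the values. The IDs below are placeholders; substitute your own VPC and subnet IDs:

# terraform.tfvars – example values only
aws_region         = "us-west-2"
environment        = "production"
vpc_id             = "vpc-0123456789abcdef0"
private_subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]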

3. Deploy

# Initialize Terraform and download the module
terraform init

# Review the infrastructure plan
terraform plan

# Deploy the pipeline (-auto-approve skips the interactive confirmation;
# drop the flag if you prefer to review and approve the apply manually)
terraform apply -auto-approve

Execute these commands to deploy your Snowplow pipeline with Databricks integration
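
Once the apply finishes, a quick sanity check helps confirm the pipeline is up. The endpoint placeholder below is an assumption; list the module's outputs to find the actual load balancer DNS name (the Snowplow Scala Stream Collector typically serves a /health endpoint):

# List the outputs exposed by the module (names vary by module version)
terraform output

# Check collector health through the load balancer; replace <collector-endpoint>
# with the DNS name from the outputs (use https if you configured TLS)
curl -i http://<collector-endpoint>/health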