Process Snowplow Data Using Databricks on AWS
A complete Terraform module for deploying a production-ready Snowplow pipeline with Databricks as your processing engine. It includes the Collector, the Enrichment process, and automated data loading into your Databricks workspace.
Pipeline Components
- Collector Application (Kinesis)
- Enrichment Process
- Databricks Loader
- S3 Pipeline Storage
- Automatic Error Recovery
- Kinesis Stream Management
Includes VPC configuration, load balancers, EC2 instances, and all necessary AWS services for a production-ready Snowplow pipeline.
Snowplow Pipeline Challenges
Common obstacles when setting up Snowplow with Databricks processing
Complex Pipeline
Setting up Snowplow collectors and the enrichment process from scratch is complex
Impact: Weeks of pipeline architecture planning
Solution: Pre-configured Snowplow components
Data Integration
Connecting Snowplow to Databricks requires expertise
Impact: Data loading gaps and inconsistencies
Solution: Automated Databricks loader setup
Stream Processing
Managing Kinesis streams and throughput
Impact: Data processing delays and bottlenecks
Solution: Optimized stream configuration
Error Handling
Lost events and failed enrichments
Impact: Missing or corrupted analytics data
Solution: Built-in error recovery flows
Key Features
Production-ready Snowplow pipeline with Databricks integration
Complete Pipeline
End-to-end Snowplow pipeline with Databricks integration.
- Collector Application
- Enrichment Process
- Databricks Loader
- S3 Pipeline Storage
AWS Infrastructure
Production-ready AWS components with optimal configuration.
- Kinesis Streams
- Load Balancer (ALB)
- EC2 Instances
- VPC Integration
Built-in Security
Enterprise-grade security and access controls.
- IAM Roles & Policies
- Private Subnets
- Security Groups
- SSL/TLS Support
Data Processing
Optimized data flow with error handling and monitoring (a stream-sizing sketch follows this list).
- Real-time Processing
- Error Recovery
- Bad Event Handling
- Kinesis Throughput Control
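Kinesis throughput ultimately comes down to shard sizing. The sketch below is illustrative only: the module provisions and names its streams internally, so this resource and its values are hypothetical and simply show the kind of sizing the module manages for you.

# Illustrative only: the module creates its own streams; names and values here are hypothetical.
resource "aws_kinesis_stream" "raw_events_example" {
  name             = "snowplow-raw-example"
  shard_count      = 2  # each shard accepts roughly 1 MB/s or 1,000 records/s of writes
  retention_period = 24 # hours of replay window after a downstream failure

  tags = {
    Environment = "production"
  }
}

Raising the shard count is the usual lever when the collector or enricher starts hitting write throttling.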
Technical Architecture
Core components of the Snowplow Databricks pipeline infrastructure; illustrative Terraform sketches follow the component and security lists
Pipeline Components
- Snowplow Collector (Scala Stream)
- Enrichment Process (Stream)
- Kinesis Data Streams
- S3 Raw/Enriched Buckets
- Databricks Loader
- Bad Events Storage
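Enriched and failed ("bad") events are staged in S3 before the loader moves them into Databricks. The placeholder buckets below only illustrate that storage layer; the module creates and names its own.

# Placeholder buckets for illustration; the module provisions its own storage.
resource "aws_s3_bucket" "enriched_events_example" {
  bucket = "acme-snowplow-enriched-example"
}

resource "aws_s3_bucket" "bad_events_example" {
  bucket = "acme-snowplow-bad-events-example"
}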
AWS Resources
- Application Load Balancer
- Auto Scaling Groups
- EC2 Instances
- CloudWatch Monitoring
- VPC with Private Subnets
- Route53 DNS (Optional)
Security Controls
- IAM Service Roles
- Security Group Rules
- KMS Encryption Keys
- SSL Certificate
- Private Subnet Isolation
- Network ACLs
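As an example of the security-group pattern involved, collector instances typically accept traffic only from the load balancer. The rule below is a hypothetical sketch of that pattern, not the module's actual definitions; the port and both variables are placeholders.

# Hypothetical sketch: ALB-to-collector ingress only; not the module's real rules.
resource "aws_security_group_rule" "collector_from_alb_example" {
  type                     = "ingress"
  protocol                 = "tcp"
  from_port                = 8080                # example collector port
  to_port                  = 8080
  security_group_id        = var.collector_sg_id # placeholder variable
  source_security_group_id = var.alb_sg_id       # placeholder variable
}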
Installation Guide
Deploy your Snowplow pipeline with Databricks integration in minutes
1. Add Module
module "snowplow_pipeline" { source = "github.com/Datomni/terraform-aws-snowplow-databricks-pipeline" aws_region = var.aws_region environment = var.environment vpc_id = var.vpc_id private_subnet_ids = var.private_subnet_ids # Snowplow Collector collector_min_size = 1 collector_max_size = 2 # Enrichment enrichment_min_size = 1 enrichment_max_size = 2 }
Add this configuration to your main.tf file
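The loader also needs connection details for your Databricks workspace. The exact input names are defined by the module, so check its variables.tf; the commented lines below are a hypothetical sketch of the kind of settings to expect, not the module's actual interface.

# Hypothetical inputs, for illustration only -- the module's real variable names may differ.
# databricks_host       = var.databricks_host       # e.g. https://<workspace>.cloud.databricks.com
# databricks_http_path  = var.databricks_http_path  # SQL warehouse or cluster HTTP path
# databricks_auth_token = var.databricks_auth_token # access token -- keep out of version control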
2. Configure Variables
variable "aws_region" { type = string default = "us-west-2" } variable "environment" { type = string default = "production" } variable "vpc_id" { type = string description = "VPC ID where the pipeline will be deployed" } variable "private_subnet_ids" { type = list(string) description = "List of private subnet IDs for deployment" }
Define these required variables in your variables.tf file
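Rather than editing the defaults, you can supply environment-specific values in a terraform.tfvars file, which Terraform loads automatically. The IDs below are placeholders.

# terraform.tfvars -- placeholder values, replace with your own
aws_region         = "us-west-2"
environment        = "production"
vpc_id             = "vpc-0123456789abcdef0"
private_subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]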
3. Deploy
# Initialize Terraform and download the module
terraform init

# Review the infrastructure plan
terraform plan

# Deploy the pipeline
terraform apply -auto-approve
Execute these commands to deploy your Snowplow pipeline with Databricks integration
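Once the apply completes, the module's outputs expose the endpoints your trackers need; see its outputs.tf for the actual names. The output below uses a hypothetical collector_dns_name attribute purely to show how you would surface it from your root configuration.

# Hypothetical output name -- replace with the output the module actually exports.
output "snowplow_collector_endpoint" {
  description = "DNS name that Snowplow trackers should send events to"
  value       = module.snowplow_pipeline.collector_dns_name
}

Point your trackers at this endpoint (over HTTPS, once the SSL certificate and optional Route53 record are in place) and events will flow through enrichment into Databricks.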