🚀 Deployment Guide

Prerequisites

System Requirements

  • OS: Linux (Ubuntu 20.04+), macOS, or Windows with WSL2
  • RAM: Minimum 8GB, Recommended 16GB+
  • CPU: Minimum 4 cores, Recommended 8+ cores
  • Storage: Minimum 50GB free space
  • Network: Internet connection for Docker image downloads
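
The minimums above can be checked before installing anything. The following is a minimal sketch, not part of the project: check_min is a hypothetical helper name, and the example calls assume a Linux host with GNU coreutils and /proc/meminfo.

```shell
# check_min LABEL ACTUAL MINIMUM
# Prints OK/LOW for one integer resource value and returns non-zero when
# ACTUAL is below MINIMUM. (check_min is a hypothetical helper, not a project script.)
check_min() {
  if [ "$2" -ge "$3" ]; then
    echo "OK   $1: $2 (minimum $3)"
  else
    echo "LOW  $1: $2 (minimum $3)"
    return 1
  fi
}

# Example calls (Linux only; nproc, /proc/meminfo and GNU df are assumptions):
#   check_min "CPU cores" "$(nproc)" 4
#   check_min "RAM (GB)" "$(awk '/MemTotal/ {print int($2/1024/1024 + 0.5)}' /proc/meminfo)" 8
#   check_min "Free disk (GB)" "$(df -BG --output=avail . | tail -n 1 | tr -dc '0-9')" 50
```

The RAM example rounds to the nearest gigabyte because a machine sold as 8GB usually reports slightly less than 8 GiB of usable memory.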

Software Requirements

  • Docker: Version 20.10+
  • Docker Compose: Version 2.0+ (this guide uses the docker-compose command; with the Compose plugin, run docker compose instead)
  • Git: For cloning the repository
  • curl: For health checks
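
One way to confirm the version requirements is to compare dotted version strings with sort -V. This is a sketch under the assumption that a sort with version ordering (GNU coreutils) is available; version_ge is a hypothetical helper name.

```shell
# version_ge ACTUAL REQUIRED
# Succeeds when ACTUAL >= REQUIRED, comparing dotted versions with `sort -V`.
# (version_ge is a hypothetical helper; `sort -V` is assumed to be available.)
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example calls:
#   version_ge "$(docker version --format '{{.Server.Version}}')" 20.10 || echo "Docker too old"
#   version_ge "$(docker-compose version --short)" 2.0 || echo "Compose too old"
```

The function places the required version first and checks that it sorts lowest, so equal versions pass.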

Installation Commands

Ubuntu/Debian

# Update package list
sudo apt update

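# Install Docker and the Compose plugin
# (docker-compose-plugin is distributed through Docker's apt repository;
# add that repository first if apt cannot locate the package)
sudo apt install -y docker.io docker-compose-plugin

# Install Git and curl
sudo apt install -y git curl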

# Start Docker service
sudo systemctl start docker
sudo systemctl enable docker

# Add user to docker group (log out and back in, or run "newgrp docker", for this to take effect)
sudo usermod -aG docker $USER

macOS

# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Docker Desktop
brew install --cask docker

# Install Git and curl
brew install git curl

Windows (WSL2)

# Install Docker Desktop for Windows
# Download from: https://www.docker.com/products/docker-desktop

# Install Git
winget install Git.Git

# Install curl
winget install cURL.cURL

Quick Start Deployment

1. Clone Repository

git clone <repository-url>
cd iceberg_data_engineering

2. Environment Setup

# Create environment file (optional)
cp .env.example .env

# Edit environment variables if needed
nano .env

3. Start Services

# Start all services
docker-compose up -d

# Check service status
docker-compose ps

4. Verify Deployment

# Check service health
curl -f http://localhost:8080/v1/info  # Trino
curl -f http://localhost:3030/health   # Dagster
curl -f http://localhost:9000/minio/health/live  # MinIO
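
The three checks above can be wrapped in a loop that reports every service and fails if any endpoint is down. This is a minimal sketch: check_endpoints is a hypothetical helper, and the probe command is passed in as an argument so that curl can be swapped for another tool.

```shell
# check_endpoints PROBE [ARGS...]
# Reads "<name> <url>" lines from stdin, runs PROBE ARGS <url> for each one,
# prints OK/FAIL per service, and returns non-zero if any probe failed.
# (check_endpoints is a hypothetical helper, not a project script.)
check_endpoints() {
  failed=0
  while read -r name url; do
    # </dev/null keeps the probe from consuming the remaining input lines.
    if "$@" "$url" >/dev/null 2>&1 </dev/null; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}

# Usage with the endpoints from this step:
#   check_endpoints curl -fsS <<'EOF'
#   trino   http://localhost:8080/v1/info
#   dagster http://localhost:3030/health
#   minio   http://localhost:9000/minio/health/live
#   EOF
```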

Detailed Deployment Steps

Step 1: Infrastructure Services

PostgreSQL Database

# Start PostgreSQL
docker-compose up -d postgres

# Wait for database initialization (press Ctrl+C to stop following the logs)
docker-compose logs -f postgres

# Verify connection
docker exec -it postgres psql -U postgres -c "SELECT version();"

MinIO Object Storage

# Start MinIO
docker-compose up -d minio minio-setup

# Wait for bucket creation
docker-compose logs -f minio-setup

# Verify buckets
docker exec -it minio-setup-bucket /usr/bin/mc ls minio/

Step 2: Metadata Services

Hive Metastore

# Start Hive Metastore
docker-compose up -d hive-metastore

# Wait for initialization
docker-compose logs -f hive-metastore

# Verify metastore
docker exec -it hive-metastore hive --service metastore --version

Apache Ranger

# Start Ranger
docker-compose up -d ranger

# Wait for initialization (may take 5-10 minutes)
docker-compose logs -f ranger

# Verify Ranger UI
curl -f http://localhost:6080
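
Because Ranger can take several minutes to initialize, a single curl call right after startup is likely to fail. A retry loop is one way to handle this; the sketch below assumes nothing beyond a POSIX shell, and wait_for is a hypothetical helper name.

```shell
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# (wait_for is a hypothetical helper; its variables are global, not local.)
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: poll the Ranger UI every 15 seconds, for roughly 10 minutes at most.
#   wait_for 40 15 curl -fsS http://localhost:6080 || echo "Ranger did not come up"
```

The same helper fits the other slow-starting services in this guide, such as Superset on port 8088.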

Step 3: Processing Services

Apache Spark

# Start Spark cluster
docker-compose up -d spark-master spark-worker spark-worker-b spark-worker-c

# Verify Spark cluster
docker exec -it spark-driver spark-submit --version

# Check Spark UI
curl -f http://localhost:8081

Trino Query Engine

# Start Trino
docker-compose up -d trino

# Wait for initialization
docker-compose logs -f trino

# Verify Trino
curl -f http://localhost:8080/v1/info
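
A successful response from /v1/info does not by itself mean Trino is ready for queries: the JSON body normally carries a starting flag that stays true while the server initializes. The sketch below checks that flag with grep; trino_ready is a hypothetical helper, and the exact response shape should be confirmed against the deployed Trino version.

```shell
# trino_ready: reads a /v1/info JSON body on stdin and succeeds only when
# it contains "starting": false. (Hypothetical helper; assumes the response
# includes a "starting" field.)
trino_ready() {
  grep -Eq '"starting"[[:space:]]*:[[:space:]]*false'
}

# Example:
#   curl -fsS http://localhost:8080/v1/info | trino_ready && echo "Trino is ready"
```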

Step 4: Orchestration and Analytics

Dagster Orchestration

# Start Dagster
docker-compose up -d dagster

# Wait for initialization
docker-compose logs -f dagster

# Verify Dagster UI
curl -f http://localhost:3030

Apache Superset

# Start Superset
docker-compose up -d superset-bi

# Wait for initialization (may take 5-10 minutes)
docker-compose logs -f superset-bi

# Verify Superset UI
curl -f http://localhost:8088

Hue SQL Interface

# Start Hue
docker-compose up -d hue

# Wait for initialization
docker-compose logs -f hue

# Verify Hue UI
curl -f http://localhost:8888

Configuration

Environment Variables

Core Configuration

# MinIO Configuration
MINIO_ENDPOINT=http://minio:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
S3_BUCKET_LIST=datalake,logger,warehouse

# Database Configuration
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgrespassword
HIVE_DB_USER=hiveuser
HIVE_DB_PASSWORD=hivepassword
RANGER_DB_PASS=rangerpassword
HUE_DB_PASS=huepassword

# Warehouse Configuration
WAREHOUSE_DIR=s3a://warehouse/
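
S3_BUCKET_LIST is a comma-separated value, so a script that needs to act on each bucket has to split it first. A minimal sketch follows; list_buckets is a hypothetical helper, and the example loop reuses the mc command from the MinIO step on the assumption that the minio-setup-bucket container is still running.

```shell
# list_buckets CSV: prints each comma-separated bucket name on its own line.
# (list_buckets is a hypothetical helper, not a project script.)
list_buckets() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Example: report any configured bucket that mc cannot list.
#   for b in $(list_buckets "$S3_BUCKET_LIST"); do
#     docker exec minio-setup-bucket /usr/bin/mc ls "minio/$b" >/dev/null || echo "missing: $b"
#   done
```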

Service-Specific Configuration

# Spark Configuration
SPARK_MODE=master
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=1G
SPARK_EXECUTOR_MEMORY=1G
SPARK_DRIVER_MEMORY=1G

# Ranger Configuration
RANGER_SERVICE_NAME=hivedev
RANGER_POLICY_REST_URL=http://ranger:6080
RANGER_POLICY_CACHE_DIR=/tmp/ranger
RANGER_POLICY_POLL_INTERVAL=60000
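
Before starting the stack, it can help to confirm that the .env file actually defines the variables listed above. The following sketch reports names that are absent or empty; missing_vars is a hypothetical helper, and which variables are truly required depends on the project's docker-compose.yml.

```shell
# missing_vars FILE NAME...
# Prints each NAME that has no non-empty NAME=value line in FILE and returns
# non-zero if any are missing. (missing_vars is a hypothetical helper.)
missing_vars() {
  file=$1
  shift
  rc=0
  for name in "$@"; do
    if ! grep -Eq "^${name}=.+" "$file"; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}

# Example:
#   missing_vars .env MINIO_ENDPOINT MINIO_ACCESS_KEY MINIO_SECRET_KEY \
#     POSTGRES_USER POSTGRES_PASSWORD WAREHOUSE_DIR
```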

Service Ports

Service          Port   Protocol   Purpose
Trino            8080   HTTP       SQL query interface
Dagster          3030   HTTP       Workflow orchestration
Superset         8088   HTTP       BI dashboard
MinIO            9000   HTTP       Object storage API
MinIO Console    9001   HTTP       Storage management
Hue              8888   HTTP       SQL query interface
Ranger           6080   HTTP       Security management
Spark Master     8081   HTTP       Spark cluster UI
Spark Worker 1   8082   HTTP       Worker UI
Spark Worker 2   8083   HTTP       Worker UI
Spark Worker 3   8084   HTTP       Worker UI
PostgreSQL       5432   TCP        Database
Hive Metastore   9083   TCP        Metadata service

Service Management

Starting Services

# Start all services
docker-compose up -d

# Start specific service
docker-compose up -d trino

# Start in the foreground with logs attached
docker-compose up trino

Stopping Services

# Stop all services
docker-compose down

# Stop specific service
docker-compose stop trino

# Stop and remove volumes
docker-compose down -v

Restarting Services

# Restart all services
docker-compose restart

# Restart specific service
docker-compose restart trino

# Force recreate service
docker-compose up -d --force-recreate trino

Monitoring Services

# View service status
docker-compose ps

# View service logs
docker-compose logs -f trino

# View resource usage
docker stats

# Check service health
docker-compose exec trino curl -f http://localhost:8080/v1/info

Data Initialization

1. Create Initial Tables

# Connect to Trino
docker exec -it trino trino --server http://localhost:8080

-- Create schemas (run these statements at the trino> prompt)
CREATE SCHEMA IF NOT EXISTS iceberg.asset_property;
CREATE SCHEMA IF NOT EXISTS iceberg.flight_radar;
CREATE SCHEMA IF NOT EXISTS iceberg.ecommerce;

2. Run Initial Data Pipeline

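# Trigger Dagster workflows
# (replace the <...> placeholders with the code location and repository that define the job)
curl -X POST http://localhost:3030/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { launchRun(executionParams: {selector: {repositoryLocationName: \"<location-name>\", repositoryName: \"<repository-name>\", pipelineName: \"asset_property_pipeline\"}}) { __typename ... on LaunchRunSuccess { run { runId } } } }"}'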

3. Verify Data

# Check MinIO Bronze layer
docker exec -it minio-setup-bucket /usr/bin/mc ls minio/datalake/bronze/

# Check MinIO Prepare layer
docker exec -it minio-setup-bucket /usr/bin/mc ls minio/datalake/prepare/

# Check Delta tables
docker exec -it trino trino --server http://localhost:8080 --execute "SHOW TABLES FROM spark_catalog.flight_radar_prepared;"

Troubleshooting

Common Issues

Service Won't Start

# Check logs
docker-compose logs <service-name>

# Check resource usage
docker stats

# Restart service
docker-compose restart <service-name>

Database Connection Issues

# Check PostgreSQL status
docker-compose exec postgres pg_isready -U postgres

# Check database exists
docker-compose exec postgres psql -U postgres -c "\l"

# Reset database (WARNING: down -v deletes the volumes of every service, not only PostgreSQL)
docker-compose down -v
docker-compose up -d postgres init-postgres

MinIO Connection Issues

# Check MinIO status
docker-compose exec minio mc admin info minio

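# Check Bronze layer buckets (minio-setup-bucket is the container name, so use docker exec)
docker exec -it minio-setup-bucket /usr/bin/mc ls minio/datalake/bronze/

# Check Prepare layer buckets
docker exec -it minio-setup-bucket /usr/bin/mc ls minio/datalake/prepare/

# Reset MinIO (WARNING: down -v deletes the volumes of every service, not only MinIO)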
docker-compose down -v
docker-compose up -d minio minio-setup

Spark Cluster Issues

# Check the Spark installation in the driver container
docker exec -it spark-driver spark-submit --version

# Check the Spark installation on a worker (spark-worker is the service name used at startup)
docker-compose exec spark-worker spark-submit --version

# Restart Spark cluster
docker-compose restart spark-master spark-worker spark-worker-b spark-worker-c

Performance Tuning

Memory Configuration

# Increase Spark memory
SPARK_WORKER_MEMORY=2G
SPARK_EXECUTOR_MEMORY=2G
SPARK_DRIVER_MEMORY=2G

# Increase Trino memory
TRINO_JVM_HEAP_SIZE=2G

Storage Configuration

# Increase MinIO storage
MINIO_STORAGE_SIZE=100G

# Configure MinIO caching
MINIO_CACHE_SIZE=1G

Log Management

# View all logs
docker-compose logs

# View specific service logs
docker-compose logs -f trino

# Save logs to file
docker-compose logs > deployment.log

# Remove stopped containers (and their logs), unused networks, dangling images, and build cache
docker system prune -f

Production Deployment

Security Considerations

# Change default passwords
POSTGRES_PASSWORD=<secure-password>
MINIO_ACCESS_KEY=<secure-access-key>
MINIO_SECRET_KEY=<secure-secret-key>

# Enable SSL/TLS
TRINO_HTTPS_ENABLED=true
SUPERSET_HTTPS_ENABLED=true
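
A quick scan can catch sample credentials that were never replaced. The sketch below looks for the default values shown in the Configuration section of this guide; find_default_secrets is a hypothetical helper, and the list of defaults is taken only from this document, so it may not cover every secret the project uses.

```shell
# find_default_secrets FILE
# Prints the names of variables in FILE that still carry one of the sample
# values shown in this guide; returns non-zero if any are found.
# (find_default_secrets is a hypothetical helper, not a project script.)
find_default_secrets() {
  found=$(grep -E '=(minioadmin|postgrespassword|hivepassword|rangerpassword|huepassword)$' "$1" | cut -d= -f1)
  if [ -n "$found" ]; then
    printf '%s\n' "$found"
    return 1
  fi
}

# Example:
#   find_default_secrets .env || echo "Replace the defaults listed above before going to production"
```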

Scaling Configuration

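# Add more Spark workers (spark-worker-d and spark-worker-e must be defined in docker-compose.yml)
docker-compose up -d spark-worker-d spark-worker-e

# Scale Trino workers (trino-worker-1 and trino-worker-2 must be defined in docker-compose.yml)
docker-compose up -d trino-worker-1 trino-worker-2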

# Increase MinIO storage
MINIO_STORAGE_SIZE=1T

Backup Strategy

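# Backup PostgreSQL, all databases (-T disables the pseudo-TTY so the redirected dump stays clean)
docker-compose exec -T postgres pg_dumpall -U postgres > backup.sql

# Backup MinIO data (/backup is a path inside the container; mount a host volume there to keep the copy)
docker exec minio-setup-bucket /usr/bin/mc mirror minio/ /backup/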

# Backup configurations
tar -czf config-backup.tar.gz docker-compose.yml .env trino/etc/ hive/conf/
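
The commands above write to fixed file names, so each run overwrites the previous backup. One option is to add a timestamp to the file name; the sketch below does that with date, and backup_path is a hypothetical helper name.

```shell
# backup_path DIR NAME EXT
# Prints DIR/NAME-YYYYmmdd-HHMMSS.EXT using the current UTC time.
# (backup_path is a hypothetical helper, not a project script.)
backup_path() {
  printf '%s/%s-%s.%s\n' "$1" "$2" "$(date -u +%Y%m%d-%H%M%S)" "$3"
}

# Example (-T disables pseudo-TTY allocation so the redirected dump stays clean):
#   docker-compose exec -T postgres pg_dump -U postgres > "$(backup_path . postgres sql)"
#   tar -czf "$(backup_path . config tar.gz)" docker-compose.yml .env trino/etc/ hive/conf/
```

UTC is used so that file names sort chronologically regardless of the host's time zone.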

Maintenance

Regular Tasks

# Update Docker images
docker-compose pull
docker-compose up -d

# Clean up unused resources
docker system prune -f

# Monitor disk usage
docker system df

# Check service health
./scripts/health-check.sh

Update Procedures

# Stop services
docker-compose down

# Pull latest images
docker-compose pull

# Start services
docker-compose up -d

# Verify deployment
./scripts/verify-deployment.sh

Last update: October 3, 2025
Created: October 3, 2025