Installation Guide

This guide provides detailed installation instructions for the Iceberg Data Engineering Platform.

Installation Methods

Method 1: Docker Compose

Docker Compose is the easiest and most reliable way to get started.

Step 1: Clone the Repository

git clone <repository-url>
cd iceberg_data_engineering

Step 2: Configure Environment (Optional)

Create a .env file to customize configuration:

cp .env.example .env

Edit the .env file with your preferred settings:

# Database Configuration
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgrespassword
HIVE_DB_USER=hiveuser
HIVE_DB_PASSWORD=hivepassword

# MinIO Configuration
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin

# Superset Configuration
SUPERSET_ADMIN_USERNAME=admin
SUPERSET_ADMIN_PASSWORD=admin
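
The sample values above are convenient for a local trial but should not survive into a shared environment. As a minimal sketch, the helper below scans an env file and reports which password variables still carry the sample defaults shown in this guide. The function name find_default_credentials is hypothetical and not part of the platform.

```shell
# Print every password variable in an env file that still has the sample
# default listed in this guide. The function name is illustrative only.
find_default_credentials() {
  env_file=$1
  for pair in \
    POSTGRES_PASSWORD=postgrespassword \
    HIVE_DB_PASSWORD=hivepassword \
    MINIO_ROOT_PASSWORD=minioadmin \
    SUPERSET_ADMIN_PASSWORD=admin
  do
    # -x matches the whole line, -F treats the pair as a literal string
    if grep -qxF "$pair" "$env_file"; then
      echo "${pair%%=*}"
    fi
  done
}

# Usage: find_default_credentials .env
```

An empty result means none of the four passwords matches its sample value.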

Step 3: Start the Platform

# Start all services
docker-compose up -d

# Check service status
docker-compose ps

# View logs
docker-compose logs -f

Step 4: Verify Installation

Wait for all services to be healthy (this may take a few minutes):

# Check health status
docker-compose ps

# Test Trino connection
docker exec -it trino trino --server http://localhost:8080 --execute "SELECT 1"

# Test MinIO connection
docker exec -it minio-setup-bucket /usr/bin/mc ls minio/
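
Because the services can take a few minutes to become healthy, a check run too early may fail even though nothing is wrong. One way to handle this, sketched here under the assumption that a simple fixed-delay retry is good enough, is to wrap the check in a small retry helper. The retry function is hypothetical and not shipped with the platform.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage (the Trino check from this step, without -it so it also works in scripts):
# retry 30 10 docker exec trino trino --server http://localhost:8080 --execute "SELECT 1"
```

The attempt count and delay in the usage line are arbitrary starting points, not tested recommendations.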

Method 2: Manual Installation

This method is for advanced users who want to customize the installation.

Prerequisites

  • Python 3.8+
  • Java 11+
  • Node.js 16+
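
The minimum versions above can be checked before installing anything else. The following is a minimal sketch of a version comparison in plain shell; version_ge is a hypothetical helper and it assumes purely numeric, dot-separated version strings.

```shell
# version_ge ACTUAL REQUIRED: succeeds when ACTUAL >= REQUIRED.
# Compares dot-separated numeric fields left to right; missing fields count as 0.
# Assumes purely numeric fields (for example 3.10.2, not 3.12.0rc1).
version_ge() {
  actual=$1
  required=$2
  while [ -n "$actual" ] || [ -n "$required" ]; do
    a=${actual%%.*}
    r=${required%%.*}
    [ "${a:-0}" -gt "${r:-0}" ] && return 0
    [ "${a:-0}" -lt "${r:-0}" ] && return 1
    # Drop the field just compared; an empty string ends that side
    case $actual in *.*) actual=${actual#*.} ;; *) actual= ;; esac
    case $required in *.*) required=${required#*.} ;; *) required= ;; esac
  done
  return 0
}

# Usage:
# version_ge "$(python3 -c 'import platform; print(platform.python_version())')" 3.8
# version_ge "$(node --version | sed 's/^v//')" 16
```

Java prints its version in a less regular format, so it is left out of the usage lines here.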

Step 1: Install Core Services

# Install PostgreSQL
sudo apt-get install postgresql postgresql-contrib

# Install Java
sudo apt-get install openjdk-11-jdk

# Install Python dependencies
pip install -r requirements.txt

Step 2: Configure Services

# Set up PostgreSQL databases
sudo -u postgres psql -c "CREATE DATABASE hive_metastore;"
sudo -u postgres psql -c "CREATE DATABASE ranger;"
sudo -u postgres psql -c "CREATE DATABASE hue;"

# Configure Hive Metastore
cp hive/conf/metastore-site.xml.template hive/conf/metastore-site.xml
# Edit metastore-site.xml with your configuration

Step 3: Start Services

# Start PostgreSQL
sudo systemctl start postgresql

# Start Hive Metastore
./hive/bin/hive --service metastore &

# Start Trino
./trino/bin/launcher start

# Start Dagster
dagster dev -w workspace.yaml

Configuration

Environment Variables

Variable                  Description              Default
------------------------  -----------------------  ----------------
POSTGRES_USER             PostgreSQL username      postgres
POSTGRES_PASSWORD         PostgreSQL password      postgrespassword
MINIO_ROOT_USER           MinIO admin username     minioadmin
MINIO_ROOT_PASSWORD       MinIO admin password     minioadmin
SUPERSET_ADMIN_USERNAME   Superset admin username  admin
SUPERSET_ADMIN_PASSWORD   Superset admin password  admin

Service Configuration

Trino Configuration

Edit trino/etc/config.properties:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
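# Newer Trino releases no longer accept query.max-total-memory-per-node; remove the
# next line if Trino fails to start and reports it as an unused property.
query.max-total-memory-per-node=2GB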
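
After editing, it can help to confirm which value a key actually has before restarting Trino. The sketch below reads a single key from a key=value properties file such as trino/etc/config.properties. The get_prop helper is hypothetical, and it assumes keys are written without surrounding spaces, as in the listing above.

```shell
# get_prop FILE KEY: print the value of KEY from a key=value properties file.
# Prints nothing when the key is absent. The helper name is illustrative only.
get_prop() {
  # Skip comment lines, match the exact key, print everything after the first "="
  awk -F= -v key="$2" '
    /^[ \t]*#/ { next }
    $1 == key { sub(/^[^=]*=/, ""); print; exit }
  ' "$1"
}

# Usage: get_prop trino/etc/config.properties http-server.http.port
```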

Spark Configuration

Edit spark/conf/spark-defaults.conf:

spark.master=spark://spark-master:7077
spark.executor.memory=1g
spark.driver.memory=1g
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true

Post-Installation Setup

1. Initialize Data Sources

# Run initial data ingestion
docker-compose exec dagster dagster job execute -j asset_property_pipeline
docker-compose exec dagster dagster job execute -j flight_radar_pipeline
docker-compose exec dagster dagster job execute -j ecommerce_pipeline
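
When the three jobs are run from a script, it is usually preferable to stop at the first failure rather than continue with later pipelines. The following sketch does that with the job names listed above. The run_jobs function and the RUN_JOB variable are hypothetical names introduced for illustration.

```shell
# Command prefix used to launch one job; override RUN_JOB to change or dry-run it.
# -T disables TTY allocation so the command also works from scripts and CI.
RUN_JOB=${RUN_JOB:-"docker-compose exec -T dagster dagster job execute -j"}

# run_jobs JOB...: run each job in order and stop at the first failure.
run_jobs() {
  for job in "$@"; do
    echo "running $job"
    # RUN_JOB is intentionally unquoted so it splits into command and arguments
    $RUN_JOB "$job" || { echo "job failed: $job" >&2; return 1; }
  done
}

# Usage:
# run_jobs asset_property_pipeline flight_radar_pipeline ecommerce_pipeline
```

Setting RUN_JOB=echo before calling run_jobs prints the job names without starting anything, which is a cheap way to check the order.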

2. Set Up Superset Dashboards

# Import dashboards
docker-compose exec superset-bi superset import-dashboards -p /app/ecommerce_dashboard.zip

3. Configure Security

# Set up Ranger policies
docker-compose exec ranger ranger-admin setup

Verification

Test Data Pipeline

# Check data ingestion
docker-compose exec dagster dagster asset materialize -a asset_property_bronze

# Verify data in MinIO
docker-compose exec minio-setup-bucket /usr/bin/mc ls minio/datalake/bronze/

# Test Trino queries
docker-compose exec trino trino --execute "SHOW CATALOGS"

Test Analytics

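# If Trino reports that a schema must be specified, add --schema <schema>
# to the commands below or qualify the table name with its schema.

# Query asset property data
docker-compose exec trino trino --catalog asset_property --execute "SELECT COUNT(*) FROM asset_property_sales"

# Query flight radar data
docker-compose exec trino trino --catalog flight_radar --execute "SELECT COUNT(*) FROM airlines"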

Troubleshooting

Common Issues

  • Services not starting: Check the Docker logs and ensure all prerequisites are met.
  • Database connection errors: Verify that PostgreSQL is running and accessible.
  • Port conflicts: Check that the required ports are available.
  • Memory issues: Increase the Docker memory allocation.
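
For the port conflict case, a quick check is to compare the ports the platform needs against the ports already in use on the host. The sketch below takes the listening addresses on standard input so that it does not depend on one particular tool. The conflicting_ports function is hypothetical, and the usage line assumes a Linux host with the ss utility.

```shell
# conflicting_ports PORT...: read "address:port" lines on stdin and print
# each requested PORT that is already listed there.
conflicting_ports() {
  # Keep only the text after the last colon, i.e. the port number
  listening=$(sed 's/.*://' | sort -u)
  for port in "$@"; do
    if printf '%s\n' "$listening" | grep -qxF "$port"; then
      echo "$port"
    fi
  done
}

# Usage (8080 and 8000 are the ports mentioned in this guide):
# ss -ltnH | awk '{print $4}' | conflicting_ports 8080 8000
```

Any port printed is already taken by another process and needs to be freed or remapped before starting the platform.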

Logs and Debugging

# View all logs
docker-compose logs

# View specific service logs
docker-compose logs trino
docker-compose logs dagster
docker-compose logs spark-master

# Debug specific issues
docker-compose exec trino trino --debug
docker-compose exec dagster dagster --debug

Performance Tuning

For production deployments:

  1. Increase memory allocation for Docker
  2. Configure JVM settings for Java services
  3. Optimize Spark configuration for your workload
  4. Set up monitoring with Prometheus and Grafana

Next Steps

After successful installation:

  1. Explore the Platform: Visit http://localhost:8000 for documentation
  2. Run Sample Pipelines: Follow the Data Domains guides
  3. Monitor Operations: Use Dagster Orchestration
  4. Set Up Production: Review the Deployment Guide

Last update: October 3, 2025
Created: October 3, 2025