Installation Guide¶
This guide provides detailed installation instructions for the Iceberg Data Engineering Platform.
Installation Methods¶
Method 1: Docker Compose (Recommended)¶
This is the easiest and most reliable method for getting started.
Step 1: Clone the Repository¶
Step 2: Configure Environment (Optional)¶
Create a .env file to customize configuration:
Edit the .env file with your preferred settings:
# Database Configuration
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgrespassword
HIVE_DB_USER=hiveuser
HIVE_DB_PASSWORD=hivepassword
# MinIO Configuration
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
# Superset Configuration
SUPERSET_ADMIN_USERNAME=admin
SUPERSET_ADMIN_PASSWORD=admin
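The values above are well-known defaults, so for anything beyond a local trial it may be worth generating random passwords instead. The following is a minimal sketch, assuming `base64`, `tr`, and `/dev/urandom` are available and that the platform reads exactly the variable names listed above. The `gen_password` helper and the `ENV_FILE` override are hypothetical names used only for illustration, and the script overwrites any existing file at that path.

```shell
# Target file; ENV_FILE is a hypothetical override that defaults to .env.
ENV_FILE="${ENV_FILE:-.env}"

# Print a random 24-character alphanumeric password.
gen_password() {
    head -c 48 /dev/urandom | base64 | LC_ALL=C tr -dc 'A-Za-z0-9' | head -c 24
}

# Write the variables from this guide, replacing the default passwords.
# Note: this overwrites any existing file at $ENV_FILE.
cat > "$ENV_FILE" <<EOF
POSTGRES_USER=postgres
POSTGRES_PASSWORD=$(gen_password)
HIVE_DB_USER=hiveuser
HIVE_DB_PASSWORD=$(gen_password)
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=$(gen_password)
SUPERSET_ADMIN_USERNAME=admin
SUPERSET_ADMIN_PASSWORD=$(gen_password)
EOF

# Restrict permissions, since the file now holds secrets.
chmod 600 "$ENV_FILE"
```

Passwords are restricted to letters and digits so they need no quoting in the `.env` file.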
Step 3: Start the Platform¶
# Start all services
docker-compose up -d
# Check service status
docker-compose ps
# View logs
docker-compose logs -f
Step 4: Verify Installation¶
Wait for all services to be healthy (this may take a few minutes):
# Check health status
docker-compose ps
# Test Trino connection
docker exec -it trino trino --server http://localhost:8080 --execute "SELECT 1"
# Test MinIO connection
docker exec -it minio-setup-bucket /usr/bin/mc ls minio/
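Because services can take a few minutes to become healthy, it can help to poll a check rather than rerun it by hand. The sketch below defines a generic retry helper; `wait_for` is a hypothetical name, not part of the platform, and the attempt count and delay are arbitrary choices.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Run the check quietly; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        # Pause only if another attempt remains.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "gave up after $attempts attempt(s)" >&2
    return 1
}
```

For example, `wait_for 30 10 docker exec trino trino --server http://localhost:8080 --execute "SELECT 1"` would retry the Trino check from this step every 10 seconds, up to 30 times. The `-it` flags are dropped there because the helper does not run the command on an interactive terminal.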
Method 2: Manual Installation¶
This method is for advanced users who want to customize the installation.
Prerequisites¶
- Python 3.8+
- Java 11+
- Node.js 16+
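The minimum versions above can be checked before installing anything else. This is a rough sketch, assuming a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS versions do) and a `grep` that supports `-o`. The helper names `version_ge`, `first_version`, and `check_tool` are hypothetical, and the version parsing is deliberately simple, so unusual version banners may need adjusting.

```shell
# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; the smaller one comes first.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Extract the first (optionally dotted) number from version output on stdin.
first_version() {
    grep -oE '[0-9]+(\.[0-9]+)*' | head -n 1
}

# Report whether a tool meets a minimum version.
# Usage: check_tool LABEL MINIMUM COMMAND [ARGS...]
check_tool() {
    name="$1"
    minimum="$2"
    shift 2
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$name: not found (need $minimum+)"
        return 1
    fi
    # Some tools (java) print their version to stderr, so merge the streams.
    found="$("$@" 2>&1 | first_version)"
    if version_ge "$found" "$minimum"; then
        echo "$name $found: ok"
    else
        echo "$name $found: too old (need $minimum+)"
        return 1
    fi
}
```

The three prerequisites could then be checked with `check_tool Python 3.8 python3 --version`, `check_tool Java 11 java -version`, and `check_tool Node.js 16 node --version`.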
Step 1: Install Core Services¶
# Install PostgreSQL
sudo apt-get install postgresql postgresql-contrib
# Install Java
sudo apt-get install openjdk-11-jdk
# Install Python dependencies
pip install -r requirements.txt
Step 2: Configure Services¶
# Set up PostgreSQL databases
sudo -u postgres psql -c "CREATE DATABASE hive_metastore;"
sudo -u postgres psql -c "CREATE DATABASE ranger;"
sudo -u postgres psql -c "CREATE DATABASE hue;"
# Configure Hive Metastore
cp hive/conf/metastore-site.xml.template hive/conf/metastore-site.xml
# Edit metastore-site.xml with your configuration
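PostgreSQL's `CREATE DATABASE` fails if the database already exists, so rerunning the database setup commands above produces errors on a second pass. The sketch below checks the `pg_database` catalog first. `ensure_database` and the `PSQL` override are hypothetical names, and the database name is interpolated directly into SQL, so it should only be used with trusted, fixed names such as the three in this guide.

```shell
# Command used to reach PostgreSQL; override PSQL to match your setup.
PSQL="${PSQL:-sudo -u postgres psql}"

# Create a database only if it does not exist yet.
# Usage: ensure_database NAME   (NAME must be a trusted identifier)
ensure_database() {
    db_name="$1"
    # -t (tuples only) and -A (unaligned) make psql print just "1" or nothing.
    exists="$($PSQL -tAc "SELECT 1 FROM pg_database WHERE datname = '$db_name'")"
    if [ "$exists" = "1" ]; then
        echo "database $db_name already exists, skipping"
    else
        $PSQL -c "CREATE DATABASE $db_name;"
    fi
}
```

A loop such as `for db in hive_metastore ranger hue; do ensure_database "$db"; done` would then be safe to run repeatedly.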
Step 3: Start Services¶
# Start PostgreSQL
sudo systemctl start postgresql
# Start Hive Metastore
./hive/bin/hive --service metastore &
# Start Trino
./trino/bin/launcher start
# Start Dagster
dagster dev -w workspace.yaml
Configuration¶
Environment Variables¶
| Variable | Description | Default |
|---|---|---|
| `POSTGRES_USER` | PostgreSQL username | postgres |
| `POSTGRES_PASSWORD` | PostgreSQL password | postgrespassword |
| `MINIO_ROOT_USER` | MinIO admin username | minioadmin |
| `MINIO_ROOT_PASSWORD` | MinIO admin password | minioadmin |
| `SUPERSET_ADMIN_USERNAME` | Superset admin username | admin |
| `SUPERSET_ADMIN_PASSWORD` | Superset admin password | admin |
Service Configuration¶
Trino Configuration¶
Edit trino/etc/config.properties:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
Spark Configuration¶
Edit spark/conf/spark-defaults.conf:
spark.master=spark://spark-master:7077
spark.executor.memory=1g
spark.driver.memory=1g
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
Post-Installation Setup¶
1. Initialize Data Sources¶
# Run initial data ingestion
docker-compose exec dagster dagster job execute -j asset_property_pipeline
docker-compose exec dagster dagster job execute -j flight_radar_pipeline
docker-compose exec dagster dagster job execute -j ecommerce_pipeline
2. Set Up Superset Dashboards¶
# Import dashboards
docker-compose exec superset-bi superset import-dashboards -p /app/ecommerce_dashboard.zip
3. Configure Security¶
Verification¶
Test Data Pipeline¶
# Check data ingestion
docker-compose exec dagster dagster asset materialize -a asset_property_bronze
# Verify data in MinIO
docker-compose exec minio-setup-bucket /usr/bin/mc ls minio/datalake/bronze/
# Test Trino queries
docker-compose exec trino trino --execute "SHOW CATALOGS"
Test Analytics¶
# Query asset property data
docker-compose exec trino trino --catalog asset_property --execute "SELECT COUNT(*) FROM asset_property_sales"
# Query flight radar data
docker-compose exec trino trino --catalog flight_radar --execute "SELECT COUNT(*) FROM airlines"
Troubleshooting¶
Common Issues¶
- Services not starting: Check Docker logs and ensure all prerequisites are met
- Database connection errors: Verify PostgreSQL is running and accessible
- Port conflicts: Check if required ports are available
- Memory issues: Increase Docker memory allocation
Logs and Debugging¶
# View all logs
docker-compose logs
# View specific service logs
docker-compose logs trino
docker-compose logs dagster
docker-compose logs spark-master
# Debug specific issues
docker-compose exec trino trino --debug
docker-compose exec dagster dagster --debug
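Full logs can be long, so filtering for error-like lines is often a quicker first pass. The sketch below is one way to do that; `filter_errors` and `scan_services` are hypothetical helper names, and the keyword list is an assumption that may need tuning for each service's log format.

```shell
# Keep only lines that look like errors; reads log text on stdin.
filter_errors() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -iE 'error|exception|fatal' || true
}

# Show error-like lines from the last 200 log lines of each named service.
# Usage: scan_services SERVICE [SERVICE...]
scan_services() {
    for service in "$@"; do
        echo "== $service =="
        docker-compose logs --no-color --tail=200 "$service" 2>&1 | filter_errors
    done
}
```

For the services named above, this could be run as `scan_services trino dagster spark-master`.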
Performance Tuning¶
For production deployments:
- Increase memory allocation for Docker
- Configure JVM settings for Java services
- Optimize Spark configuration for your workload
- Set up monitoring with Prometheus and Grafana
Next Steps¶
After successful installation:
- Explore the Platform: Visit http://localhost:8000 for documentation
- Run Sample Pipelines: Follow the Data Domains guides
- Monitor Operations: Use Dagster Orchestration
- Set Up Production: Review the Deployment Guide
Created: October 3, 2025