
🏗️ Iceberg Data Engineering Platform

Apache Iceberg · Apache Spark · Trino · dbt · Dagster · Apache Superset · MinIO · PostgreSQL

📋 Project Overview

This project implements a lakehouse architecture on open-source technologies and shows how to build a scalable, cost-effective data platform. A lakehouse combines the low-cost, open storage of a data lake with the transactional guarantees and query performance of a data warehouse, so storage, processing, and analytics share one copy of the data.

🎯 Key Features

  • Multi-domain Data Ingestion: Real estate, flight radar, and e-commerce data sources
  • ACID Transactions: Using Apache Iceberg for reliable data management
  • Distributed Processing: Apache Spark for scalable data transformations
  • SQL Analytics: Trino for high-performance querying
  • Data Orchestration: Dagster for workflow management
  • Business Intelligence: Apache Superset for visualization and dashboards
  • Security: Apache Ranger for access control and governance

🛠️ Technology Stack

Core Data Platform

  • Apache Iceberg - Table format for ACID transactions, schema evolution, and time travel
  • Apache Spark - Distributed data processing engine
  • Trino - High-performance distributed SQL query engine
  • MinIO - S3-compatible object storage

Data Engineering Tools

  • dbt - Data transformation and modeling
  • Dagster - Data orchestration and workflow management
  • Apache Superset - Data visualization and BI platform

Infrastructure & Security

  • Apache Ranger - Access control and governance
  • Hive Metastore - Metadata management
  • PostgreSQL - Configuration storage

Development & Testing

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose
  • Git

1. Clone and Setup

git clone <repository-url>
cd iceberg_data_engineering

2. Start the Platform

docker-compose up -d

3. Access Services

| Service | URL | Credentials |
|---|---|---|
| Trino | http://localhost:8080 | trino://trino@trino:8080 |
| Dagster | http://localhost:3030 | - |
| Apache Superset | http://localhost:8088 | admin/admin |
| MinIO Console | http://localhost:9001 | minioadmin/minioadmin |
| Hue | http://localhost:8888 | - |
| Apache Ranger | http://localhost:6080 | admin/admin |
| Spark Master | http://localhost:8081 | - |
| Documentation | http://localhost:8000 | - |
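
After `docker-compose up -d`, some containers may take a while to become reachable. The following is a minimal sketch of a reachability check using only the Python standard library. The URLs are copied from the table above; the helper names `is_up` and `report` are hypothetical and not part of this repository.

```python
import urllib.error
import urllib.request

# URLs copied from the service table above.
SERVICES = {
    "Trino": "http://localhost:8080",
    "Dagster": "http://localhost:3030",
    "Apache Superset": "http://localhost:8088",
    "MinIO Console": "http://localhost:9001",
    "Hue": "http://localhost:8888",
    "Apache Ranger": "http://localhost:6080",
    "Spark Master": "http://localhost:8081",
    "Documentation": "http://localhost:8000",
}


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied (for example 401 on a login page), so it is running.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False


def report(services: dict[str, str] = SERVICES) -> dict[str, bool]:
    """Probe every service once and return a name -> reachable mapping."""
    return {name: is_up(url) for name, url in services.items()}
```

Calling `report()` returns a dictionary such as `{"Trino": True, ...}`; any service mapped to `False` is probably still starting or failed to start, and its container logs are the next place to look.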

📊 Data Domains

🏠 Asset Property

  • Source: Real estate sales data
  • Pipeline: asset_property_pipeline.py
  • Models: Property sales, market trends

✈️ Flight Radar

  • Source: FlightRadar24 API
  • Pipeline: flight_radar_pipeline.py
  • Models: Airlines, airports, flight data

🛒 E-commerce

  • Source: Mock user product data
  • Pipeline: ecommerce_pipeline.py
  • Models: Customer orders, product analytics

🔄 Data Flow

graph TD
    A[External APIs] --> B[Dagster Orchestration]
    B --> C[Data Ingestion Ops]
    C --> D[MinIO Bronze Layer]
    D --> E[Data Preparation Ops]
    E --> F[MinIO Prepare Layer]
    F --> G[Delta Tables]
    G --> H[Success Sensor]
    H --> I[dbt Transformations]
    I --> J[MinIO Warehouse Layer]
    J --> K[Iceberg Tables]

    L[Hive Metastore] --> G
    M[Apache Ranger] --> I
    N[PostgreSQL] --> L

    subgraph "Storage Buckets"
        O[datalake/bronze: Raw Data]
        P[datalake/prepare: Delta Tables]
        Q[warehouse: Iceberg Tables]
    end

    subgraph "Trino Catalogs"
        R[spark_catalog: Preparation Tasks]
        S[flight_radar: Flight Analytics]
        T[ecommerce: E-commerce Analytics]
        U[asset_property: Property Analytics]
    end

    D --> O
    F --> P
    J --> Q

    G --> R
    K --> S
    K --> T
    K --> U

Data Pipeline Architecture

  1. Data Ingestion: External APIs → CSV/JSON/Parquet files in datalake/bronze
  2. Data Preparation: Bronze files → Delta tables in datalake/prepare
  3. Data Transformation: Delta tables → dbt models → Iceberg tables in warehouse
  4. Analytics: Domain-specific Trino catalogs for querying Iceberg tables
  5. Orchestration: Dagster manages the entire pipeline with sensors and schedules
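
To make the three storage stages above concrete, here is a small sketch that maps a dataset to its location in each layer. The bucket and prefix names (`datalake/bronze`, `datalake/prepare`, `warehouse`) come from the diagram; the `s3a://` scheme, the `<domain>/<dataset>` sub-layout, and the `Dataset` class are assumptions made for illustration, and the actual repository may organize paths differently.

```python
from dataclasses import dataclass

# Bucket/prefix names are taken from the data flow diagram. The s3a:// scheme
# and the <domain>/<dataset> layout below them are assumptions.
LAYERS = {
    "bronze": "s3a://datalake/bronze",    # raw CSV/JSON/Parquet files
    "prepare": "s3a://datalake/prepare",  # Delta tables
    "warehouse": "s3a://warehouse",       # Iceberg tables
}


@dataclass(frozen=True)
class Dataset:
    domain: str  # e.g. "flight_radar"
    name: str    # e.g. "flights" (hypothetical dataset name)

    def path(self, layer: str) -> str:
        """Return the storage location of this dataset in the given layer."""
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer!r}")
        return f"{LAYERS[layer]}/{self.domain}/{self.name}"

    def lineage(self) -> list[str]:
        """Locations the dataset passes through, in pipeline order."""
        return [self.path(layer) for layer in ("bronze", "prepare", "warehouse")]
```

For example, `Dataset("flight_radar", "flights").lineage()` lists the bronze, prepare, and warehouse locations in the order the pipeline writes them.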

🏗️ Architecture Components

Data Ingestion Layer

  • Custom Python modules for API data collection
  • MinIO Bronze Storage: Raw data stored as CSV/JSON/Parquet in datalake/bronze
  • Schema validation and data quality checks
  • Multi-format support: CSV, JSON, Parquet file outputs
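
As an illustration of the schema validation step, the sketch below checks records against a simple field-to-type schema before serializing the valid ones as JSON Lines for the bronze layer. The schema, its field names, and the helper functions are hypothetical; the project's real checks are not shown in this README and may be implemented differently.

```python
import json
from typing import Any

# Hypothetical schema for one flight record: field name -> required type.
FLIGHT_SCHEMA: dict[str, type] = {
    "flight_id": str,
    "airline": str,
    "origin": str,
    "destination": str,
    "altitude_ft": int,
}


def validate(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


def to_bronze_jsonl(records: list[dict[str, Any]], schema: dict[str, type]):
    """Return (JSON Lines text of valid records, list of (record, errors) rejects)."""
    good, rejected = [], []
    for rec in records:
        errs = validate(rec, schema)
        if errs:
            rejected.append((rec, errs))
        else:
            good.append(json.dumps(rec, sort_keys=True))
    return "\n".join(good), rejected
```

Keeping rejected records together with their error messages, rather than silently dropping them, makes it possible to report data quality problems per source.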

Data Preparation Layer

  • Apache Spark for distributed data processing
  • MinIO Prepare Storage: Processed data stored as Delta tables in datalake/prepare
  • Delta Lake Integration: ACID transactions and schema evolution
  • Data cleaning and standardization
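
For Spark to write Delta tables to MinIO, the session needs the Delta Lake extensions and an `s3a` filesystem pointed at the MinIO endpoint. The sketch below collects those settings in a plain dictionary. The configuration keys are standard Spark, Hadoop S3A, and Delta Lake properties; the function names are hypothetical, and the endpoint in the usage note (`http://minio:9000`) is an assumption, since this README only lists the console port.

```python
def delta_on_minio_conf(endpoint: str, access_key: str, secret_key: str) -> dict[str, str]:
    """Spark settings for Delta Lake tables stored on an S3-compatible endpoint."""
    return {
        # Enable Delta Lake's SQL extensions and catalog implementation.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        # Point the s3a filesystem at MinIO instead of AWS S3.
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO serves buckets as URL paths rather than as subdomains.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def as_submit_args(conf: dict[str, str]) -> list[str]:
    """Render the settings as --conf flags for spark-submit."""
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

A call such as `as_submit_args(delta_on_minio_conf("http://minio:9000", "minioadmin", "minioadmin"))` yields flags that can be appended to a `spark-submit` command. The same key/value pairs can instead be passed one by one to `SparkSession.builder.config(key, value)`. The Delta Lake and hadoop-aws JARs must also be on the classpath, which this sketch does not cover.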

Data Transformation Layer

  • dbt transformations for business logic and analytical models
  • Success sensors trigger dbt jobs after preparation completion
  • Iceberg table generation in warehouse bucket
  • Domain-specific catalogs for analytical querying
  • Analytical model creation from prepared Delta tables
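
The success-sensor behavior described above can be sketched in plain Python: on each tick, look at runs newer than a stored cursor and request the matching dbt job for every preparation run that succeeded. This is not Dagster's sensor API, only the trigger logic under stated assumptions; the `Run` class, the job names, and the `DBT_JOB_FOR` mapping are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: int    # monotonically increasing id (hypothetical)
    job_name: str  # e.g. "flight_radar_prepare" (hypothetical)
    status: str    # "SUCCESS" or "FAILURE"


# Hypothetical mapping from a preparation job to the dbt job it unlocks.
DBT_JOB_FOR = {
    "flight_radar_prepare": "flight_radar_dbt",
    "ecommerce_prepare": "ecommerce_dbt",
    "asset_property_prepare": "asset_property_dbt",
}


def success_sensor(runs: list[Run], cursor: int) -> tuple[list[str], int]:
    """Return (dbt jobs to launch, new cursor) for runs newer than the cursor."""
    requests = []
    new_cursor = cursor
    for run in sorted(runs, key=lambda r: r.run_id):
        if run.run_id <= cursor:
            continue  # already handled on an earlier tick
        new_cursor = run.run_id
        if run.status == "SUCCESS" and run.job_name in DBT_JOB_FOR:
            requests.append(DBT_JOB_FOR[run.job_name])
    return requests, new_cursor
```

Advancing the cursor past failed runs as well as successful ones keeps the sensor idempotent: evaluating it again with the returned cursor launches nothing until a new preparation run finishes.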

Analytics Layer

  • Trino for SQL analytics on Delta and Iceberg tables
  • Trino catalogs: spark_catalog for preparation tasks, plus the domain-specific flight_radar, ecommerce, and asset_property
  • Apache Superset for visualization and dashboards
  • Hue for ad-hoc queries
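
Because each domain has its own Trino catalog, queries address tables by a three-part `catalog.schema.table` name. The helper below is a minimal sketch that builds such names with Trino's double-quoted identifier syntax and rejects catalogs not listed above. The catalog names come from this README; the schema and table names in the usage note are hypothetical.

```python
# Catalog names as listed in this README.
CATALOGS = {"spark_catalog", "flight_radar", "ecommerce", "asset_property"}


def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a quoted catalog.schema.table identifier for Trino."""
    if catalog not in CATALOGS:
        raise ValueError(f"unknown catalog: {catalog!r}")
    # Double-quote each part and escape embedded double quotes by doubling them.
    return ".".join('"' + part.replace('"', '""') + '"' for part in (catalog, schema, table))


def count_query(catalog: str, schema: str, table: str) -> str:
    """Return a row-count query against the given table."""
    return f"SELECT count(*) AS row_count FROM {qualified(catalog, schema, table)}"
```

For instance, `count_query("flight_radar", "analytics", "airlines")` produces a statement that can be pasted into Hue, a Superset SQL editor, or the Trino CLI, assuming a schema and table with those names exist.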

Security & Governance

  • Apache Ranger for access control
  • Hive Metastore for metadata management
  • PostgreSQL for configuration storage and as the backing database of the Hive Metastore

🚧 Roadmap

  • CI/CD Pipeline for GCP deployment
  • Data Quality Monitoring with Great Expectations
  • Streaming Data with Apache Kafka
  • ML Pipeline integration
  • Multi-cloud deployment options

🤝 Contributing

See Development Guide for detailed contribution instructions.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Last update: October 3, 2025
Created: June 15, 2025