🏗️ Iceberg Data Engineering Platform¶
📋 Project Overview¶
This project implements a lakehouse architecture built from open-source components. A lakehouse combines the low-cost, flexible storage of a data lake with the transactional tables and SQL analytics of a data warehouse, so raw files, prepared tables, and analytical models all live on the same object store.
🎯 Key Features¶
- Multi-domain Data Ingestion: Real estate, flight radar, and e-commerce data sources
- ACID Transactions: Using Apache Iceberg for reliable data management
- Distributed Processing: Apache Spark for scalable data transformations
- SQL Analytics: Trino for high-performance querying
- Data Orchestration: Dagster for workflow management
- Business Intelligence: Apache Superset for visualization and dashboards
- Security: Apache Ranger for access control and governance
🛠️ Technology Stack¶
Core Data Platform¶
- Apache Iceberg - Table format for ACID transactions, schema evolution, and time travel
- Apache Spark - Distributed data processing engine
- Trino - High-performance distributed SQL query engine
- MinIO - S3-compatible object storage
Data Engineering Tools¶
- dbt - Data transformation and modeling
- Dagster - Data orchestration and workflow management
- Apache Superset - Data visualization and BI platform
Infrastructure & Security¶
- PostgreSQL - Metadata storage and configuration
- Hive Metastore - Table catalog and metadata management
- Apache Ranger - Security and access control
- Hue - SQL query interface
Development & Testing¶
- Docker - Containerization and deployment
- pytest - Testing framework
- pre-commit - Code quality hooks
🚀 Quick Start¶
Prerequisites¶
- Docker and Docker Compose
- Git
1. Clone and Setup¶
2. Start the Platform¶
3. Access Services¶
| Service | URL | Credentials |
|---|---|---|
| Trino | http://localhost:8080 | trino://trino@trino:8080 |
| Dagster | http://localhost:3030 | - |
| Apache Superset | http://localhost:8088 | admin/admin |
| MinIO Console | http://localhost:9001 | minioadmin/minioadmin |
| Hue | http://localhost:8888 | - |
| Apache Ranger | http://localhost:6080 | admin/admin |
| Spark Master | http://localhost:8081 | - |
| Documentation | http://localhost:8000 | - |
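Once the containers are running, a quick reachability check can confirm that each UI is listening. The following is a minimal sketch using only the Python standard library. The `SERVICES` mapping copies the URLs from the table above; `default_fetch`, `check_services`, and the `fetch` parameter are hypothetical names introduced for illustration and are not part of the project.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# URLs copied from the table above; adjust them if your ports differ.
SERVICES = {
    "Trino": "http://localhost:8080",
    "Dagster": "http://localhost:3030",
    "Apache Superset": "http://localhost:8088",
    "MinIO Console": "http://localhost:9001",
    "Hue": "http://localhost:8888",
    "Apache Ranger": "http://localhost:6080",
    "Spark Master": "http://localhost:8081",
    "Documentation": "http://localhost:8000",
}


def default_fetch(url, timeout=3.0):
    """Return True if the URL answers with any HTTP response."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except HTTPError:
        # The server answered (for example 401 or 404), so it is up.
        return True
    except (URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False


def check_services(services=SERVICES, fetch=default_fetch):
    """Map each service name to True (reachable) or False (not reachable)."""
    return {name: fetch(url) for name, url in services.items()}
```

Calling `check_services()` with no arguments probes every URL in the table. The `fetch` parameter exists so the function can be exercised without a running stack, for example by passing a stub that always returns `True`.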
📊 Data Domains¶
🏠 Asset Property¶
- Source: Real estate sales data
- Pipeline: asset_property_pipeline.py
- Models: Property sales, market trends
✈️ Flight Radar¶
- Source: FlightRadar24 API
- Pipeline: flight_radar_pipeline.py
- Models: Airlines, airports, flight data
🛒 E-commerce¶
- Source: Mock user product data
- Pipeline: ecommerce_pipeline.py
- Models: Customer orders, product analytics
🔄 Data Flow¶
graph TD
A[External APIs] --> B[Dagster Orchestration]
B --> C[Data Ingestion Ops]
C --> D[MinIO Bronze Layer]
D --> E[Data Preparation Ops]
E --> F[MinIO Prepare Layer]
F --> G[Delta Tables]
G --> H[Success Sensor]
H --> I[dbt Transformations]
I --> J[MinIO Warehouse Layer]
J --> K[Iceberg Tables]
L[Hive Metastore] --> G
M[Apache Ranger] --> I
N[PostgreSQL] --> L
subgraph "Storage Buckets"
O[datalake/bronze: Raw Data]
P[datalake/prepare: Delta Tables]
Q[warehouse: Iceberg Tables]
end
subgraph "Trino Catalogs"
R[spark_catalog: Preparation Tasks]
S[flight_radar: Flight Analytics]
T[ecommerce: E-commerce Analytics]
U[asset_property: Property Analytics]
end
D --> O
F --> P
J --> Q
G --> R
K --> S
K --> T
K --> U
Data Pipeline Architecture¶
- Data Ingestion: External APIs → CSV/JSON/Parquet files in datalake/bronze
- Data Preparation: Bronze files → Delta tables in datalake/prepare
- Data Transformation: Delta tables → dbt models → Iceberg tables in warehouse
- Analytics: Domain-specific Trino catalogs for querying Iceberg tables
- Orchestration: Dagster manages the entire pipeline with sensors and schedules
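The layer layout above can be captured in a small path helper so that every pipeline step resolves storage locations the same way. This is a sketch under stated assumptions: the bucket and prefix names (`datalake/bronze`, `datalake/prepare`, `warehouse`) come from this page, but the `s3a://` scheme, the one-subfolder-per-domain layout, and the names `LAYERS`, `FLOW`, `layer_uri`, and `next_layer` are illustrative and may not match the project's code.

```python
# Bucket/prefix per layer, as described in this document.
LAYERS = {
    "bronze": "datalake/bronze",    # raw CSV/JSON/Parquet files
    "prepare": "datalake/prepare",  # Delta tables
    "warehouse": "warehouse",       # Iceberg tables
}

# Order in which data moves through the layers.
FLOW = ("bronze", "prepare", "warehouse")


def layer_uri(layer, domain, name):
    """Build an object-store URI for a dataset in the given layer.

    Assumes (hypothetically) one subfolder per domain inside each layer.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"s3a://{LAYERS[layer]}/{domain}/{name}"


def next_layer(layer):
    """Return the layer that follows `layer`, or None at the end of the flow."""
    idx = FLOW.index(layer)
    return FLOW[idx + 1] if idx + 1 < len(FLOW) else None
```

For example, `layer_uri("bronze", "flight_radar", "flights.json")` yields `s3a://datalake/bronze/flight_radar/flights.json`, and `next_layer("prepare")` returns `"warehouse"`.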
🏗️ Architecture Components¶
Data Ingestion Layer¶
- Custom Python modules for API data collection
- MinIO Bronze Storage: Raw data stored as CSV/JSON/Parquet in datalake/bronze
- Schema validation and data quality checks
- Multi-format support: CSV, JSON, Parquet file outputs
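As an illustration of the schema validation and multi-format output listed above, here is a minimal standard-library sketch. The field names, the `REQUIRED_FIELDS` schema, and the helper functions are hypothetical; the project's actual ingestion modules and checks may differ. Parquet output is left out because it needs a third-party library such as pyarrow.

```python
import csv
import io
import json

# Hypothetical schema for a flight record: field name -> expected type.
REQUIRED_FIELDS = {"flight_id": str, "airline": str, "altitude_ft": int}


def validate(record, schema=REQUIRED_FIELDS):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems


def to_json_lines(records):
    """Serialize records as newline-delimited JSON."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def to_csv(records, schema=REQUIRED_FIELDS):
    """Serialize records as CSV with a header row, in the schema's field order."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=list(schema),
        extrasaction="ignore",  # drop fields that are not in the schema
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

A typical flow would validate each record first, route invalid ones to a reject list, and write only the valid ones to the bronze layer in the chosen format.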
Data Preparation Layer¶
- Apache Spark for distributed data processing
- MinIO Prepare Storage: Processed data stored as Delta tables in datalake/prepare
- Delta Lake Integration: ACID transactions and schema evolution
- Data cleaning and standardization
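In this project the cleaning and standardization step runs in Spark. The sketch below shows the same kind of logic on plain Python dictionaries so that it runs without a cluster; it is a stand-in, not the project's Spark code. The rules used here (snake_case column names, trimmed strings, empty strings turned into `None`, exact-duplicate removal) are assumed examples.

```python
import re


def snake_case(name):
    """Normalize a column name such as 'Sale Price' or 'salePrice' to 'sale_price'."""
    # Insert an underscore at lower-to-upper boundaries (camelCase).
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse any run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


def clean_record(record):
    """Standardize keys and trim string values; empty strings become None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[snake_case(key)] = value
    return cleaned


def clean_records(records):
    """Clean every record and drop exact duplicates, keeping first occurrences.

    Assumes all values are hashable scalars (str, int, float, None).
    """
    seen = set()
    result = []
    for record in records:
        cleaned = clean_record(record)
        fingerprint = tuple(sorted(cleaned.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(cleaned)
    return result
```

Deduplicating after normalization matters: two rows that differ only in whitespace or column-name casing collapse into one once they have been cleaned.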
Data Transformation Layer¶
- dbt transformations for business logic and analytical models
- Success sensors trigger dbt jobs after preparation completion
- Iceberg table generation in warehouse bucket
- Domain-specific catalogs for analytical querying
- Analytical model creation from prepared Delta tables
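The sensor-driven handoff from preparation to dbt can be pictured as follows. This is a plain-Python model of the pattern, not Dagster code: the `Run` record, the job names in `DOWNSTREAM`, and `success_sensor` are hypothetical stand-ins, and the sketch assumes it is only given runs that have already finished. The cursor is what keeps the same successful run from triggering dbt twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: int    # increasing id, assumed for this sketch
    job_name: str
    status: str    # "SUCCESS" or "FAILURE"


# Hypothetical mapping: preparation job -> dbt job to trigger afterwards.
DOWNSTREAM = {
    "flight_radar_prepare": "flight_radar_dbt",
    "ecommerce_prepare": "ecommerce_dbt",
}


def success_sensor(runs, cursor=0):
    """Return (dbt jobs to request, new cursor).

    Only runs with run_id greater than the cursor are inspected, so each
    successful preparation run requests its dbt job at most once.
    """
    requests = []
    new_cursor = cursor
    for run in sorted(runs, key=lambda r: r.run_id):
        if run.run_id <= cursor:
            continue  # already handled in an earlier evaluation
        new_cursor = run.run_id
        if run.status == "SUCCESS" and run.job_name in DOWNSTREAM:
            requests.append(DOWNSTREAM[run.job_name])
    return requests, new_cursor
```

Evaluating the sensor a second time with the returned cursor yields no new requests, which is the behavior one wants from a polling trigger.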
Analytics Layer¶
- Trino for SQL analytics on Delta and Iceberg tables
- Domain-specific catalogs: spark_catalog, flight_radar, ecommerce, asset_property
- Apache Superset for visualization and dashboards
- Hue for ad-hoc queries
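To show how the domain catalogs are addressed from SQL, the sketch below builds a fully qualified Trino table name (`catalog.schema.table`) and checks the catalog against the list above. The schema and table names used with it are hypothetical. `run_query` additionally assumes that the third-party `trino` Python client is installed and that Trino is reachable at `localhost:8080` with user `trino`, as the service table in Quick Start suggests.

```python
import re

# Catalog names taken from this document.
CATALOGS = ("spark_catalog", "flight_radar", "ecommerce", "asset_property")

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_name(catalog, schema, table):
    """Return 'catalog.schema.table' after basic validation."""
    if catalog not in CATALOGS:
        raise ValueError(f"unknown catalog: {catalog!r}")
    for part in (schema, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"invalid identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"


def count_query(catalog, schema, table):
    """Build a row-count query for a table in one of the domain catalogs."""
    return f"SELECT count(*) FROM {qualified_name(catalog, schema, table)}"


def run_query(sql, host="localhost", port=8080, user="trino"):
    """Execute SQL through the `trino` client (pip install trino).

    Host, port, and user are assumptions based on the service table.
    """
    from trino.dbapi import connect  # third-party; imported lazily

    conn = connect(host=host, port=port, user=user)
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()
```

Validating identifiers before interpolating them keeps the helper safe to use with names that come from configuration rather than from literals in the code.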
Security & Governance¶
- Apache Ranger for access control
- Hive Metastore for metadata management
- PostgreSQL for configuration storage
🚧 Roadmap¶
- CI/CD Pipeline for GCP deployment
- Data Quality Monitoring with Great Expectations
- Streaming Data with Apache Kafka
- ML Pipeline integration
- Multi-cloud deployment options
🤝 Contributing¶
See Development Guide for detailed contribution instructions.
📄 License¶
This project is licensed under the MIT License - see the LICENSE file for details.
Last update: October 3, 2025
Created: June 15, 2025