Data Flow Architecture¶
This document describes the data flow architecture of the Iceberg Data Engineering Platform, detailing how data moves through the system from ingestion to analytics.
Overview¶
The platform implements a modern lakehouse architecture with three main layers:
- Bronze Layer: Raw data ingestion and storage
- Silver Layer: Data preparation and cleaning
- Gold Layer: Analytics-ready data with business logic
Data Flow Diagram¶
graph TD
A[External APIs] --> B[Dagster Orchestration]
B --> C[Data Ingestion Ops]
C --> D[MinIO Bronze Layer]
D --> E[Data Preparation Ops]
E --> F[MinIO Silver Layer]
F --> G[Delta Tables]
G --> H[Success Sensor]
H --> I[dbt Transformations]
I --> J[MinIO Gold Layer]
J --> K[Iceberg Tables]
L[Hive Metastore] --> G
M[Apache Ranger] --> I
N[PostgreSQL] --> L
subgraph "Storage Buckets"
O[datalake/bronze: Raw Data]
P[datalake/silver: Delta Tables]
Q[warehouse: Iceberg Tables]
end
subgraph "Trino Catalogs"
R[spark_catalog: Preparation Tasks]
S[flight_radar: Flight Analytics]
T[ecommerce: E-commerce Analytics]
U[asset_property: Property Analytics]
end
D --> O
F --> P
J --> Q
G --> R
K --> S
K --> T
K --> U
Data Layers¶
Bronze Layer (Raw Data)¶
Purpose: Store raw, unprocessed data from external sources
Characteristics:
- Immutable data storage
- Multiple file formats (CSV, JSON, Parquet)
- Schema-on-read approach
- Data lineage tracking
Storage: MinIO datalake/bronze bucket
Data Sources:
- Real estate APIs
- FlightRadar24 API
- E-commerce mock data
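One way to keep Bronze data immutable is to write every ingested file under a unique, timestamped object key so that a new run never overwrites an earlier one. The sketch below illustrates that idea only. The key layout, the `bronze_object_key` helper, and the `flightradar24_api` source name are assumptions made for illustration, not the platform's actual naming scheme.

```python
from datetime import datetime, timezone

# Formats listed for the Bronze layer above.
ALLOWED_EXTENSIONS = {"csv", "json", "parquet"}


def bronze_object_key(domain: str, source: str, ingested_at: datetime, ext: str) -> str:
    """Build a timestamped object key (hypothetical layout) for a raw file."""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported bronze format: {ext}")
    ts = ingested_at.astimezone(timezone.utc)  # normalize to UTC before formatting
    return (
        f"bronze/{domain}/{source}/"
        f"ingest_date={ts:%Y-%m-%d}/{ts:%Y%m%dT%H%M%SZ}.{ext}"
    )


key = bronze_object_key(
    "flight_radar",
    "flightradar24_api",  # hypothetical source name
    datetime(2025, 10, 3, 12, 30, tzinfo=timezone.utc),
    "json",
)
print(key)
```

Because the ingestion timestamp is part of the key, re-running an ingestion produces a new object rather than replacing an existing one.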
Silver Layer (Prepared Data)¶
Purpose: Clean, validate, and standardize data
Characteristics:
- Data quality checks
- Schema validation
- Data type standardization
- Duplicate removal
Storage: MinIO datalake/silver bucket with Delta Lake format
Processing: Apache Spark for distributed data processing
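The Silver-layer steps (type standardization, validation, duplicate removal) can be illustrated on a handful of records. The platform runs this work in Apache Spark; the sketch below uses plain Python only to show the logic, and the `sale_id`, `price`, and `sale_date` fields are assumed for the example.

```python
from datetime import date


def standardize(record: dict) -> dict:
    """Coerce raw string fields to typed values (data type standardization)."""
    return {
        "sale_id": str(record["sale_id"]).strip(),
        "price": float(record["price"]),
        "sale_date": date.fromisoformat(str(record["sale_date"]).strip()),
    }


def prepare(raw_records: list[dict]) -> list[dict]:
    """Standardize records, drop invalid ones, and remove duplicates by sale_id."""
    seen: set[str] = set()
    cleaned = []
    for raw in raw_records:
        try:
            row = standardize(raw)
        except (KeyError, TypeError, ValueError):
            continue  # record fails schema or type validation
        if row["sale_id"] in seen:
            continue  # duplicate removal: keep the first occurrence
        seen.add(row["sale_id"])
        cleaned.append(row)
    return cleaned


raw = [
    {"sale_id": "A1", "price": "250000", "sale_date": "2025-09-01"},
    {"sale_id": "A1", "price": "250000", "sale_date": "2025-09-01"},  # duplicate
    {"sale_id": "A2", "price": "n/a", "sale_date": "2025-09-02"},  # bad type
    {"sale_id": "A3", "price": "310000.50", "sale_date": "2025-09-03"},
]
silver_rows = prepare(raw)
print([row["sale_id"] for row in silver_rows])
```

A production job would typically route rejected records to a quarantine location instead of dropping them silently.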
Gold Layer (Analytics Data)¶
Purpose: Business-ready data for analytics and reporting
Characteristics:
- Business logic applied
- Aggregated metrics
- Optimized for query performance
- ACID transactions
Storage: MinIO warehouse bucket with Apache Iceberg format
Processing: dbt transformations
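As an illustration of the kind of aggregation the Gold layer holds, the sketch below derives monthly market metrics from individual sales. In the platform this logic lives in dbt models; the Python version only shows the shape of the computation, and the `region`, `sale_date`, and `price` columns are assumptions.

```python
from collections import defaultdict
from statistics import mean


def market_trends(property_sales: list[dict]) -> list[dict]:
    """Aggregate individual sales into per-region, per-month metrics."""
    groups: dict[tuple[str, str], list[float]] = defaultdict(list)
    for sale in property_sales:
        month = sale["sale_date"][:7]  # "YYYY-MM" prefix of an ISO date string
        groups[(sale["region"], month)].append(sale["price"])
    return [
        {
            "region": region,
            "month": month,
            "sales_count": len(prices),
            "avg_price": mean(prices),
        }
        for (region, month), prices in sorted(groups.items())
    ]


sales = [
    {"region": "north", "sale_date": "2025-09-01", "price": 200000.0},
    {"region": "north", "sale_date": "2025-09-15", "price": 300000.0},
    {"region": "south", "sale_date": "2025-09-20", "price": 150000.0},
]
trends = market_trends(sales)
print(trends)
```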
Data Domains¶
Asset Property Domain¶
Data Flow:
1. Real estate API → Bronze layer (CSV/JSON)
2. Data cleaning → Silver layer (Delta tables)
3. Business transformations → Gold layer (Iceberg tables)
Key Tables:
- property_sales: Individual property transactions
- market_trends: Aggregated market metrics
- property_features: Property characteristics
Flight Radar Domain¶
Data Flow:
1. FlightRadar24 API → Bronze layer (JSON)
2. Data normalization → Silver layer (Delta tables)
3. Flight analytics → Gold layer (Iceberg tables)
Key Tables:
- airlines: Airline information
- airports: Airport data
- flights: Flight tracking data
- routes: Flight routes and schedules
E-commerce Domain¶
Data Flow:
1. Mock data generation → Bronze layer (CSV)
2. Customer data processing → Silver layer (Delta tables)
3. Business metrics → Gold layer (Iceberg tables)
Key Tables:
- customers: Customer information
- orders: Order transactions
- products: Product catalog
- order_items: Order line items
Orchestration¶
Dagster Workflows¶
Asset Property Pipeline:
from dagster import asset


@asset
def asset_property_bronze():
    # Ingest real estate data into the Bronze layer
    pass


@asset(deps=[asset_property_bronze])
def asset_property_silver():
    # Clean and prepare data; runs after the Bronze asset
    pass


@asset(deps=[asset_property_silver])
def asset_property_gold():
    # Apply business transformations; runs after the Silver asset
    pass
Flight Radar Pipeline:
from dagster import asset


@asset
def flight_radar_bronze():
    # Ingest flight data into the Bronze layer
    pass


@asset(deps=[flight_radar_bronze])
def flight_radar_silver():
    # Process flight data; runs after the Bronze asset
    pass


@asset(deps=[flight_radar_silver])
def flight_radar_gold():
    # Create analytics tables; runs after the Silver asset
    pass
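The `deps` arguments above form a dependency graph, and the orchestrator runs each asset only after its upstream assets. Dagster resolves this order itself; the sketch below reproduces the idea with Python's standard `graphlib` module purely to make the ordering explicit. The `ASSET_DEPS` mapping and `execution_order` helper are illustrative and not part of the platform.

```python
from graphlib import TopologicalSorter

# Each asset maps to the set of assets it depends on,
# mirroring the deps=[...] arguments of the Flight Radar pipeline.
ASSET_DEPS = {
    "flight_radar_bronze": set(),
    "flight_radar_silver": {"flight_radar_bronze"},
    "flight_radar_gold": {"flight_radar_silver"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every asset comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(ASSET_DEPS)
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the mapping contains a circular dependency, which is a quick way to sanity-check a hand-written dependency map.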
Data Quality¶
Validation Rules:
- Schema validation
- Data type checks
- Range validation
- Completeness checks
- Uniqueness constraints
Monitoring:
- Data freshness metrics
- Quality score tracking
- Alert notifications
- Data lineage visualization
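The validation rules above can be combined into a single pass that also yields a quality score. The following is a minimal sketch, assuming the score is defined as the share of rows that pass every rule; the `QualityReport` class, the `validate` function, and the sample columns are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QualityReport:
    total: int
    failed_rows: int = 0
    failures: dict[str, int] = field(default_factory=dict)  # failed rows per rule

    @property
    def score(self) -> float:
        """Share of rows that passed every rule (1.0 means all rows passed)."""
        return 1.0 if self.total == 0 else (self.total - self.failed_rows) / self.total


def validate(rows, required, unique_key, ranges) -> QualityReport:
    report = QualityReport(total=len(rows))
    seen_keys = set()
    for row in rows:
        problems = set()
        if any(row.get(col) is None for col in required):
            problems.add("completeness")
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is None:
                continue  # missing values are handled by the completeness rule
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.add("data_type")
            elif not low <= value <= high:
                problems.add("range")
        key = row.get(unique_key)
        if key is not None:
            if key in seen_keys:
                problems.add("uniqueness")
            seen_keys.add(key)
        for rule in problems:
            report.failures[rule] = report.failures.get(rule, 0) + 1
        if problems:
            report.failed_rows += 1
    return report


rows = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -5.0},  # out of range
    {"order_id": 2, "amount": 10.0},  # duplicate key
    {"order_id": 4, "amount": None},  # incomplete
]
report = validate(rows, required=["order_id", "amount"],
                  unique_key="order_id", ranges={"amount": (0, 10_000)})
print(report.score, report.failures)
```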
Storage Architecture¶
MinIO Object Storage¶
Bucket Structure:
datalake/
├── bronze/
│   ├── asset_property/
│   ├── flight_radar/
│   └── ecommerce/
├── silver/
│   ├── asset_property/
│   ├── flight_radar/
│   └── ecommerce/
└── warehouse/
    ├── asset_property/
    ├── flight_radar/
    └── ecommerce/
File Formats:
- Bronze: CSV, JSON, Parquet
- Silver: Delta Lake (Parquet + metadata)
- Gold: Apache Iceberg (Parquet + metadata)
Metadata Management¶
Hive Metastore:
- Table schemas
- Partition information
- Data lineage
- Access permissions
Apache Ranger:
- Security policies
- Access control
- Audit logging
- Data governance
Query Engines¶
Trino Catalogs¶
spark_catalog: Silver layer queries
- Delta table access
- Data preparation tasks
- ETL monitoring
Domain Catalogs: Gold layer analytics
- asset_property: Property analytics
- flight_radar: Flight analytics
- ecommerce: E-commerce analytics
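The catalog split above implies a simple routing rule: Silver tables are reached through `spark_catalog`, and Gold tables through the catalog named after their domain. The helper below is a sketch of that rule. The `qualified_table` function and the `default` schema name are assumptions; the actual schema names are not specified here.

```python
DOMAINS = {"asset_property", "flight_radar", "ecommerce"}


def qualified_table(layer: str, domain: str, table: str, schema: str = "default") -> str:
    """Return a catalog.schema.table name for a Trino query (schema is hypothetical)."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    if layer == "silver":
        catalog = "spark_catalog"  # Delta tables from the preparation stage
    elif layer == "gold":
        catalog = domain  # one Trino catalog per analytics domain
    else:
        raise ValueError(f"no Trino catalog is described for layer {layer!r}")
    return f"{catalog}.{schema}.{table}"


print(qualified_table("gold", "flight_radar", "flights"))
print(qualified_table("silver", "ecommerce", "orders"))
```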
Performance Optimization¶
Partitioning Strategy:
- Date-based partitioning
- Geographic partitioning (for property data)
- Airline-based partitioning (for flight data)
Indexing:
- Primary key indexes
- Foreign key indexes
- Query-specific indexes
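To make the partitioning strategy concrete, the sketch below derives partition values for a row: a date for every domain, plus a region for property data or an airline for flight data. The column names (`event_date`, `region`, `airline`) and the `key=value` rendering are assumptions for illustration; the table format decides how files are actually laid out.

```python
from datetime import date


def partition_values(domain: str, row: dict) -> dict[str, str]:
    """Pick the partition columns that apply to a domain (hypothetical columns)."""
    values = {"event_date": row["event_date"].isoformat()}  # date-based, all domains
    if domain == "asset_property":
        values["region"] = row["region"]  # geographic partitioning
    elif domain == "flight_radar":
        values["airline"] = row["airline"]  # airline-based partitioning
    return values


def partition_path(domain: str, row: dict) -> str:
    """Render partition values as a key=value path segment."""
    return "/".join(f"{k}={v}" for k, v in partition_values(domain, row).items())


flight = {"event_date": date(2025, 10, 3), "airline": "KLM"}
print(partition_path("flight_radar", flight))
```

Queries that filter on these columns can skip partitions that cannot contain matching rows, which is the main benefit of choosing partition columns that match common filters.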
Monitoring and Observability¶
Data Lineage¶
Tracking:
- Source to destination mapping
- Transformation logic
- Data quality metrics
- Processing timestamps
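The tracked attributes above map naturally onto one record per processing step. The following sketch shows such a record and how a chain of them can be walked back from a Gold location to its raw source. The `LineageEvent` class, the `upstream_chain` helper, and the location strings are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    source: str
    destination: str
    transformation: str
    quality_score: float
    processed_at: datetime


def upstream_chain(events: list[LineageEvent], target: str) -> list[str]:
    """Walk lineage events backwards from target to its original source."""
    by_destination = {event.destination: event for event in events}
    chain = [target]
    while chain[-1] in by_destination:
        parent = by_destination[chain[-1]].source
        if parent in chain:
            raise ValueError("cycle in lineage events")
        chain.append(parent)
    return list(reversed(chain))


now = datetime(2025, 10, 3, tzinfo=timezone.utc)
events = [
    LineageEvent("datalake/bronze/asset_property", "datalake/silver/asset_property",
                 "data cleaning", 0.98, now),
    LineageEvent("datalake/silver/asset_property", "warehouse/asset_property",
                 "business transformations", 0.99, now),
]
print(upstream_chain(events, "warehouse/asset_property"))
```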
Performance Metrics¶
Key Metrics:
- Data ingestion rate
- Processing latency
- Query performance
- Storage utilization
Alerting¶
Triggers:
- Data quality failures
- Processing delays
- Storage capacity warnings
- Query performance degradation
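Each trigger above can be expressed as a threshold check against a collected metric. The sketch below shows one possible shape for that evaluation; every threshold value and metric name in it is a placeholder chosen for the example, not a platform setting.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; tune these to the workload.
MIN_QUALITY_SCORE = 0.95
MAX_STALENESS = timedelta(hours=6)
MAX_STORAGE_USED_FRACTION = 0.85
MAX_QUERY_P95_SECONDS = 30.0


def evaluate_alerts(metrics: dict, now: datetime) -> list[str]:
    """Return the names of all triggers that fire for the given metrics."""
    alerts = []
    if metrics["quality_score"] < MIN_QUALITY_SCORE:
        alerts.append("data_quality_failure")
    if now - metrics["last_successful_run"] > MAX_STALENESS:
        alerts.append("processing_delay")
    if metrics["storage_used_fraction"] > MAX_STORAGE_USED_FRACTION:
        alerts.append("storage_capacity_warning")
    if metrics["query_p95_seconds"] > MAX_QUERY_P95_SECONDS:
        alerts.append("query_performance_degradation")
    return alerts


now = datetime(2025, 10, 3, 12, 0, tzinfo=timezone.utc)
metrics = {
    "quality_score": 0.99,
    "last_successful_run": datetime(2025, 10, 3, 2, 0, tzinfo=timezone.utc),
    "storage_used_fraction": 0.90,
    "query_p95_seconds": 12.0,
}
alerts = evaluate_alerts(metrics, now)
print(alerts)
```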
Best Practices¶
Data Modeling¶
- Use appropriate data types
- Implement proper partitioning
- Design for query patterns
- Maintain data lineage
Performance¶
- Optimize file sizes
- Use columnar formats
- Implement caching strategies
- Monitor query performance
Security¶
- Implement access controls
- Encrypt sensitive data
- Audit data access
- Regular security reviews
Created: October 3, 2025