ποΈ Architecture DocumentationΒΆ
System OverviewΒΆ
The Iceberg Data Engineering Platform implements a modern lakehouse architecture that combines the scalability of data lakes with the performance and reliability of data warehouses. The system is designed to handle multiple data domains (real estate, flight radar, e-commerce) with a unified processing pipeline.
High-Level ArchitectureΒΆ
graph TB
subgraph "Data Sources"
A1[Real Estate APIs]
A2[FlightRadar24 API]
A3[Mock E-commerce Data]
end
subgraph "Orchestration Layer"
B[Dagster Workflows]
end
subgraph "Data Ingestion Layer"
C1[Asset Property Ingestor]
C2[Flight Radar Ingestor]
C3[E-commerce Ingestor]
end
subgraph "Bronze Layer Storage"
D[MinIO Bronze Storage]
E[S3 Buckets: datalake/bronze]
end
subgraph "Data Preparation Layer"
F[Apache Spark Processing]
G[Data Preparation Ops]
end
subgraph "Prepare Layer Storage"
H[MinIO Prepare Storage]
I[S3 Buckets: datalake/prepare]
J[Delta Tables]
end
subgraph "Data Transformation Layer"
K[dbt Transformations]
L[Success Sensors]
end
subgraph "Warehouse Layer Storage"
O[MinIO Warehouse Storage]
P[S3 Buckets: warehouse]
Q[Iceberg Tables]
end
subgraph "Analytics Layer"
M[Trino SQL Engine]
N[Apache Superset]
R[Hue SQL Interface]
end
subgraph "Metadata & Security"
S[Hive Metastore]
T[Apache Ranger]
U[PostgreSQL]
end
A1 --> B
A2 --> B
A3 --> B
B --> C1
B --> C2
B --> C3
C1 --> D
C2 --> D
C3 --> D
D --> E
E --> F
F --> G
G --> H
H --> I
I --> J
J --> L
L --> K
K --> O
O --> P
P --> Q
Q --> M
M --> N
M --> R
S --> Q
T --> M
U --> S
Component DetailsΒΆ
1. Data Sources LayerΒΆ
Real Estate DataΒΆ
- Source: External real estate APIs
- Format: JSON/CSV
- Frequency: Batch processing
- Volume: Medium (thousands of records)
Flight Radar DataΒΆ
- Source: FlightRadar24 API
- Format: JSON
- Frequency: Near real-time
- Volume: High (millions of records)
E-commerce DataΒΆ
- Source: Mock data generation
- Format: JSON/CSV
- Frequency: Batch processing
- Volume: Medium (hundreds of thousands of records)
2. Orchestration LayerΒΆ
DagsterΒΆ
- Purpose: Workflow orchestration and monitoring
- Features:
- Asset-based data lineage
- Dependency management
- Error handling and retries
- Monitoring and alerting
- Configuration:
workspace.yaml
3. Data Ingestion LayerΒΆ
Custom Python ModulesΒΆ
- Location:
orchestration/ingestion/src/ - Components:
- Domain-specific ingestors (flight_radar, asset_property, user_product_mock)
- Data validation using Pydantic models
- Schema processing and data quality checks
- Multi-format output support (CSV, JSON, Parquet)
Data QualityΒΆ
- Schema validation using Pydantic models
- Data type checking and conversion
- Null value handling and data cleaning
- Duplicate detection and deduplication
Output FormatsΒΆ
- CSV: Structured tabular data
- JSON: Semi-structured and nested data
- Parquet: Columnar format for efficient storage and querying
4. Bronze Layer StorageΒΆ
MinIO Object StorageΒΆ
- Purpose: Raw data storage in bronze layer
- Location:
datalake/bronze/bucket - Formats: CSV, JSON, Parquet files
- Features:
- Versioning for data lineage
- Lifecycle policies for data retention
- Access control and security
Data OrganizationΒΆ
datalake/bronze/
βββ flight_radar/
β βββ airlines/
β βββ airports/
β βββ aircraft/
β βββ flights/
βββ asset_property/
β βββ real_estate_sales/
βββ ecommerce/
βββ products/
βββ customers/
5. Data Preparation LayerΒΆ
Apache Spark ProcessingΒΆ
- Mode: Distributed processing
- Configuration:
- Master: 1 node
- Workers: 3 nodes
- Memory: 1GB per worker
- Features:
- DataFrame API for data processing
- SQL queries for complex transformations
- Delta Lake integration for ACID transactions
Data Preparation OperationsΒΆ
- Data Cleaning: Remove duplicates, handle nulls, standardize formats
- Schema Evolution: Handle schema changes and data type conversions
- Data Enrichment: Add computed fields and business logic
- Quality Checks: Validate data integrity and completeness
6. Prepare Layer StorageΒΆ
MinIO Prepare StorageΒΆ
- Purpose: Processed data storage in prepare layer
- Location:
datalake/prepare/bucket - Format: Delta tables
- Features:
- ACID transactions
- Schema evolution
- Time travel capabilities
- Partitioning for performance
Delta Tables StructureΒΆ
datalake/prepare/
βββ flight_radar_prepared/
β βββ airline/
β βββ airport/
β βββ aircraft/
β βββ complete_flight/
β βββ missing_flight/
βββ asset_property_prepared/
β βββ real_estate_sales/
βββ ecommerce_prepared/
βββ product/
βββ customer/
7. Data Transformation LayerΒΆ
dbt TransformationsΒΆ
- Purpose: Business logic and analytical model creation
- Structure:
- Sources: Raw Delta table definitions
- Staging: Data cleaning and standardization
- Intermediate: Business logic transformations
- Marts: Final analytical tables
Success Sensor IntegrationΒΆ
- Trigger: Preparation operations completion
- Action: Automatic dbt job execution
- Monitoring: Pipeline success/failure tracking
- Retry Logic: Automatic retry on failures
8. Warehouse Layer StorageΒΆ
MinIO Warehouse StorageΒΆ
- Purpose: Final Iceberg tables generated by dbt transformations
- Location:
warehouse/bucket - Formats: Apache Iceberg tables
- Features:
- ACID transactions
- Schema evolution
- Time travel capabilities
- Partitioning and clustering
Data OrganizationΒΆ
warehouse/
βββ flight_radar/
β βββ intermediate/
β β βββ explode_complete_flight/
β βββ marts/
β βββ company_with_the_most_active_flight/
β βββ airport_with_the_largest_traffic_imbalance/
βββ ecommerce/
β βββ intermediate/
β β βββ customer_order/
β βββ marts/
β βββ customer_analytics/
βββ asset_property/
βββ intermediate/
β βββ stg_real_estate_sales/
βββ marts/
βββ property_market_trends/
9. Analytics LayerΒΆ
Trino Query EngineΒΆ
- Purpose: Distributed SQL querying on Delta and Iceberg tables
- Features:
- Delta Lake connector for preparation tasks (
spark_catalog) - Iceberg connector for analytical models (domain-specific catalogs)
- S3/MinIO integration
- Hive Metastore integration
- Ranger security integration
- Catalog Structure:
spark_catalog: Access to Delta tables from preparation tasksflight_radar: Iceberg tables for flight analyticsecommerce: Iceberg tables for e-commerce analyticsasset_property: Iceberg tables for property analytics- Performance: Sub-second query response for analytical workloads
Apache SupersetΒΆ
- Purpose: Business intelligence and visualization
- Features:
- Interactive dashboards
- SQL Lab for ad-hoc queries
- Chart types and visualizations
- User management and permissions
HueΒΆ
- Purpose: SQL query interface
- Features:
- Query editor
- Query history
- Result visualization
- User-friendly interface
9. Metadata & SecurityΒΆ
Hive MetastoreΒΆ
- Purpose: Table catalog and metadata management
- Storage: PostgreSQL database
- Features:
- Table schema management
- Partition information
- Statistics and metadata
Apache RangerΒΆ
- Purpose: Security and access control
- Features:
- Fine-grained access control
- Policy management
- Audit logging
- Integration with Trino
PostgreSQLΒΆ
- Purpose: Metadata and configuration storage
- Databases:
hive_metastore: Hive metadataranger: Ranger policieshue: Hue configuration
Data Flow PatternsΒΆ
1. Complete Data Pipeline PatternΒΆ
sequenceDiagram
participant API as External API
participant Dagster as Dagster Orchestrator
participant Ingestor as Data Ingestor
participant Bronze as MinIO Bronze
participant Prep as Data Preparation
participant Prepare as MinIO Prepare
participant Delta as Delta Tables
participant Sensor as Success Sensor
participant dbt as dbt Transformations
participant Analytics as Analytical Models
API->>Dagster: Trigger ingestion workflow
Dagster->>Ingestor: Execute ingestion operation
Ingestor->>API: Fetch data from external source
API-->>Ingestor: Return raw data
Ingestor->>Bronze: Store CSV/JSON/Parquet files
Bronze-->>Ingestor: Confirm storage
Ingestor-->>Dagster: Ingestion completed
Dagster->>Prep: Trigger preparation operation
Prep->>Bronze: Read raw data files
Bronze-->>Prep: Return raw data
Prep->>Prep: Process and clean data
Prep->>Prepare: Store as Delta tables
Prepare-->>Prep: Confirm storage
Prep-->>Dagster: Preparation completed
Dagster->>Sensor: Notify preparation success
Sensor->>dbt: Trigger dbt transformations
dbt->>Delta: Read Delta tables
Delta-->>dbt: Return prepared data
dbt->>dbt: Apply business logic
dbt->>Analytics: Create analytical models
Analytics-->>dbt: Confirm model creation
dbt-->>Sensor: Transformations completed
2. Bronze Layer Data FlowΒΆ
sequenceDiagram
participant Ingestor as Data Ingestor
participant Validator as Data Validator
participant Formatter as Data Formatter
participant Bronze as MinIO Bronze Storage
Ingestor->>Validator: Raw data from API
Validator->>Validator: Validate schema and quality
Validator-->>Ingestor: Validation results
alt Data is valid
Ingestor->>Formatter: Format data (CSV/JSON/Parquet)
Formatter->>Bronze: Store in datalake/bronze/
Bronze-->>Formatter: Confirm storage
Formatter-->>Ingestor: Storage successful
else Data is invalid
Ingestor->>Ingestor: Log error and skip
end
3. Prepare Layer Data FlowΒΆ
sequenceDiagram
participant Prep as Data Preparation
participant Bronze as MinIO Bronze
participant Spark as Apache Spark
participant Processor as Data Processor
participant Prepare as MinIO Prepare Storage
participant Delta as Delta Tables
Prep->>Bronze: Read raw data files
Bronze-->>Prep: Return CSV/JSON/Parquet files
Prep->>Spark: Initialize Spark session
Spark->>Processor: Process raw data
Processor->>Processor: Clean and transform data
Processor->>Processor: Apply business logic
Processor->>Prepare: Write Delta tables
Prepare->>Delta: Create Delta table metadata
Delta-->>Prepare: Confirm table creation
Prepare-->>Processor: Storage successful
Processor-->>Prep: Preparation completed
Scalability ConsiderationsΒΆ
Horizontal ScalingΒΆ
- Spark Workers: Add more worker nodes
- Trino Workers: Scale query processing
- MinIO: Distributed storage across nodes
Vertical ScalingΒΆ
- Memory: Increase worker memory allocation
- CPU: Add more cores per worker
- Storage: Increase MinIO storage capacity
Performance OptimizationΒΆ
- Partitioning: Optimize Iceberg table partitioning
- Caching: Implement query result caching
- Indexing: Use appropriate indexes for frequent queries
Security ArchitectureΒΆ
AuthenticationΒΆ
- MinIO: Access key/secret key
- PostgreSQL: Username/password
- Ranger: Admin credentials
- Superset: User accounts
AuthorizationΒΆ
- Ranger Policies: Fine-grained access control
- MinIO Policies: Bucket-level permissions
- Superset Roles: Dashboard and data access
Data ProtectionΒΆ
- Encryption: Data at rest and in transit
- Audit Logging: All access attempts logged
- Backup: Regular data backups
Monitoring and ObservabilityΒΆ
System MetricsΒΆ
- Resource Usage: CPU, memory, disk
- Query Performance: Response times, throughput
- Data Quality: Record counts, validation failures
Business MetricsΒΆ
- Data Freshness: Time since last update
- Pipeline Success: Success/failure rates
- User Activity: Query patterns, dashboard usage
AlertingΒΆ
- System Alerts: Resource thresholds
- Data Alerts: Quality issues, pipeline failures
- Business Alerts: SLA violations
Disaster RecoveryΒΆ
Backup StrategyΒΆ
- Database Backups: PostgreSQL regular backups
- Data Backups: MinIO data replication
- Configuration Backups: Docker compose and configs
Recovery ProceduresΒΆ
- Point-in-time Recovery: Using Iceberg time travel
- Full System Recovery: Docker compose restart
- Data Recovery: MinIO data restoration
Future EnhancementsΒΆ
Planned ImprovementsΒΆ
- Streaming Data: Apache Kafka integration
- ML Pipeline: Machine learning workflows
- Data Quality: Great Expectations integration
- Cloud Deployment: GCP/AWS/Azure support
Technology UpgradesΒΆ
- Spark: Latest version adoption
- Iceberg: New feature utilization
- Trino: Performance optimizations
- Superset: Enhanced visualizations
Last update:
October 3, 2025
Created: October 3, 2025
Created: October 3, 2025