Skip to content

πŸ—οΈ Architecture DocumentationΒΆ

System OverviewΒΆ

The Iceberg Data Engineering Platform implements a modern lakehouse architecture that combines the scalability of data lakes with the performance and reliability of data warehouses. The system is designed to handle multiple data domains (real estate, flight radar, e-commerce) with a unified processing pipeline.

High-Level ArchitectureΒΆ

graph TB
    subgraph "Data Sources"
        A1[Real Estate APIs]
        A2[FlightRadar24 API]
        A3[Mock E-commerce Data]
    end

    subgraph "Orchestration Layer"
        B[Dagster Workflows]
    end

    subgraph "Data Ingestion Layer"
        C1[Asset Property Ingestor]
        C2[Flight Radar Ingestor]
        C3[E-commerce Ingestor]
    end

    subgraph "Bronze Layer Storage"
        D[MinIO Bronze Storage]
        E[S3 Buckets: datalake/bronze]
    end

    subgraph "Data Preparation Layer"
        F[Apache Spark Processing]
        G[Data Preparation Ops]
    end

    subgraph "Prepare Layer Storage"
        H[MinIO Prepare Storage]
        I[S3 Buckets: datalake/prepare]
        J[Delta Tables]
    end

    subgraph "Data Transformation Layer"
        K[dbt Transformations]
        L[Success Sensors]
    end

    subgraph "Warehouse Layer Storage"
        O[MinIO Warehouse Storage]
        P[S3 Buckets: warehouse]
        Q[Iceberg Tables]
    end

    subgraph "Analytics Layer"
        M[Trino SQL Engine]
        N[Apache Superset]
        R[Hue SQL Interface]
    end

    subgraph "Metadata & Security"
        S[Hive Metastore]
        T[Apache Ranger]
        U[PostgreSQL]
    end

    A1 --> B
    A2 --> B
    A3 --> B
    B --> C1
    B --> C2
    B --> C3
    C1 --> D
    C2 --> D
    C3 --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> J
    J --> L
    L --> K
    K --> O
    O --> P
    P --> Q
    Q --> M
    M --> N
    M --> R
    S --> Q
    T --> M
    U --> S

Component DetailsΒΆ

1. Data Sources LayerΒΆ

Real Estate DataΒΆ

  • Source: External real estate APIs
  • Format: JSON/CSV
  • Frequency: Batch processing
  • Volume: Medium (thousands of records)

Flight Radar DataΒΆ

  • Source: FlightRadar24 API
  • Format: JSON
  • Frequency: Near real-time
  • Volume: High (millions of records)

E-commerce DataΒΆ

  • Source: Mock data generation
  • Format: JSON/CSV
  • Frequency: Batch processing
  • Volume: Medium (hundreds of thousands of records)

2. Orchestration LayerΒΆ

DagsterΒΆ

  • Purpose: Workflow orchestration and monitoring
  • Features:
  • Asset-based data lineage
  • Dependency management
  • Error handling and retries
  • Monitoring and alerting
  • Configuration: workspace.yaml

3. Data Ingestion LayerΒΆ

Custom Python ModulesΒΆ

  • Location: orchestration/ingestion/src/
  • Components:
  • Domain-specific ingestors (flight_radar, asset_property, user_product_mock)
  • Data validation using Pydantic models
  • Schema processing and data quality checks
  • Multi-format output support (CSV, JSON, Parquet)

Data QualityΒΆ

  • Schema validation using Pydantic models
  • Data type checking and conversion
  • Null value handling and data cleaning
  • Duplicate detection and deduplication

Output FormatsΒΆ

  • CSV: Structured tabular data
  • JSON: Semi-structured and nested data
  • Parquet: Columnar format for efficient storage and querying

4. Bronze Layer StorageΒΆ

MinIO Object StorageΒΆ

  • Purpose: Raw data storage in bronze layer
  • Location: datalake/bronze/ bucket
  • Formats: CSV, JSON, Parquet files
  • Features:
  • Versioning for data lineage
  • Lifecycle policies for data retention
  • Access control and security

Data OrganizationΒΆ

datalake/bronze/
β”œβ”€β”€ flight_radar/
β”‚   β”œβ”€β”€ airlines/
β”‚   β”œβ”€β”€ airports/
β”‚   β”œβ”€β”€ aircraft/
β”‚   └── flights/
β”œβ”€β”€ asset_property/
β”‚   └── real_estate_sales/
└── ecommerce/
    β”œβ”€β”€ products/
    └── customers/

5. Data Preparation LayerΒΆ

Apache Spark ProcessingΒΆ

  • Mode: Distributed processing
  • Configuration:
  • Master: 1 node
  • Workers: 3 nodes
  • Memory: 1GB per worker
  • Features:
  • DataFrame API for data processing
  • SQL queries for complex transformations
  • Delta Lake integration for ACID transactions

Data Preparation OperationsΒΆ

  • Data Cleaning: Remove duplicates, handle nulls, standardize formats
  • Schema Evolution: Handle schema changes and data type conversions
  • Data Enrichment: Add computed fields and business logic
  • Quality Checks: Validate data integrity and completeness

6. Prepare Layer StorageΒΆ

MinIO Prepare StorageΒΆ

  • Purpose: Processed data storage in prepare layer
  • Location: datalake/prepare/ bucket
  • Format: Delta tables
  • Features:
  • ACID transactions
  • Schema evolution
  • Time travel capabilities
  • Partitioning for performance

Delta Tables StructureΒΆ

datalake/prepare/
β”œβ”€β”€ flight_radar_prepared/
β”‚   β”œβ”€β”€ airline/
β”‚   β”œβ”€β”€ airport/
β”‚   β”œβ”€β”€ aircraft/
β”‚   β”œβ”€β”€ complete_flight/
β”‚   └── missing_flight/
β”œβ”€β”€ asset_property_prepared/
β”‚   └── real_estate_sales/
└── ecommerce_prepared/
    β”œβ”€β”€ product/
    └── customer/

7. Data Transformation LayerΒΆ

dbt TransformationsΒΆ

  • Purpose: Business logic and analytical model creation
  • Structure:
  • Sources: Raw Delta table definitions
  • Staging: Data cleaning and standardization
  • Intermediate: Business logic transformations
  • Marts: Final analytical tables

Success Sensor IntegrationΒΆ

  • Trigger: Preparation operations completion
  • Action: Automatic dbt job execution
  • Monitoring: Pipeline success/failure tracking
  • Retry Logic: Automatic retry on failures

8. Warehouse Layer StorageΒΆ

MinIO Warehouse StorageΒΆ

  • Purpose: Final Iceberg tables generated by dbt transformations
  • Location: warehouse/ bucket
  • Formats: Apache Iceberg tables
  • Features:
  • ACID transactions
  • Schema evolution
  • Time travel capabilities
  • Partitioning and clustering

Data OrganizationΒΆ

warehouse/
β”œβ”€β”€ flight_radar/
β”‚   β”œβ”€β”€ intermediate/
β”‚   β”‚   └── explode_complete_flight/
β”‚   └── marts/
β”‚       β”œβ”€β”€ company_with_the_most_active_flight/
β”‚       └── airport_with_the_largest_traffic_imbalance/
β”œβ”€β”€ ecommerce/
β”‚   β”œβ”€β”€ intermediate/
β”‚   β”‚   └── customer_order/
β”‚   └── marts/
β”‚       └── customer_analytics/
└── asset_property/
    β”œβ”€β”€ intermediate/
    β”‚   └── stg_real_estate_sales/
    └── marts/
        └── property_market_trends/

9. Analytics LayerΒΆ

Trino Query EngineΒΆ

  • Purpose: Distributed SQL querying on Delta and Iceberg tables
  • Features:
  • Delta Lake connector for preparation tasks (spark_catalog)
  • Iceberg connector for analytical models (domain-specific catalogs)
  • S3/MinIO integration
  • Hive Metastore integration
  • Ranger security integration
  • Catalog Structure:
  • spark_catalog: Access to Delta tables from preparation tasks
  • flight_radar: Iceberg tables for flight analytics
  • ecommerce: Iceberg tables for e-commerce analytics
  • asset_property: Iceberg tables for property analytics
  • Performance: Sub-second query response for analytical workloads

Apache SupersetΒΆ

  • Purpose: Business intelligence and visualization
  • Features:
  • Interactive dashboards
  • SQL Lab for ad-hoc queries
  • Chart types and visualizations
  • User management and permissions

HueΒΆ

  • Purpose: SQL query interface
  • Features:
  • Query editor
  • Query history
  • Result visualization
  • User-friendly interface

9. Metadata & SecurityΒΆ

Hive MetastoreΒΆ

  • Purpose: Table catalog and metadata management
  • Storage: PostgreSQL database
  • Features:
  • Table schema management
  • Partition information
  • Statistics and metadata

Apache RangerΒΆ

  • Purpose: Security and access control
  • Features:
  • Fine-grained access control
  • Policy management
  • Audit logging
  • Integration with Trino

PostgreSQLΒΆ

  • Purpose: Metadata and configuration storage
  • Databases:
  • hive_metastore: Hive metadata
  • ranger: Ranger policies
  • hue: Hue configuration

Data Flow PatternsΒΆ

1. Complete Data Pipeline PatternΒΆ

sequenceDiagram
    participant API as External API
    participant Dagster as Dagster Orchestrator
    participant Ingestor as Data Ingestor
    participant Bronze as MinIO Bronze
    participant Prep as Data Preparation
    participant Prepare as MinIO Prepare
    participant Delta as Delta Tables
    participant Sensor as Success Sensor
    participant dbt as dbt Transformations
    participant Analytics as Analytical Models

    API->>Dagster: Trigger ingestion workflow
    Dagster->>Ingestor: Execute ingestion operation
    Ingestor->>API: Fetch data from external source
    API-->>Ingestor: Return raw data
    Ingestor->>Bronze: Store CSV/JSON/Parquet files
    Bronze-->>Ingestor: Confirm storage
    Ingestor-->>Dagster: Ingestion completed

    Dagster->>Prep: Trigger preparation operation
    Prep->>Bronze: Read raw data files
    Bronze-->>Prep: Return raw data
    Prep->>Prep: Process and clean data
    Prep->>Prepare: Store as Delta tables
    Prepare-->>Prep: Confirm storage
    Prep-->>Dagster: Preparation completed

    Dagster->>Sensor: Notify preparation success
    Sensor->>dbt: Trigger dbt transformations
    dbt->>Delta: Read Delta tables
    Delta-->>dbt: Return prepared data
    dbt->>dbt: Apply business logic
    dbt->>Analytics: Create analytical models
    Analytics-->>dbt: Confirm model creation
    dbt-->>Sensor: Transformations completed

2. Bronze Layer Data FlowΒΆ

sequenceDiagram
    participant Ingestor as Data Ingestor
    participant Validator as Data Validator
    participant Formatter as Data Formatter
    participant Bronze as MinIO Bronze Storage

    Ingestor->>Validator: Raw data from API
    Validator->>Validator: Validate schema and quality
    Validator-->>Ingestor: Validation results

    alt Data is valid
        Ingestor->>Formatter: Format data (CSV/JSON/Parquet)
        Formatter->>Bronze: Store in datalake/bronze/
        Bronze-->>Formatter: Confirm storage
        Formatter-->>Ingestor: Storage successful
    else Data is invalid
        Ingestor->>Ingestor: Log error and skip
    end

3. Prepare Layer Data FlowΒΆ

sequenceDiagram
    participant Prep as Data Preparation
    participant Bronze as MinIO Bronze
    participant Spark as Apache Spark
    participant Processor as Data Processor
    participant Prepare as MinIO Prepare Storage
    participant Delta as Delta Tables

    Prep->>Bronze: Read raw data files
    Bronze-->>Prep: Return CSV/JSON/Parquet files
    Prep->>Spark: Initialize Spark session
    Spark->>Processor: Process raw data
    Processor->>Processor: Clean and transform data
    Processor->>Processor: Apply business logic
    Processor->>Prepare: Write Delta tables
    Prepare->>Delta: Create Delta table metadata
    Delta-->>Prepare: Confirm table creation
    Prepare-->>Processor: Storage successful
    Processor-->>Prep: Preparation completed

Scalability ConsiderationsΒΆ

Horizontal ScalingΒΆ

  • Spark Workers: Add more worker nodes
  • Trino Workers: Scale query processing
  • MinIO: Distributed storage across nodes

Vertical ScalingΒΆ

  • Memory: Increase worker memory allocation
  • CPU: Add more cores per worker
  • Storage: Increase MinIO storage capacity

Performance OptimizationΒΆ

  • Partitioning: Optimize Iceberg table partitioning
  • Caching: Implement query result caching
  • Indexing: Use appropriate indexes for frequent queries

Security ArchitectureΒΆ

AuthenticationΒΆ

  • MinIO: Access key/secret key
  • PostgreSQL: Username/password
  • Ranger: Admin credentials
  • Superset: User accounts

AuthorizationΒΆ

  • Ranger Policies: Fine-grained access control
  • MinIO Policies: Bucket-level permissions
  • Superset Roles: Dashboard and data access

Data ProtectionΒΆ

  • Encryption: Data at rest and in transit
  • Audit Logging: All access attempts logged
  • Backup: Regular data backups

Monitoring and ObservabilityΒΆ

System MetricsΒΆ

  • Resource Usage: CPU, memory, disk
  • Query Performance: Response times, throughput
  • Data Quality: Record counts, validation failures

Business MetricsΒΆ

  • Data Freshness: Time since last update
  • Pipeline Success: Success/failure rates
  • User Activity: Query patterns, dashboard usage

AlertingΒΆ

  • System Alerts: Resource thresholds
  • Data Alerts: Quality issues, pipeline failures
  • Business Alerts: SLA violations

Disaster RecoveryΒΆ

Backup StrategyΒΆ

  • Database Backups: PostgreSQL regular backups
  • Data Backups: MinIO data replication
  • Configuration Backups: Docker compose and configs

Recovery ProceduresΒΆ

  • Point-in-time Recovery: Using Iceberg time travel
  • Full System Recovery: Docker compose restart
  • Data Recovery: MinIO data restoration

Future EnhancementsΒΆ

Planned ImprovementsΒΆ

  • Streaming Data: Apache Kafka integration
  • ML Pipeline: Machine learning workflows
  • Data Quality: Great Expectations integration
  • Cloud Deployment: GCP/AWS/Azure support

Technology UpgradesΒΆ

  • Spark: Latest version adoption
  • Iceberg: New feature utilization
  • Trino: Performance optimizations
  • Superset: Enhanced visualizations

Last update: October 3, 2025
Created: October 3, 2025