Data Flow Architecture

This document describes the data flow architecture of the Iceberg Data Engineering Platform, detailing how data moves through the system from ingestion to analytics.

Overview

The platform implements a lakehouse architecture with three layers, a layout commonly called the medallion pattern:

  1. Bronze Layer: Raw data ingestion and storage
  2. Silver Layer: Data preparation and cleaning
  3. Gold Layer: Analytics-ready data with business logic

Data Flow Diagram

graph TD
    A[External APIs] --> B[Dagster Orchestration]
    B --> C[Data Ingestion Ops]
    C --> D[MinIO Bronze Layer]
    D --> E[Data Preparation Ops]
    E --> F[MinIO Silver Layer]
    F --> G[Delta Tables]
    G --> H[Success Sensor]
    H --> I[dbt Transformations]
    I --> J[MinIO Gold Layer]
    J --> K[Iceberg Tables]

    L[Hive Metastore] --> G
    M[Apache Ranger] --> I
    N[PostgreSQL] --> L

    subgraph "Storage Buckets"
        O[datalake/bronze: Raw Data]
        P[datalake/silver: Delta Tables]
        Q[warehouse: Iceberg Tables]
    end

    subgraph "Trino Catalogs"
        R[spark_catalog: Preparation Tasks]
        S[flight_radar: Flight Analytics]
        T[ecommerce: E-commerce Analytics]
        U[asset_property: Property Analytics]
    end

    D --> O
    F --> P
    J --> Q

    G --> R
    K --> S
    K --> T
    K --> U

Data Layers

Bronze Layer (Raw Data)

Purpose: Store raw, unprocessed data from external sources

Characteristics:
  - Immutable data storage
  - Multiple file formats (CSV, JSON, Parquet)
  - Schema-on-read approach
  - Data lineage tracking

Storage: MinIO datalake/bronze bucket

Data Sources:
  - Real estate APIs
  - FlightRadar24 API
  - E-commerce mock data
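One way to get the immutability described above is to write every raw payload under a unique, timestamped object key, so a later ingestion never overwrites an earlier one. The following is a minimal sketch, not the platform's actual layout: the key format and the `bronze_object_key` helper are assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone


def bronze_object_key(domain: str, payload: bytes, ingested_at: datetime, ext: str) -> str:
    """Build a write-once object key for a raw payload (hypothetical layout)."""
    ts = ingested_at.astimezone(timezone.utc)
    # A short content hash keeps keys unique even when two payloads
    # arrive within the same second.
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return (
        f"bronze/{domain}/ingest_date={ts:%Y-%m-%d}/"
        f"{ts:%Y%m%dT%H%M%SZ}_{digest}.{ext}"
    )


key = bronze_object_key(
    "flight_radar",
    b'{"flight": "XY123"}',
    datetime(2025, 10, 3, 12, 0, 0, tzinfo=timezone.utc),
    "json",
)
print(key)
```

Because the key depends only on the domain, the ingestion time, and the payload bytes, retrying the same write is idempotent.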

Silver Layer (Prepared Data)

Purpose: Clean, validate, and standardize data

Characteristics:
  - Data quality checks
  - Schema validation
  - Data type standardization
  - Duplicate removal

Storage: MinIO datalake/silver bucket with Delta Lake format

Processing: Apache Spark for distributed data processing
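The Silver-layer steps of type standardization and duplicate removal can be illustrated without a cluster. The sketch below uses plain Python as a stand-in for the Spark job, which this document does not show; the column names (`order_id`, `customer_id`, `order_date`, `amount`) are hypothetical.

```python
from datetime import date


def standardize(record: dict) -> dict:
    """Coerce raw string fields to standard types (hypothetical columns)."""
    return {
        "order_id": int(record["order_id"]),
        "customer_id": int(record["customer_id"]),
        "order_date": date.fromisoformat(record["order_date"].strip()),
        "amount": round(float(record["amount"]), 2),
    }


def prepare(raw_records: list[dict]) -> list[dict]:
    """Standardize types, then keep the first record seen per order_id."""
    seen: set[int] = set()
    prepared = []
    for raw in raw_records:
        row = standardize(raw)
        if row["order_id"] in seen:
            continue  # duplicate removal
        seen.add(row["order_id"])
        prepared.append(row)
    return prepared


raw = [
    {"order_id": "1", "customer_id": "10", "order_date": "2025-10-01", "amount": "19.990"},
    {"order_id": "1", "customer_id": "10", "order_date": "2025-10-01", "amount": "19.990"},
    {"order_id": "2", "customer_id": "11", "order_date": " 2025-10-02 ", "amount": "5"},
]
silver = prepare(raw)
```

In Spark, the same two steps would typically be column casts followed by a deduplication on the key column.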

Gold Layer (Analytics Data)

Purpose: Business-ready data for analytics and reporting

Characteristics:
  - Business logic applied
  - Aggregated metrics
  - Optimized for query performance
  - ACID transactions

Storage: MinIO warehouse bucket with Apache Iceberg format

Processing: dbt transformations

Data Domains

Asset Property Domain

Data Flow:
  1. Real estate API → Bronze layer (CSV/JSON)
  2. Data cleaning → Silver layer (Delta tables)
  3. Business transformations → Gold layer (Iceberg tables)

Key Tables:
  - property_sales: Individual property transactions
  - market_trends: Aggregated market metrics
  - property_features: Property characteristics
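To make the relationship between `property_sales` and `market_trends` concrete, the sketch below aggregates individual sales into per-region, per-month metrics. It is a plain-Python stand-in for the dbt model, which is not shown in this document; the columns (`region`, `sale_date`, `price`) and the chosen metrics are assumptions.

```python
from collections import defaultdict
from statistics import median


def market_trends(property_sales: list[dict]) -> list[dict]:
    """Aggregate individual sales into per-region, per-month metrics."""
    groups: dict[tuple[str, str], list[float]] = defaultdict(list)
    for sale in property_sales:
        month = sale["sale_date"][:7]  # "YYYY-MM" prefix of an ISO date string
        groups[(sale["region"], month)].append(sale["price"])
    return [
        {
            "region": region,
            "month": month,
            "sales_count": len(prices),
            "median_price": median(prices),
        }
        for (region, month), prices in sorted(groups.items())
    ]


sales = [
    {"region": "north", "sale_date": "2025-09-14", "price": 300_000.0},
    {"region": "north", "sale_date": "2025-09-20", "price": 500_000.0},
    {"region": "south", "sale_date": "2025-09-02", "price": 250_000.0},
]
trends = market_trends(sales)
```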

Flight Radar Domain

Data Flow:
  1. FlightRadar24 API → Bronze layer (JSON)
  2. Data normalization → Silver layer (Delta tables)
  3. Flight analytics → Gold layer (Iceberg tables)

Key Tables:
  - airlines: Airline information
  - airports: Airport data
  - flights: Flight tracking data
  - routes: Flight routes and schedules

E-commerce Domain

Data Flow:
  1. Mock data generation → Bronze layer (CSV)
  2. Customer data processing → Silver layer (Delta tables)
  3. Business metrics → Gold layer (Iceberg tables)

Key Tables:
  - customers: Customer information
  - orders: Order transactions
  - products: Product catalog
  - order_items: Order line items

Orchestration

Dagster Workflows

Asset Property Pipeline:

from dagster import asset

@asset
def asset_property_bronze():
    # Ingest real estate data
    pass

@asset(deps=[asset_property_bronze])
def asset_property_silver():
    # Clean and prepare data
    pass

@asset(deps=[asset_property_silver])
def asset_property_gold():
    # Apply business transformations
    pass

Flight Radar Pipeline:

from dagster import asset

@asset
def flight_radar_bronze():
    # Ingest flight data
    pass

@asset(deps=[flight_radar_bronze])
def flight_radar_silver():
    # Process flight data
    pass

@asset(deps=[flight_radar_silver])
def flight_radar_gold():
    # Create analytics tables
    pass
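The `deps=[...]` arguments above define a dependency graph, and Dagster derives the execution order from it. To illustrate just that ordering, the sketch below uses Python's standard-library `graphlib` rather than Dagster itself; the dictionary simply restates the Flight Radar dependencies.

```python
from graphlib import TopologicalSorter

# Each asset maps to the set of assets it depends on,
# mirroring the deps=[...] arguments in the pipeline above.
dependencies = {
    "flight_radar_bronze": set(),
    "flight_radar_silver": {"flight_radar_bronze"},
    "flight_radar_gold": {"flight_radar_silver"},
}

# static_order() yields every node after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because the chain is linear, there is only one valid order: bronze, then silver, then gold.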

Data Quality

Validation Rules:
  - Schema validation
  - Data type checks
  - Range validation
  - Completeness checks
  - Uniqueness constraints
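The five rule types above can be expressed as one pass over a batch of rows. This is a minimal sketch under assumptions: the `validate` function, its signature, and the example columns (`flight_id`, `altitude_ft`) are hypothetical and do not describe the platform's actual checks.

```python
def validate(rows: list[dict], schema: dict[str, type], key: str,
             ranges: dict[str, tuple]) -> list[str]:
    """Return rule violations; an empty list means the batch passed."""
    errors = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Schema validation: the row must have exactly the expected columns.
        if set(row) != set(schema):
            errors.append(f"row {i}: schema mismatch")
            continue
        for col, expected in schema.items():
            value = row[col]
            if value is None:
                errors.append(f"row {i}: {col} is missing")  # completeness
            elif not isinstance(value, expected):
                errors.append(f"row {i}: {col} is not {expected.__name__}")  # data type
            elif col in ranges and not ranges[col][0] <= value <= ranges[col][1]:
                errors.append(f"row {i}: {col} out of range")  # range
        key_value = row[key]
        if key_value is not None:
            if key_value in seen_keys:
                errors.append(f"row {i}: duplicate {key}")  # uniqueness
            seen_keys.add(key_value)
    return errors


rows = [
    {"flight_id": "A1", "altitude_ft": 35000},
    {"flight_id": "A1", "altitude_ft": -500},
    {"flight_id": None, "altitude_ft": 12000},
    {"flight_id": "B2"},
]
errors = validate(
    rows,
    schema={"flight_id": str, "altitude_ft": int},
    key="flight_id",
    ranges={"altitude_ft": (0, 60000)},
)
```

Returning every violation, rather than stopping at the first, makes it possible to derive a quality score for the whole batch.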

Monitoring:
  - Data freshness metrics
  - Quality score tracking
  - Alert notifications
  - Data lineage visualization
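Data freshness and quality scores both reduce to simple arithmetic once the inputs are recorded. The sketch below shows one possible definition of each; the function names, the two-hour lag limit, and the rule that zero configured checks counts as passing are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_loaded_at: datetime, now: datetime, max_lag: timedelta) -> bool:
    """True when the newest load is older than the allowed lag."""
    return now - last_loaded_at > max_lag


def quality_score(passed_checks: int, total_checks: int) -> float:
    """Share of checks that passed, in the range 0.0 to 1.0."""
    if total_checks == 0:
        return 1.0  # assumption: no configured checks counts as passing
    return passed_checks / total_checks


now = datetime(2025, 10, 3, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2025, 10, 3, 9, 30, tzinfo=timezone.utc)
stale = is_stale(last_load, now, max_lag=timedelta(hours=2))
score = quality_score(passed_checks=18, total_checks=20)
```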

Storage Architecture

MinIO Object Storage

Bucket Structure:

datalake/
├── bronze/
│   ├── asset_property/
│   ├── flight_radar/
│   └── ecommerce/
└── silver/
    ├── asset_property/
    ├── flight_radar/
    └── ecommerce/

warehouse/
├── asset_property/
├── flight_radar/
└── ecommerce/

File Formats:
  - Bronze: CSV, JSON, Parquet
  - Silver: Delta Lake (Parquet + metadata)
  - Gold: Apache Iceberg (Parquet + metadata)
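The mapping from layer and domain to a storage location can be captured in a small lookup. The sketch below follows the layer descriptions earlier in this document, which place Bronze and Silver in the `datalake` bucket and Gold in a separate `warehouse` bucket. The `layer_uri` helper and the `s3a://` scheme are assumptions; the scheme your Spark or Trino configuration expects may differ.

```python
# (bucket, prefix) per layer, following the layer descriptions in this document.
LAYER_LOCATIONS = {
    "bronze": ("datalake", "bronze"),
    "silver": ("datalake", "silver"),
    "gold": ("warehouse", ""),
}

DOMAINS = {"asset_property", "flight_radar", "ecommerce"}


def layer_uri(layer: str, domain: str, scheme: str = "s3a") -> str:
    """Return the storage URI for a domain within a layer (hypothetical helper)."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    bucket, prefix = LAYER_LOCATIONS[layer]
    parts = [part for part in (prefix, domain) if part]
    return f"{scheme}://{bucket}/" + "/".join(parts) + "/"


print(layer_uri("bronze", "flight_radar"))
print(layer_uri("gold", "ecommerce"))
```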

Metadata Management

Hive Metastore:
  - Table schemas
  - Partition information
  - Data lineage
  - Access permissions

Apache Ranger:
  - Security policies
  - Access control
  - Audit logging
  - Data governance

Query Engines

Trino Catalogs

spark_catalog: Silver layer queries
  - Delta table access
  - Data preparation tasks
  - ETL monitoring

Domain Catalogs: Gold layer analytics
  - asset_property: Property analytics
  - flight_radar: Flight analytics
  - ecommerce: E-commerce analytics
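Trino addresses tables as `catalog.schema.table`, so the catalogs above become the first part of every table reference. The sketch below builds a quoted, fully qualified name and rejects catalogs not listed in this document. The `qualified_table` helper and the schema name `analytics` are hypothetical; the document does not name the schemas inside each catalog.

```python
KNOWN_CATALOGS = {"spark_catalog", "asset_property", "flight_radar", "ecommerce"}


def quote_identifier(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def qualified_table(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified catalog.schema.table reference."""
    if catalog not in KNOWN_CATALOGS:
        raise ValueError(f"unknown catalog: {catalog}")
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


# "analytics" is a placeholder schema name used only for illustration.
table_ref = qualified_table("flight_radar", "analytics", "flights")
query = f"SELECT count(*) FROM {table_ref}"
print(query)
```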

Performance Optimization

Partitioning Strategy:
  - Date-based partitioning
  - Geographic partitioning (for property data)
  - Airline-based partitioning (for flight data)
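A partitioning strategy like the one above derives a partition value from each record: a calendar day from a timestamp, plus a domain-specific key such as a region or an airline. The sketch below computes a Hive-style partition path to make that derivation visible. The column names (`departure_time`, `airline_code`, `sale_time`, `region`) and the `partition_path` helper are hypothetical.

```python
from datetime import datetime


def partition_path(record: dict, timestamp_col: str, key_col: str) -> str:
    """Derive a Hive-style partition path from a timestamp and a key column."""
    day = datetime.fromisoformat(record[timestamp_col]).date()
    return f"date={day.isoformat()}/{key_col}={record[key_col]}"


flight = {"departure_time": "2025-10-03T08:15:00+00:00", "airline_code": "XY"}
sale = {"sale_time": "2025-09-14T00:00:00", "region": "north"}

flight_partition = partition_path(flight, "departure_time", "airline_code")
sale_partition = partition_path(sale, "sale_time", "region")
```

For the Gold layer, Iceberg can apply partition transforms such as `day(ts)` itself, so the date need not be stored as a separate column there; the sketch only shows the mapping from record to partition.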

Indexing:
  - Primary key indexes
  - Foreign key indexes
  - Query-specific indexes

Monitoring and Observability

Data Lineage

Tracking:
  - Source to destination mapping
  - Transformation logic
  - Data quality metrics
  - Processing timestamps

Performance Metrics

Key Metrics:
  - Data ingestion rate
  - Processing latency
  - Query performance
  - Storage utilization

Alerting

Triggers:
  - Data quality failures
  - Processing delays
  - Storage capacity warnings
  - Query performance degradation
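Each trigger above amounts to comparing a metric against a threshold. The sketch below evaluates a set of such rules against a metrics snapshot. The `AlertRule` type, the metric names, and every threshold value are assumptions chosen for illustration, not the platform's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float
    above: bool = True  # True: fire when metric > threshold; False: when metric < threshold


def fired_alerts(metrics: dict[str, float], rules: list[AlertRule]) -> list[str]:
    """Return the names of rules whose threshold is breached."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not reported; this sketch skips it silently
        breached = value > rule.threshold if rule.above else value < rule.threshold
        if breached:
            fired.append(rule.name)
    return fired


# Hypothetical thresholds, one rule per trigger type listed above.
rules = [
    AlertRule("data_quality_failure", "quality_score", 0.95, above=False),
    AlertRule("processing_delay", "processing_latency_s", 900),
    AlertRule("storage_capacity_warning", "storage_used_ratio", 0.80),
    AlertRule("query_performance_degradation", "query_p95_s", 30),
]
metrics = {
    "quality_score": 0.91,
    "processing_latency_s": 120,
    "storage_used_ratio": 0.85,
    "query_p95_s": 4.2,
}
alerts = fired_alerts(metrics, rules)
```

A production setup would likely also alert when a metric is missing altogether, since a silent pipeline is itself a failure signal.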

Best Practices

Data Modeling

  1. Use appropriate data types
  2. Implement proper partitioning
  3. Design for query patterns
  4. Maintain data lineage

Performance

  1. Optimize file sizes
  2. Use columnar formats
  3. Implement caching strategies
  4. Monitor query performance

Security

  1. Implement access controls
  2. Encrypt sensitive data
  3. Audit data access
  4. Conduct regular security reviews

Last update: October 3, 2025
Created: October 3, 2025