Skip to content

Flight Radar Domain

This document describes the Flight Radar data domain, including data sources, processing pipelines, and analytical models for flight tracking and aviation analytics.

Overview

The Flight Radar domain focuses on aviation data, providing insights into flight operations, airline performance, airport traffic, and route analysis. This domain demonstrates how to handle real-time flight data in a modern data platform.

Data Sources

FlightRadar24 API

Primary Source: FlightRadar24 public API - Data Format: JSON - Update Frequency: Real-time (every 30 seconds) - Data Volume: ~50K flights per day globally - Schema: Flight details, aircraft information, route data

Data Schema

{
  "flight_id": "string",
  "airline_code": "string",
  "airline_name": "string",
  "aircraft_type": "string",
  "aircraft_registration": "string",
  "departure_airport": "string",
  "arrival_airport": "string",
  "departure_time": "timestamp",
  "arrival_time": "timestamp",
  "scheduled_departure": "timestamp",
  "scheduled_arrival": "timestamp",
  "flight_status": "string",
  "altitude": "integer",
  "speed": "integer",
  "heading": "integer",
  "latitude": "decimal",
  "longitude": "decimal",
  "route_distance": "decimal",
  "flight_duration": "integer"
}

Data Pipeline

Bronze Layer (Raw Data)

Storage: MinIO datalake/bronze/flight_radar/ Format: JSON Partitioning: By date and hour (year=YYYY/month=MM/day=DD/hour=HH)

@asset
def flight_radar_bronze():
    """Ingest raw flight data from FlightRadar24 API"""
    # API data collection logic
    # Real-time data streaming
    # Schema validation
    # Data quality checks
    pass

Silver Layer (Prepared Data)

Storage: MinIO datalake/silver/flight_radar/ Format: Delta Lake Processing: Apache Spark

@asset(deps=[flight_radar_bronze])
def flight_radar_silver():
    """Clean and prepare flight data"""
    # Data cleaning
    # Schema standardization
    # Duplicate removal
    # Data type conversion
    # Route calculation
    pass

Gold Layer (Analytics Data)

Storage: MinIO warehouse/flight_radar/ Format: Apache Iceberg Processing: dbt transformations

-- Airlines table
CREATE TABLE airlines (
    airline_code STRING,
    airline_name STRING,
    country STRING,
    fleet_size INT,
    founded_year INT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
) USING ICEBERG;

-- Airports table
CREATE TABLE airports (
    airport_code STRING,
    airport_name STRING,
    city STRING,
    country STRING,
    latitude DECIMAL(10,8),
    longitude DECIMAL(11,8),
    timezone STRING,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
) USING ICEBERG;

-- Flights table
CREATE TABLE flights (
    flight_id STRING,
    airline_code STRING,
    aircraft_type STRING,
    aircraft_registration STRING,
    departure_airport STRING,
    arrival_airport STRING,
    departure_time TIMESTAMP,
    arrival_time TIMESTAMP,
    scheduled_departure TIMESTAMP,
    scheduled_arrival TIMESTAMP,
    flight_status STRING,
    route_distance DECIMAL(10,2),
    flight_duration INT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
) USING ICEBERG
PARTITIONED BY (departure_time);

Analytical Models

Flight Operations Model

Purpose: Core flight tracking data Key Metrics: - Flight frequency - On-time performance - Flight duration - Route efficiency

Airline Performance Model

Purpose: Airline operational analytics Key Metrics: - Fleet utilization - Route performance - Delay analysis - Customer satisfaction

Airport Traffic Model

Purpose: Airport operations analysis Key Metrics: - Traffic volume - Peak hours - Runway utilization - Capacity analysis

Route Analysis Model

Purpose: Route performance and optimization Key Metrics: - Route popularity - Distance efficiency - Frequency analysis - Market share

Data Quality

Validation Rules

# Data quality checks
def validate_flight_data(df):
    """Validate flight data quality"""

    # Check required fields
    required_fields = ['flight_id', 'airline_code', 'departure_airport', 'arrival_airport']
    for field in required_fields:
        assert df[field].isnull().sum() == 0, f"Missing values in {field}"

    # Check data types
    assert df['departure_time'].dtype == 'timestamp', "Invalid departure time format"
    assert df['arrival_time'].dtype == 'timestamp', "Invalid arrival time format"

    # Check value ranges
    assert df['altitude'].min() >= 0, "Invalid altitude range"
    assert df['speed'].min() >= 0, "Invalid speed range"
    assert df['heading'].min() >= 0 and df['heading'].max() <= 360, "Invalid heading range"

    # Check logical constraints
    assert (df['arrival_time'] > df['departure_time']).all(), "Invalid flight times"

    return True

Data Quality Metrics

  • Completeness: Percentage of non-null values
  • Accuracy: Data validation against source
  • Consistency: Cross-field validation
  • Timeliness: Data freshness tracking

Query Examples

Basic Queries

-- Top 10 busiest airports
SELECT 
    departure_airport,
    COUNT(*) as departure_count
FROM flights
WHERE departure_time >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY departure_airport
ORDER BY departure_count DESC
LIMIT 10;

-- Airline performance by on-time percentage
SELECT 
    f.airline_code,
    a.airline_name,
    COUNT(*) as total_flights,
    COUNT(CASE WHEN f.departure_time <= f.scheduled_departure THEN 1 END) as on_time_flights,
    ROUND(COUNT(CASE WHEN f.departure_time <= f.scheduled_departure THEN 1 END) * 100.0 / COUNT(*), 2) as on_time_percentage
FROM flights f
JOIN airlines a ON f.airline_code = a.airline_code
WHERE f.departure_time >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY f.airline_code, a.airline_name
ORDER BY on_time_percentage DESC;

Advanced Analytics

-- Route analysis with market share
WITH route_stats AS (
    SELECT 
        departure_airport,
        arrival_airport,
        COUNT(*) as total_flights,
        COUNT(DISTINCT airline_code) as airlines_count,
        AVG(flight_duration) as avg_duration,
        AVG(route_distance) as avg_distance
    FROM flights
    WHERE departure_time >= CURRENT_DATE - INTERVAL '30' DAY
    GROUP BY departure_airport, arrival_airport
),
airline_route_share AS (
    SELECT 
        f.departure_airport,
        f.arrival_airport,
        f.airline_code,
        a.airline_name,
        COUNT(*) as airline_flights,
        rs.total_flights,
        ROUND(COUNT(*) * 100.0 / rs.total_flights, 2) as market_share
    FROM flights f
    JOIN airlines a ON f.airline_code = a.airline_code
    JOIN route_stats rs ON f.departure_airport = rs.departure_airport 
                        AND f.arrival_airport = rs.arrival_airport
    WHERE f.departure_time >= CURRENT_DATE - INTERVAL '30' DAY
    GROUP BY f.departure_airport, f.arrival_airport, f.airline_code, a.airline_name, rs.total_flights
)
SELECT 
    departure_airport,
    arrival_airport,
    airline_code,
    airline_name,
    airline_flights,
    total_flights,
    market_share
FROM airline_route_share
WHERE market_share >= 10  -- Routes with significant market share
ORDER BY total_flights DESC, market_share DESC;

Performance Optimization

Partitioning Strategy

-- Partition by departure time for time-based queries
PARTITIONED BY (departure_time)

-- Additional partitioning for geographic queries
PARTITIONED BY (departure_time, departure_airport)

Indexing

-- Create indexes for common query patterns
CREATE INDEX idx_flights_airline ON flights (airline_code);
CREATE INDEX idx_flights_departure_airport ON flights (departure_airport);
CREATE INDEX idx_flights_arrival_airport ON flights (arrival_airport);
CREATE INDEX idx_flights_departure_time ON flights (departure_time);

Caching

  • Query Result Caching: Cache frequent query results
  • Data Caching: Cache frequently accessed data
  • Metadata Caching: Cache table metadata

Monitoring

Key Metrics

  • Data Volume: Flights ingested per hour
  • Data Quality: Quality score trends
  • Processing Time: Pipeline execution duration
  • Query Performance: Average query response time

Alerts

  • Data Quality: Quality score below threshold
  • Processing Delays: Pipeline execution time exceeded
  • Data Volume: Unusual data volume changes
  • Query Performance: Slow query detection

Business Use Cases

Aviation Analytics

  • Flight Operations: Real-time flight tracking
  • Airline Performance: Operational efficiency analysis
  • Airport Management: Traffic and capacity analysis
  • Route Optimization: Market analysis and planning

Customer Insights

  • Travel Patterns: Passenger flow analysis
  • Delay Analysis: Service quality metrics
  • Market Trends: Industry performance trends
  • Competitive Analysis: Market share analysis

Data Governance

Access Control

  • Role-based Access: Different access levels for different users
  • Data Classification: Sensitive data identification
  • Audit Logging: All data access logged
  • Compliance: Aviation regulations and privacy

Data Lineage

  • Source Tracking: Data origin documentation
  • Transformation Logging: All data transformations logged
  • Quality Tracking: Data quality metrics over time
  • Impact Analysis: Change impact assessment

Future Enhancements

Planned Features

  • Real-time Streaming: Live flight tracking
  • ML Models: Delay prediction models
  • Geographic Analysis: Spatial data analysis
  • Weather Integration: Weather impact analysis

Integration Opportunities

  • External Data: Weather data, airport information
  • ML Pipeline: Machine learning model integration
  • API Development: Real-time data access APIs
  • Dashboard Enhancement: Advanced visualization features

Last update: October 3, 2025
Created: October 3, 2025