Flight Radar Domain¶
This document describes the Flight Radar data domain, including data sources, processing pipelines, and analytical models for flight tracking and aviation analytics.
Overview¶
The Flight Radar domain focuses on aviation data, providing insights into flight operations, airline performance, airport traffic, and route analysis. This domain demonstrates how to handle real-time flight data in a modern data platform.
Data Sources¶
FlightRadar24 API¶
Primary Source: FlightRadar24 public API - Data Format: JSON - Update Frequency: Real-time (every 30 seconds) - Data Volume: ~50K flights per day globally - Schema: Flight details, aircraft information, route data
Data Schema¶
{
"flight_id": "string",
"airline_code": "string",
"airline_name": "string",
"aircraft_type": "string",
"aircraft_registration": "string",
"departure_airport": "string",
"arrival_airport": "string",
"departure_time": "timestamp",
"arrival_time": "timestamp",
"scheduled_departure": "timestamp",
"scheduled_arrival": "timestamp",
"flight_status": "string",
"altitude": "integer",
"speed": "integer",
"heading": "integer",
"latitude": "decimal",
"longitude": "decimal",
"route_distance": "decimal",
"flight_duration": "integer"
}
Data Pipeline¶
Bronze Layer (Raw Data)¶
Storage: MinIO datalake/bronze/flight_radar/
Format: JSON
Partitioning: By date and hour (year=YYYY/month=MM/day=DD/hour=HH)
@asset
def flight_radar_bronze():
"""Ingest raw flight data from FlightRadar24 API"""
# API data collection logic
# Real-time data streaming
# Schema validation
# Data quality checks
pass
Silver Layer (Prepared Data)¶
Storage: MinIO datalake/silver/flight_radar/
Format: Delta Lake
Processing: Apache Spark
@asset(deps=[flight_radar_bronze])
def flight_radar_silver():
"""Clean and prepare flight data"""
# Data cleaning
# Schema standardization
# Duplicate removal
# Data type conversion
# Route calculation
pass
Gold Layer (Analytics Data)¶
Storage: MinIO warehouse/flight_radar/
Format: Apache Iceberg
Processing: dbt transformations
-- Airlines table
CREATE TABLE airlines (
airline_code STRING,
airline_name STRING,
country STRING,
fleet_size INT,
founded_year INT,
created_at TIMESTAMP,
updated_at TIMESTAMP
) USING ICEBERG;
-- Airports table
CREATE TABLE airports (
airport_code STRING,
airport_name STRING,
city STRING,
country STRING,
latitude DECIMAL(10,8),
longitude DECIMAL(11,8),
timezone STRING,
created_at TIMESTAMP,
updated_at TIMESTAMP
) USING ICEBERG;
-- Flights table
CREATE TABLE flights (
flight_id STRING,
airline_code STRING,
aircraft_type STRING,
aircraft_registration STRING,
departure_airport STRING,
arrival_airport STRING,
departure_time TIMESTAMP,
arrival_time TIMESTAMP,
scheduled_departure TIMESTAMP,
scheduled_arrival TIMESTAMP,
flight_status STRING,
route_distance DECIMAL(10,2),
flight_duration INT,
created_at TIMESTAMP,
updated_at TIMESTAMP
) USING ICEBERG
PARTITIONED BY (departure_time);
Analytical Models¶
Flight Operations Model¶
Purpose: Core flight tracking data Key Metrics: - Flight frequency - On-time performance - Flight duration - Route efficiency
Airline Performance Model¶
Purpose: Airline operational analytics Key Metrics: - Fleet utilization - Route performance - Delay analysis - Customer satisfaction
Airport Traffic Model¶
Purpose: Airport operations analysis Key Metrics: - Traffic volume - Peak hours - Runway utilization - Capacity analysis
Route Analysis Model¶
Purpose: Route performance and optimization Key Metrics: - Route popularity - Distance efficiency - Frequency analysis - Market share
Data Quality¶
Validation Rules¶
# Data quality checks
def validate_flight_data(df):
"""Validate flight data quality"""
# Check required fields
required_fields = ['flight_id', 'airline_code', 'departure_airport', 'arrival_airport']
for field in required_fields:
assert df[field].isnull().sum() == 0, f"Missing values in {field}"
# Check data types
assert df['departure_time'].dtype == 'timestamp', "Invalid departure time format"
assert df['arrival_time'].dtype == 'timestamp', "Invalid arrival time format"
# Check value ranges
assert df['altitude'].min() >= 0, "Invalid altitude range"
assert df['speed'].min() >= 0, "Invalid speed range"
assert df['heading'].min() >= 0 and df['heading'].max() <= 360, "Invalid heading range"
# Check logical constraints
assert (df['arrival_time'] > df['departure_time']).all(), "Invalid flight times"
return True
Data Quality Metrics¶
- Completeness: Percentage of non-null values
- Accuracy: Data validation against source
- Consistency: Cross-field validation
- Timeliness: Data freshness tracking
Query Examples¶
Basic Queries¶
-- Top 10 busiest airports
SELECT
departure_airport,
COUNT(*) as departure_count
FROM flights
WHERE departure_time >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY departure_airport
ORDER BY departure_count DESC
LIMIT 10;
-- Airline performance by on-time percentage
SELECT
f.airline_code,
a.airline_name,
COUNT(*) as total_flights,
COUNT(CASE WHEN f.departure_time <= f.scheduled_departure THEN 1 END) as on_time_flights,
ROUND(COUNT(CASE WHEN f.departure_time <= f.scheduled_departure THEN 1 END) * 100.0 / COUNT(*), 2) as on_time_percentage
FROM flights f
JOIN airlines a ON f.airline_code = a.airline_code
WHERE f.departure_time >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY f.airline_code, a.airline_name
ORDER BY on_time_percentage DESC;
Advanced Analytics¶
-- Route analysis with market share
WITH route_stats AS (
SELECT
departure_airport,
arrival_airport,
COUNT(*) as total_flights,
COUNT(DISTINCT airline_code) as airlines_count,
AVG(flight_duration) as avg_duration,
AVG(route_distance) as avg_distance
FROM flights
WHERE departure_time >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY departure_airport, arrival_airport
),
airline_route_share AS (
SELECT
f.departure_airport,
f.arrival_airport,
f.airline_code,
a.airline_name,
COUNT(*) as airline_flights,
rs.total_flights,
ROUND(COUNT(*) * 100.0 / rs.total_flights, 2) as market_share
FROM flights f
JOIN airlines a ON f.airline_code = a.airline_code
JOIN route_stats rs ON f.departure_airport = rs.departure_airport
AND f.arrival_airport = rs.arrival_airport
WHERE f.departure_time >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY f.departure_airport, f.arrival_airport, f.airline_code, a.airline_name, rs.total_flights
)
SELECT
departure_airport,
arrival_airport,
airline_code,
airline_name,
airline_flights,
total_flights,
market_share
FROM airline_route_share
WHERE market_share >= 10 -- Routes with significant market share
ORDER BY total_flights DESC, market_share DESC;
Performance Optimization¶
Partitioning Strategy¶
-- Partition by departure time for time-based queries
PARTITIONED BY (departure_time)
-- Additional partitioning for geographic queries
PARTITIONED BY (departure_time, departure_airport)
Indexing¶
-- Create indexes for common query patterns
CREATE INDEX idx_flights_airline ON flights (airline_code);
CREATE INDEX idx_flights_departure_airport ON flights (departure_airport);
CREATE INDEX idx_flights_arrival_airport ON flights (arrival_airport);
CREATE INDEX idx_flights_departure_time ON flights (departure_time);
Caching¶
- Query Result Caching: Cache frequent query results
- Data Caching: Cache frequently accessed data
- Metadata Caching: Cache table metadata
Monitoring¶
Key Metrics¶
- Data Volume: Flights ingested per hour
- Data Quality: Quality score trends
- Processing Time: Pipeline execution duration
- Query Performance: Average query response time
Alerts¶
- Data Quality: Quality score below threshold
- Processing Delays: Pipeline execution time exceeded
- Data Volume: Unusual data volume changes
- Query Performance: Slow query detection
Business Use Cases¶
Aviation Analytics¶
- Flight Operations: Real-time flight tracking
- Airline Performance: Operational efficiency analysis
- Airport Management: Traffic and capacity analysis
- Route Optimization: Market analysis and planning
Customer Insights¶
- Travel Patterns: Passenger flow analysis
- Delay Analysis: Service quality metrics
- Market Trends: Industry performance trends
- Competitive Analysis: Market share analysis
Data Governance¶
Access Control¶
- Role-based Access: Different access levels for different users
- Data Classification: Sensitive data identification
- Audit Logging: All data access logged
- Compliance: Aviation regulations and privacy
Data Lineage¶
- Source Tracking: Data origin documentation
- Transformation Logging: All data transformations logged
- Quality Tracking: Data quality metrics over time
- Impact Analysis: Change impact assessment
Future Enhancements¶
Planned Features¶
- Real-time Streaming: Live flight tracking
- ML Models: Delay prediction models
- Geographic Analysis: Spatial data analysis
- Weather Integration: Weather impact analysis
Integration Opportunities¶
- External Data: Weather data, airport information
- ML Pipeline: Machine learning model integration
- API Development: Real-time data access APIs
- Dashboard Enhancement: Advanced visualization features
Created: October 3, 2025