Each data row contains details on:
The airline name and a chartered/scheduled type
The number of flights on the route
The average delay (in minutes)
The percentage of flights that fit into a set of different lateness bins
We’ve structured our data around the ‘Reporting Airport’ (the UK airport reporting the data). We’ve then extracted flight delay data and linked in airport passenger data for some of those years (although this has not been used in the analysis below).
This data has been combined to create a FastStats system. Comprising roughly 1.4 million rows of data representing over 31 million flights originating in the UK over the last 22 years, it’s not a large dataset. The data model isn’t particularly complicated – there are fewer than 30 variables, primarily selectors and numerics, with a date field representing the reporting month. However, a few issues needed to be addressed in the data preparation phase:
The average delay in minutes was allowed to be negative before the year 2000; the reporting was then changed so that any negative delay was recorded as 0.
For many (but not all) intra-UK flight routes (e.g. Aberdeen-Birmingham) the data was duplicated, appearing once for each of the two airports, so we’ve flagged the data where this has occurred. This allows us to analyse flight routes and numbers by ignoring the duplicates, and also to analyse delays at each airport by including the duplicate data.
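To make the two issues above concrete, here is a minimal Python sketch of how they might be handled outside FastStats. It is an illustration only, not the preparation we actually ran: the `Row` class, its field names and the sample values are all hypothetical, and clamping negative averages to 0 is just one possible way of making the pre-2000 rows comparable with the later reporting convention.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Row:
    # Hypothetical field names; the real variables may be named differently.
    reporting_airport: str
    other_airport: str
    year: int
    flights: int
    avg_delay_min: float
    is_duplicate: bool = False  # True when the route is also reported by the other airport


def clamp_delay(avg_delay_min: float) -> float:
    """Treat a negative average delay as 0, mirroring the post-2000 convention (an assumption)."""
    return max(avg_delay_min, 0.0)


def total_flights(rows: Iterable[Row]) -> int:
    """Count flights for route-level analysis, ignoring rows flagged as duplicates."""
    return sum(r.flights for r in rows if not r.is_duplicate)


def airport_avg_delay(rows: Iterable[Row], airport: str) -> Optional[float]:
    """Flight-weighted average delay at one airport, including duplicate-flagged rows."""
    selected = [r for r in rows if r.reporting_airport == airport]
    flights = sum(r.flights for r in selected)
    if flights == 0:
        return None  # no data for this airport
    weighted = sum(clamp_delay(r.avg_delay_min) * r.flights for r in selected)
    return weighted / flights


# Made-up sample rows: the Aberdeen-Birmingham route is reported by both airports.
sample = [
    Row("Aberdeen", "Birmingham", 1998, 100, -2.0),
    Row("Birmingham", "Aberdeen", 1998, 100, 4.0, is_duplicate=True),
    Row("Aberdeen", "Paris", 2005, 50, 10.0),
]

route_flights = total_flights(sample)                      # duplicate row is skipped
birmingham_delay = airport_avg_delay(sample, "Birmingham") # duplicate row is kept
```

The point of the split is that the same flag serves both purposes: counts of routes and flights drop the flagged rows so nothing is counted twice, while per-airport delay figures keep them so each airport still sees all of its own flights.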