utils.data_preparation package

Submodules

utils.data_preparation.constants module

utils.data_preparation.download module

exception utils.data_preparation.download.CustomTimeoutException(message='Unable to download the data')

Bases: Exception

Custom timeout exception for Selenium WebDriver

utils.data_preparation.download.download(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') → None

Downloads the data into the dir directory.

Warning

This function relies heavily on the URL structure. Any errors are most likely caused by changes to it.

Parameters:

dir – target data directory

utils.data_preparation.download.downloaded(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') → bool

Waits until the file is downloaded. It has to be the only file in dir.

Parameters:

dir – the directory into which some file is being downloaded
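The waiting behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the helper name wait_for_single_file, the timeout and poll parameters, and the rule that names ending in .crdownload or .part mark an unfinished download are all assumptions made for the example.

```python
import os
import time

# Assumption: browsers mark unfinished downloads with one of these suffixes.
PARTIAL_SUFFIXES = (".crdownload", ".part")


def wait_for_single_file(directory: str, timeout: float = 30.0, poll: float = 0.5) -> bool:
    """Return True once `directory` holds exactly one finished file.

    Hypothetical helper: returns False if that state is not reached
    within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        # Exactly one entry, and it is not a partial-download placeholder.
        if len(names) == 1 and not names[0].endswith(PARTIAL_SUFFIXES):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Polling against a monotonic deadline keeps the wait bounded even if the system clock changes while the download is running.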

utils.data_preparation.download.get_element(driver: WebDriver, xpath: str, timeout: int = 3) → WebElement

Waits up to timeout seconds for the element identified by xpath to appear on the page.

Parameters:
  • driver – a driver with the page already loaded

  • xpath – the XPath of the element on the page

  • timeout – the number of seconds to wait until raising an error

Raises:

CustomTimeoutException

Returns:

Desired Element
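The wait-then-raise pattern described above can be sketched without a browser. In this illustration the lookup is passed in as a callable, so the names wait_for, find, poll and LookupTimeout are hypothetical; LookupTimeout only stands in for CustomTimeoutException. With Selenium, find would typically wrap a call such as driver.find_element(By.XPATH, xpath) and return None when the element is not present yet.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class LookupTimeout(Exception):
    """Hypothetical stand-in for CustomTimeoutException."""


def wait_for(find: Callable[[], Optional[T]], timeout: float = 3.0, poll: float = 0.1) -> T:
    """Call `find` repeatedly until it returns a value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = find()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise LookupTimeout(f"nothing found within {timeout} seconds")
        time.sleep(poll)
```

Raising a dedicated exception type lets the caller tell a missing element apart from other failures.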

utils.data_preparation.load_airports_additional module

utils.data_preparation.load_airports_additional.load_airports_details()

utils.data_preparation.load_data module

utils.data_preparation.load_data.load_airports(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') → DataFrame

Loads airports from airports.csv

Parameters:

dir – target data directory

Returns:

DataFrame with loaded data

utils.data_preparation.load_data.load_carriers(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') → DataFrame

Loads carriers data from carriers.pkl

Parameters:

dir – target data directory

Returns:

DataFrame with loaded data

utils.data_preparation.load_data.load_flights(years: str | List[str] = 'all', cols: List[str] | None = None, dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') → DataFrame

Loads flight data into memory

Parameters:
  • years – “all” for all available data, or a list of str from {“1987”, …, “2008”} for specific years

  • cols – the columns to load; if None, all columns are loaded

  • dir – target data directory

Returns:

DataFrame with loaded data
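How the years argument might be normalised before any files are read can be sketched as follows. The helper resolve_years and the constant AVAILABLE_YEARS are hypothetical names, and accepting a single year string other than “all” is an assumption based only on the str | List[str] annotation.

```python
from typing import List, Union

# The documented range of years: "1987" through "2008".
AVAILABLE_YEARS = [str(year) for year in range(1987, 2009)]


def resolve_years(years: Union[str, List[str]] = "all") -> List[str]:
    """Turn the `years` argument into an explicit list of year strings."""
    if years == "all":
        return list(AVAILABLE_YEARS)
    if isinstance(years, str):
        years = [years]  # assumption: a single year string is accepted
    unknown = [year for year in years if year not in AVAILABLE_YEARS]
    if unknown:
        raise ValueError(f"unsupported years: {unknown}")
    return list(years)
```

Validating the list up front gives a clear error for a year outside the data set instead of a file-not-found failure later on.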

utils.data_preparation.load_data.load_pkl(filename: str, dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets')

Utility function that loads a .pkl file into a pd.DataFrame

Parameters:
  • filename (str) – file name

  • dir – target data directory

Returns:

DataFrame with loaded data

utils.data_preparation.load_data.load_plane_data(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') → DataFrame

Loads plane data from plane-data.pkl

Parameters:

dir – target data directory

Returns:

DataFrame with loaded data

utils.data_preparation.load_data.prepare_data(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets', datetime_features: List[str] = []) → None

Downloads and extracts data. It assumes three possible situations:

  • dir is empty, so the function downloads and extracts the data itself,

  • dir contains only a zip archive, so the function extracts it,

  • dir contains both a zip archive and its extracted data, so the function does nothing.

Warning

This function relies heavily on the URL structure. Any errors are most likely caused by changes to it.

Parameters:
  • dir – target data directory

  • datetime_features – List of columns that can be cast to datetime, which significantly reduces space usage
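The three situations that prepare_data distinguishes can be sketched as a small decision function over the directory contents. The name plan_preparation and the returned labels are hypothetical, and recognising the archive by a .zip suffix is an assumption. A directory that matches none of the three documented cases raises an error here, since the documentation does not say what happens then.

```python
import os


def plan_preparation(directory: str) -> str:
    """Return "download", "extract" or "nothing" for the given directory."""
    names = os.listdir(directory)
    archives = [name for name in names if name.lower().endswith(".zip")]
    others = [name for name in names if name not in archives]

    if not names:
        return "download"  # empty directory: download, then extract
    if archives and not others:
        return "extract"   # only the archive is present: extract it
    if archives and others:
        return "nothing"   # archive and extracted data are both present
    raise ValueError("directory contents match none of the documented cases")
```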

utils.data_preparation.load_data.unpack(dir: str, filename: str, datetime_features: List[str] = []) → None

Unpacks the file filename into the directory dir

utils.data_preparation.optimize module

utils.data_preparation.optimize.concatenate(dfs: List[DataFrame], threshold: int = 50) → DataFrame

Concatenates DataFrames while preserving categorical columns.

Parameters:
  • dfs – list of DataFrames to concatenate

  • threshold – the target column is left as categorical if its unique values make up less than threshold % of all values

utils.data_preparation.optimize.convert_to_hhmm(df: DataFrame) → DataFrame

Converts every column of df to an hhmm string format

Returns:

modified df
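The hhmm format can be illustrated on a single value. This sketch assumes that the input is an integer-like clock reading encoded as hour × 100 + minute (for example 930 for 09:30), which the documentation does not state; the helper name to_hhmm is hypothetical, and how the real function treats missing values is not covered.

```python
def to_hhmm(value) -> str:
    """Format an integer-like clock reading such as 930 as the string "0930"."""
    number = int(value)
    # Assumption: valid readings lie between 0000 and 2400 inclusive.
    if not 0 <= number <= 2400:
        raise ValueError(f"not a valid hhmm value: {value!r}")
    return f"{number:04d}"  # zero-pad to four digits


print(to_hhmm(930))   # 0930
print(to_hhmm(1745))  # 1745
```

Storing the values as fixed-width strings keeps the leading zero that an integer representation loses.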

utils.data_preparation.optimize.optimize(df: DataFrame, datetime_features: List[str] = [], flights_data: bool = False) → None

Optimizes data space usage

Parameters:
  • df – DataFrame holding data

  • datetime_features – List of columns that can be cast to datetime, which significantly reduces space usage

  • flights_data – special flag that triggers additional data conversions only for flights data

utils.data_preparation.optimize.optimize_floats(df: DataFrame) → None

Optimizes data space usage by casting float columns to the smallest possible size

Parameters:

df – DataFrame holding data

utils.data_preparation.optimize.optimize_ints(df: DataFrame) → None

Optimizes data space usage by casting integer columns to the smallest possible size

Parameters:

df – DataFrame holding data
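The idea of picking the smallest integer size can be sketched in plain Python: find the narrowest signed type whose range covers the column's minimum and maximum. The helper smallest_int_dtype is hypothetical and only returns a dtype name; whether the real function also considers unsigned types is not documented. In pandas, a comparable effect is available through pd.to_numeric with downcast="integer".

```python
from typing import Iterable

# Signed integer widths, narrowest first, with their inclusive ranges.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_dtype(values: Iterable[int]) -> str:
    """Return the name of the narrowest signed dtype that can hold all values."""
    values = list(values)
    lowest, highest = min(values), max(values)
    for name, low, high in INT_RANGES:
        if low <= lowest and highest <= high:
            return name
    raise OverflowError("values do not fit into a 64-bit signed integer")
```

A column of values between 0 and 100 fits into int8, which uses one byte per entry instead of the eight bytes of the default int64.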

utils.data_preparation.optimize.optimize_objects(df: DataFrame, datetime_features: List[str], threshold: int = 50) → None

Optimizes data space usage by casting object columns to the pandas category dtype if at most threshold % of their entries are unique, and the datetime_features columns to datetime

Parameters:
  • df – DataFrame holding data

  • datetime_features – List of columns that can be cast to datetime, which significantly reduces space usage

  • threshold – int from 0 to 100
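The threshold rule can be sketched for a single column as a check on the share of unique entries. The helper is_categorical_candidate is a hypothetical name, and treating an empty column as not convertible is an assumption. In pandas, a column that passes the check could then be converted with astype("category").

```python
from typing import Sequence


def is_categorical_candidate(values: Sequence[str], threshold: int = 50) -> bool:
    """Return True if at most `threshold` % of the entries are unique."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    if not values:
        return False  # assumption: an empty column is left unchanged
    unique_pct = 100 * len(set(values)) / len(values)
    return unique_pct <= threshold
```

A carrier-code column with a few codes repeated over many rows passes the check easily, while a column of nearly unique identifiers does not, so it stays as object data.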

Module contents