utils.data_preparation package
Submodules
utils.data_preparation.constants module
utils.data_preparation.download module
- exception utils.data_preparation.download.CustomTimeoutException(message='Unable to download the data')[source]
Bases: Exception
Custom timeout exception for Selenium WebDriver
- utils.data_preparation.download.download(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') None[source]
Downloads data into the dir directory
Warning
This function relies heavily on the URL structure. Any errors are most likely caused by changes to it.
- Parameters:
dir – target data directory
- utils.data_preparation.download.downloaded(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') bool[source]
Waits until the file is downloaded. It has to be the only file in dir
- Parameters:
dir – the directory into which some file is being downloaded
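The waiting behaviour described above can be illustrated with a small polling loop. This is only a sketch, not the package's implementation: the name wait_for_download, the timeout and poll parameters, and the list of partial-download suffixes are all assumptions made for illustration.

```python
import os
import time


def wait_for_download(directory: str, timeout: float = 60.0, poll: float = 0.5) -> bool:
    """Return True once `directory` holds exactly one finished file.

    Hypothetical helper: returns False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        # Assumption: unfinished browser downloads carry one of these suffixes.
        partial = [n for n in names if n.endswith((".crdownload", ".part", ".tmp"))]
        if len(names) == 1 and not partial:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Requiring a single file in the directory mirrors the documented constraint that the downloaded file has to be the only one in dir.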
- utils.data_preparation.download.get_element(driver: WebDriver, xpath: str, timeout: int = 3) WebElement[source]
Waits up to timeout seconds for the element pointed to by xpath to appear on the page.
- Parameters:
driver – a driver with loaded page
xpath – an element’s xpath on page
timeout – the number of seconds to wait until raising an error
- Raises:
CustomTimeoutException
- Returns:
Desired Element
utils.data_preparation.load_airports_additional module
utils.data_preparation.load_data module
- utils.data_preparation.load_data.load_airports(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') DataFrame[source]
Loads airports from airports.csv
- Parameters:
dir – target data directory
- Returns:
DataFrame with loaded data
- utils.data_preparation.load_data.load_carriers(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') DataFrame[source]
Loads carriers data from carriers.pkl
- Parameters:
dir – target data directory
- Returns:
DataFrame with loaded data
- utils.data_preparation.load_data.load_flights(years: str | List[str] = 'all', cols: List[str] | None = None, dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') DataFrame[source]
Loads flight data into memory
- Parameters:
years – “all” for all available data, or a list of str from {“1987”, …, “2008”} for specific years
cols – desired columns to be loaded, if None entire data is loaded
dir – target data directory
- Returns:
DataFrame with loaded data
- utils.data_preparation.load_data.load_pkl(filename: str, dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets')[source]
Utility function that loads a .pkl file into a pd.DataFrame
- Parameters:
filename (str) – file name
dir – target data directory
- Returns:
DataFrame with loaded data
- utils.data_preparation.load_data.load_plane_data(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets') DataFrame[source]
Loads plane data from plane-data.pkl
- Parameters:
dir – target data directory
- Returns:
DataFrame with loaded data
- utils.data_preparation.load_data.prepare_data(dir: str = '/home/runner/work/Airline-Performance-Data-Analysis/Airline-Performance-Data-Analysis/src/datasets', datetime_features: List[str] = []) None[source]
Downloads and extracts data. It assumes 3 possible situations:
- dir is empty, so it downloads and extracts data on its own,
- dir contains only a zip archive, so it extracts it on its own,
- dir contains both a zip archive and its extracted data, it does nothing.
Warning
This function relies heavily on the URL structure. Any errors are most likely caused by changes to it.
- Parameters:
dir – target data directory
datetime_features – List of columns that can be cast to datetime, which significantly reduces space usage
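The three directory situations that prepare_data distinguishes can be sketched as a small classifier. The function name directory_state and its return labels are hypothetical, and detecting the archive by a .zip suffix is an assumption; the actual function may inspect the directory differently.

```python
import os


def directory_state(directory: str) -> str:
    """Classify `directory` into one of the three documented situations.

    Hypothetical helper; returns 'unknown' for anything outside those cases.
    """
    names = os.listdir(directory)
    # Assumption: the archive is recognised by its .zip extension.
    archives = [n for n in names if n.lower().endswith(".zip")]
    others = [n for n in names if n not in archives]
    if not names:
        return "empty"          # download, then extract
    if archives and not others:
        return "archive_only"   # extract only
    if archives and others:
        return "extracted"      # nothing to do
    return "unknown"            # not covered by the documentation
```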
utils.data_preparation.optimize module
- utils.data_preparation.optimize.concatenate(dfs: List[DataFrame], threshold: int = 50) DataFrame[source]
Concatenate while preserving categorical columns.
- Parameters:
dfs – list of DataFrames to concatenate
threshold – the target column is left as categorical if its unique values make up less than threshold % of all values
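Plain pd.concat falls back to object dtype when categorical columns have different category sets, which is presumably why a dedicated helper exists. One possible approach, sketched below, aligns the categories first with pandas' union_categoricals. The function name concat_keep_categories is hypothetical, the threshold check is left out, and the package's own implementation may differ.

```python
from typing import List

import pandas as pd
from pandas.api.types import union_categoricals


def concat_keep_categories(dfs: List[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate frames while keeping shared categorical columns categorical."""
    dfs = [df.copy() for df in dfs]  # leave the callers' frames untouched
    for col in dfs[0].columns:
        if all(isinstance(df[col].dtype, pd.CategoricalDtype) for df in dfs):
            # Build the union of all categories, then re-encode every frame
            # against it so that the dtypes match exactly.
            merged = union_categoricals([df[col] for df in dfs])
            for df in dfs:
                df[col] = pd.Categorical(df[col], categories=merged.categories)
    return pd.concat(dfs, ignore_index=True)
```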
- utils.data_preparation.optimize.convert_to_hhmm(df: DataFrame) DataFrame[source]
Converts every column of df into a hhmm string format
- Returns:
modified df
- utils.data_preparation.optimize.optimize(df: DataFrame, datetime_features: List[str] = [], flights_data: bool = False) None[source]
Optimizes data space usage
- Parameters:
df – DataFrame holding data
datetime_features – List of columns that can be cast to datetime, which significantly reduces space usage
flights_data – special flag that triggers additional data conversions only for flights data
- utils.data_preparation.optimize.optimize_floats(df: DataFrame) None[source]
Optimizes data space usage by casting float columns to smallest possible size
- Parameters:
df – DataFrame holding data
- utils.data_preparation.optimize.optimize_ints(df: DataFrame) None[source]
Optimizes data space usage by casting integer columns to smallest possible size
- Parameters:
df – DataFrame holding data
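Casting integer columns to the smallest possible size can be done with pandas' pd.to_numeric and its downcast option. The sketch below modifies the frame in place, matching the documented None return; the name downcast_ints is hypothetical and the package may implement this step differently.

```python
import pandas as pd


def downcast_ints(df: pd.DataFrame) -> None:
    """Shrink every integer column of `df` in place to the smallest signed type."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            # downcast="integer" picks the smallest signed integer dtype
            # that can hold all values in the column.
            df[col] = pd.to_numeric(df[col], downcast="integer")
```

For example, a month column with values 1 to 12 fits in int8, whereas a column holding values above 32767 needs int32.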
- utils.data_preparation.optimize.optimize_objects(df: DataFrame, datetime_features: List[str], threshold: int = 50) None[source]
Optimizes data space usage by casting object columns to pd.category if less than or equal to threshold % of entries are unique, and datetime_features to pd.datetime
- Parameters:
df – DataFrame holding data
datetime_features – List of columns that can be cast to datetime, which significantly reduces space usage
threshold – int from 0 to 100
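The threshold rule for object columns can be illustrated as follows. This is a minimal sketch under stated assumptions: the name categorize_objects is hypothetical, only the category conversion is shown (the datetime conversion is omitted), and the percentage is computed as the number of unique values over the number of rows.

```python
import pandas as pd


def categorize_objects(df: pd.DataFrame, threshold: int = 50) -> None:
    """Cast object columns of `df` to category, in place, when few values are unique."""
    if len(df) == 0:
        return  # nothing to measure on an empty frame
    for col in df.columns:
        if not pd.api.types.is_object_dtype(df[col]):
            continue
        # Share of distinct values in the column, as a percentage of all rows.
        unique_pct = 100 * df[col].nunique() / len(df)
        if unique_pct <= threshold:
            df[col] = df[col].astype("category")
```

A column with few repeated labels, such as a carrier code, benefits from the conversion, while a column of mostly unique identifiers stays as object because a categorical encoding would save little or no space.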