auto_prep.preprocessing package

Submodules

auto_prep.preprocessing.abstract module

class auto_prep.preprocessing.abstract.DimentionReducer[source]

Bases: NonRequiredStep, Numerical, ABC

Abstract class for dimensionality reduction techniques.

abstract fit(X: DataFrame, y: Series | None = None) DimentionReducer[source]
abstract fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

abstract to_tex() dict[source]

Returns a short description in the form of a dictionary. Keys are: name - transformer name, desc - short description, params - class parameters ({} if there are none).
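To illustrate the dictionary shape described above, here is a hypothetical return value. The transformer name, description text, and empty parameter set are made up for the example and are not taken from the library.

```python
# Hypothetical example of the dictionary shape described for to_tex().
# The concrete values are illustrative assumptions, not library output.
example_description = {
    "name": "ExampleReducer",           # transformer name
    "desc": "Reduces dimensionality.",  # short description
    "params": {},                       # class parameters; {} when there are none
}

for key in ("name", "desc", "params"):
    print(key, "->", example_description[key])
```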

abstract transform(X: DataFrame, y: Series = None) DataFrame[source]
class auto_prep.preprocessing.abstract.FeatureImportanceSelector(k: float = 10.0)[source]

Bases: NonRequiredStep

Transformer to select the k% (rounded to a whole number) of features that are most important according to a Random Forest model.

k

The percentage of top features to keep based on their importance.

Type:

float

selected_columns

List of selected columns based on feature importance.

Type:

list

fit(X: DataFrame, y: Series | None = None) FeatureImportanceSelector[source]

Identifies the top k% (rounded to whole value) of features most important according to the model.

Parameters:
  • X (pd.DataFrame) – The input feature data.

  • y (pd.Series) – The target variable.

Returns:

The fitted transformer instance.

Return type:

FeatureImportanceSelector

fit_transform(X: DataFrame, y: Series) DataFrame[source]

Fits and transforms the data by selecting the top k% most important features. Performs fit and transform in one step.

Parameters:
  • X (pd.DataFrame) – The feature data.

  • y (pd.Series) – The target variable.

Returns:

The transformed data with selected features.

Return type:

pd.DataFrame

transform(X: DataFrame, y: Series = None) DataFrame[source]

Selects the top k% of features most important according to the model.

Parameters:
  • X (pd.DataFrame) – The feature data.

  • y (pd.Series, optional) – The target variable (to append to the result).

Returns:

The transformed data with only the selected top k% important features.

Return type:

pd.DataFrame
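The selection idea behind this class can be sketched with scikit-learn alone. This is a minimal sketch, not the library's implementation: the helper name select_top_k_percent, the use of RandomForestClassifier, the rounding rule, and the "keep at least one feature" floor are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def select_top_k_percent(X: pd.DataFrame, y: pd.Series, k: float = 10.0) -> list:
    """Return the names of the top k% of columns ranked by Random Forest importance."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X, y)
    # Assumption: round k% of the column count to a whole number, keeping at least one.
    n_keep = max(1, int(round(X.shape[1] * k / 100.0)))
    order = np.argsort(model.feature_importances_)[::-1]  # most important first
    return [X.columns[i] for i in order[:n_keep]]


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 10)), columns=[f"f{i}" for i in range(10)])
y = (X["f3"] > 0).astype(int)  # the target depends only on f3

selected = select_top_k_percent(X, y, k=20.0)
print(selected)
```

With ten columns and k=20.0, two columns are kept, and the informative column f3 is expected to be among them.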

auto_prep.preprocessing.binning module

class auto_prep.preprocessing.binning.BinningTransformer(binning_method: str = 'qcut')[source]

Bases: NonRequiredStep, Numerical

Transformer that bins continuous variables using either quantile-based binning (qcut) or equal-width binning (cut) and replaces the values with numeric bin labels. A column is binned only if its number of unique values exceeds 50% of the number of samples in the column.

threshold

Fraction of unique values in a column above which the column qualifies for binning. Default: 0.5.

Type:

float

should_bin

dictionary to track which columns should be binned.

Type:

dict

bin_edges

dictionary to store the bin edges for each column.

Type:

dict

PARAMS_GRID = {'binning_method': ['qcut', 'cut']}
fit(X: DataFrame, y: Series | None = None) BinningTransformer[source]

Fits the transformer by calculating the bin edges for each continuous column if the number of unique values exceeds the threshold of 50%.

Parameters:

X (pd.DataFrame) – The input feature data.

Returns:

The fitted transformer instance.

Return type:

BinningTransformer

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits and transforms the data in one step.

Parameters:

X (pd.DataFrame) – The feature data.

Returns:

The transformed data with bin labels.

Return type:

pd.DataFrame

is_numeric() bool[source]
to_tex() dict[source]

Returns a description of the transformer in dictionary format.

Returns:

Description of the transformer.

Return type:

dict

transform(X: DataFrame) DataFrame[source]

Transforms the data by replacing continuous values with their respective bin labels (numeric).

Parameters:

X (pd.DataFrame) – The feature data.

Returns:

The transformed data with bin labels.

Return type:

pd.DataFrame
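The two binning methods and the 50% uniqueness rule can be sketched with pandas. This is a minimal sketch under stated assumptions: the helper bin_column and the choice of four bins are hypothetical, since the number of bins used by the transformer is not documented here.

```python
import numpy as np
import pandas as pd


def bin_column(col: pd.Series, method: str = "qcut", n_bins: int = 4, threshold: float = 0.5):
    """Bin a column only when its share of unique values exceeds the threshold."""
    if col.nunique() / len(col) <= threshold:
        return col, None  # too few unique values: left unchanged
    if method == "qcut":
        # Quantile-based bins: roughly the same number of samples per bin.
        labels, edges = pd.qcut(col, q=n_bins, labels=False, retbins=True, duplicates="drop")
    else:
        # Equal-width bins over the value range.
        labels, edges = pd.cut(col, bins=n_bins, labels=False, retbins=True)
    return labels, edges


continuous = pd.Series(np.linspace(0.0, 1.0, 100))      # 100% unique values
discrete = pd.Series([1, 2, 3, 1, 2, 3, 1, 2, 3, 1])    # 30% unique values

binned, edges = bin_column(continuous, method="qcut")
unchanged, no_edges = bin_column(discrete, method="cut")
print(binned.value_counts().sort_index().tolist())
```

Passing labels=False makes pandas return integer bin indices, which matches the "numeric labels" behaviour described above, and retbins=True returns the edges that a fitted transformer would need to store.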

auto_prep.preprocessing.correlation_filtering module

class auto_prep.preprocessing.correlation_filtering.CorrelationFilter[source]

Bases: RequiredStep, Numerical

Transformer to detect highly correlated features and drop one feature from each correlated pair. Pearson's correlation is used. This is a required preprocessing step.

dropped_columns

List of columns that were dropped due to high correlation.

Type:

list

fit(X: DataFrame, y: Series | None = None) CorrelationFilter[source]

Identifies highly correlated features. Adds the second one from the pair to the list of columns to be dropped.

Parameters:

X (pd.DataFrame) – The input feature data.

Returns:

The fitted filter instance.

Return type:

CorrelationFilter

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits and transforms the data by removing correlated features. Performs fit and transform in one step.

Parameters:

X (pd.DataFrame) – The feature data.

Returns:

The transformed data.

Return type:

pd.DataFrame

to_tex() dict[source]

Returns a short description of the transformer in dictionary format.

transform(X: DataFrame, y: Series = None) DataFrame[source]

Drops all features identified as highly correlated with another feature.

Parameters:

X (pd.DataFrame) – The feature data.

Returns:

The transformed data with correlated columns removed.

Return type:

pd.DataFrame
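The "drop the second feature of each correlated pair" rule can be sketched as follows. This is an illustration only: the helper find_correlated_columns and the 0.8 threshold are assumptions, because the threshold used by the filter is not stated in this reference.

```python
import numpy as np
import pandas as pd


def find_correlated_columns(X: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return the second column of every pair whose |Pearson r| exceeds the threshold."""
    corr = X.corr(method="pearson").abs()
    cols = list(corr.columns)
    to_drop = []
    for i, first in enumerate(cols):
        if first in to_drop:
            continue  # already scheduled for removal
        for second in cols[i + 1:]:
            if second not in to_drop and corr.loc[first, second] > threshold:
                to_drop.append(second)
    return to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + 0.01 * rng.normal(size=100),  # almost a rescaled copy of "a"
    "c": rng.normal(size=100),                   # independent of "a"
})

dropped = find_correlated_columns(X)
X_filtered = X.drop(columns=dropped)
print(dropped)
```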

auto_prep.preprocessing.dimention_reducing module

class auto_prep.preprocessing.dimention_reducing.PCADimentionReducer[source]

Bases: DimentionReducer

Combines data standardization and PCA with automatic selection of the number of components to preserve 95% of the variance.

fit(X: DataFrame, y: Series | None = None) PCADimentionReducer[source]

Fits PCA to the data, determining the number of components to preserve 95% of the variance.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Input data.

  • y (optional) – Target values (ignored).

Returns:

The fitted transformer.

Return type:

PCADimentionReducer

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits the transformer to the data and then transforms it.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Input data.

  • y (optional) – Target values (ignored).

Returns:

Transformed data.

Return type:

np.ndarray

to_tex() dict[source]

Returns a short description in the form of a dictionary. Keys are: name - transformer name, desc - short description, params - class parameters ({} if there are none).

transform(X: DataFrame, y: Series = None) DataFrame[source]

Transforms the input data using fitted PCA.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Input data.

  • y (optional) – Target values (ignored).

Returns:

Transformed data.

Return type:

np.ndarray
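Standardization followed by PCA with a 95% variance target can be sketched directly with scikit-learn. This is a minimal sketch of the technique, not the class's actual code; the synthetic data and variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 6))
# Six columns that are noisy linear mixes of two underlying signals.
X = pd.DataFrame(base @ mix + 0.01 * rng.normal(size=(200, 6)),
                 columns=[f"x{i}" for i in range(6)])

# A float n_components in (0, 1) asks PCA to keep as many components as
# needed to explain at least that fraction of the variance.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95, svd_solver="full"))
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

Because the six columns are driven by two signals, at most two components should be needed to reach the 95% target.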

class auto_prep.preprocessing.dimention_reducing.UMAPDimentionReducer[source]

Bases: DimentionReducer

Reduces the dimensionality of the data using UMAP.

fit(X: DataFrame, y: Series | None = None) UMAPDimentionReducer[source]

Fits the UMAPDimentionReducer to the data.

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits the transformer to the data and then transforms it.

to_tex() dict[source]

Returns a short description in the form of a dictionary. Keys are: name - transformer name, desc - short description, params - class parameters ({} if there are none).

transform(X: DataFrame, y: Series = None) DataFrame[source]

Transforms the input data using the fitted UMAP reducer.

class auto_prep.preprocessing.dimention_reducing.VIFDimentionReducer[source]

Bases: DimentionReducer

Removes columns with high variance inflation factor (VIF > 10).

fit(X: DataFrame, y: Series | None = None) VIFDimentionReducer[source]

Fits the VIFDimentionReducer to the data, identifying columns with high VIF.

Parameters:
  • X (pd.DataFrame) – Input data.

  • y (optional) – Target values (ignored).

Returns:

The fitted transformer.

Return type:

VIFDimentionReducer

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits the VIFDimentionReducer to the data and then transforms it.

Parameters:
  • X (pd.DataFrame) – Input data.

  • y (optional) – Target values (ignored).

Returns:

Transformed data.

Return type:

pd.DataFrame

to_tex() dict[source]

Returns a short description in the form of a dictionary. Keys are: name - transformer name, desc - short description, params - class parameters ({} if there are none).

transform(X: DataFrame, y: Series = None) DataFrame[source]

Removes columns with high VIF from the data.

Parameters:
  • X (pd.DataFrame) – Input data.

  • y (optional) – Target values (ignored).

Returns:

Transformed data.

Return type:

pd.DataFrame
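For reference, the VIF of a column is 1 / (1 - R^2), where R^2 comes from regressing that column on all the others. The sketch below computes it with NumPy and removes columns above the limit of 10. The helpers vif and drop_high_vif, and the strategy of removing one column at a time and recomputing, are assumptions; this reference does not say whether the reducer works iteratively.

```python
import numpy as np
import pandas as pd


def vif(X: pd.DataFrame, column: str) -> float:
    """VIF of one column: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[column].to_numpy(dtype=float)
    others = X.drop(columns=[column]).to_numpy(dtype=float)
    design = np.column_stack([np.ones(len(X)), others])  # add an intercept term
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r_squared = 1.0 - ss_res / ss_tot
    return float("inf") if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)


def drop_high_vif(X: pd.DataFrame, limit: float = 10.0) -> pd.DataFrame:
    """Assumed strategy: repeatedly drop the column with the largest VIF above the limit."""
    X = X.copy()
    while X.shape[1] > 1:
        scores = {c: vif(X, c) for c in X.columns}
        worst = max(scores, key=scores.get)
        if scores[worst] <= limit:
            break
        X = X.drop(columns=[worst])
    return X


rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),             # independent column
})
X_reduced = drop_high_vif(X)
print(list(X_reduced.columns))
```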

auto_prep.preprocessing.encoding module

class auto_prep.preprocessing.encoding.ColumnEncoder[source]

Bases: RequiredStep, Categorical

Encoder for categorical features. This class applies different encoding techniques (OneHotEncoding or LabelEncoding) based on the number of unique values in each column.

For columns with fewer than 5 unique values, OneHotEncoder is used. For columns with 5 or more unique values, TolerantLabelEncoder is applied.

encoders

A dictionary of fitted encoders for each column.

Type:

dict

columns

A list of columns that have been encoded.

Type:

list

fit(X: DataFrame, y: Series | None = None) ColumnEncoder[source]

Fits the encoder to the categorical features in the data.

Parameters:
  • X (pd.DataFrame) – The feature data to fit the encoder to.

  • y (pd.Series, optional) – The target variable (to fit the encoder).

Returns:

The fitted encoder instance.

Return type:

ColumnEncoder

The encoder chooses between OneHotEncoder and TolerantLabelEncoder based on the number of unique values in each column. OneHotEncoder is used for columns with fewer than 5 unique values, and TolerantLabelEncoder is used for columns with 5 or more unique values.

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits and transforms the feature data using the encoder.

Parameters:
  • X (pd.DataFrame) – The feature data to transform.

  • y (pd.Series, optional) – The target variable (to append to the result).

Returns:

The transformed feature data, with encoded columns.

Return type:

pd.DataFrame

This method combines the fit and transform steps in one operation.

to_tex() dict[source]

Returns a short description in the form of a dictionary. Keys are: name - transformer name, desc - short description, params - class parameters ({} if there are none).

transform(X: DataFrame, y: Series = None) DataFrame[source]

Transforms the feature data using the fitted encoders.

Parameters:
  • X (pd.DataFrame) – The feature data to transform.

  • y (pd.Series, optional) – The target variable (to append to the result).

Returns:

The transformed feature data, with encoded columns.

Return type:

pd.DataFrame
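The cardinality rule (one-hot below 5 unique values, label encoding otherwise) can be sketched with pandas. Note the substitutions: pd.get_dummies stands in for OneHotEncoder and pandas category codes stand in for TolerantLabelEncoder, and the helper encode_columns is hypothetical.

```python
import pandas as pd


def encode_columns(X: pd.DataFrame, max_one_hot: int = 5) -> pd.DataFrame:
    """One-hot encode low-cardinality columns, integer-encode the rest."""
    encoded = []
    for name in X.columns:
        col = X[name]
        if col.nunique() < max_one_hot:
            # Fewer than 5 unique values: one indicator column per category.
            encoded.append(pd.get_dummies(col, prefix=name, dtype=int))
        else:
            # 5 or more unique values: a single column of integer codes.
            codes = col.astype("category").cat.codes
            encoded.append(codes.rename(name).to_frame())
    return pd.concat(encoded, axis=1)


X = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],  # 3 unique values
    "city": ["A", "B", "C", "D", "E", "F"],                   # 6 unique values
})
X_encoded = encode_columns(X)
print(X_encoded.columns.tolist())
```

One-hot encoding widens the frame by one column per category, which is why it is reserved for columns with few distinct values.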

auto_prep.preprocessing.feature_selecting module

class auto_prep.preprocessing.feature_selecting.CorrelationSelector[source]

Bases: NonRequiredStep, Numerical

Transformer to select correlation_percent% (rounded to whole number) of features that are most correlated with the target variable.

selected_columns

List of selected columns based on correlation with the target.

Type:

list

fit(X: DataFrame, y: Series | None = None) CorrelationSelector[source]

Identifies the top correlation_percent% (rounded to whole value) of features most correlated with the target variable.

Parameters:

X (pd.DataFrame) – The input feature data.

Returns:

The fitted transformer instance.

Return type:

CorrelationSelector

fit_transform(X: DataFrame, y: Series) DataFrame[source]

Fits and transforms the data by selecting the top correlation_percent% of features most correlated with the target variable. Performs fit and transform in one step.

Parameters:
  • X (pd.DataFrame) – The feature data.

  • y (pd.Series) – The target variable.

Returns:

The transformed data with selected features.

Return type:

pd.DataFrame

is_numerical() bool[source]
to_tex() dict[source]

Returns a short description of the transformer in dictionary format.

transform(X: DataFrame, y: Series = None) DataFrame[source]

Selects the top correlation_percent% of features most correlated with the target variable.

Parameters:
  • X (pd.DataFrame) – The feature data.

  • y (pd.Series, optional) – The target variable (to append to the result).

Returns:

The transformed data with only the selected top correlation_percent% most correlated features.

Return type:

pd.DataFrame
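Selecting the features most correlated with the target can be sketched with pandas. This is a sketch under assumptions: the helper top_correlated, the use of absolute Pearson correlation, and the rounding rule are illustrative choices, not confirmed details of the selector.

```python
import numpy as np
import pandas as pd


def top_correlated(X: pd.DataFrame, y: pd.Series, percent: float = 50.0) -> list:
    """Return the percent% of columns with the largest |Pearson r| against y."""
    strength = X.corrwith(y).abs().sort_values(ascending=False)
    # Assumption: round to a whole number of columns and keep at least one.
    n_keep = max(1, int(round(len(strength) * percent / 100.0)))
    return strength.index[:n_keep].tolist()


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])
y = 3.0 * X["f0"] - 2.0 * X["f1"] + 0.1 * rng.normal(size=200)

selected = top_correlated(X, y, percent=50.0)
print(selected)
```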

class auto_prep.preprocessing.feature_selecting.FeatureImportanceClassSelector[source]

Bases: FeatureImportanceSelector, Categorical

Transformer to select the k% (rounded to a whole number) of features that are most important according to a Random Forest model for classification.

k

The percentage of top features to keep based on their importance.

Type:

float

selected_columns

List of selected columns based on feature importance.

Type:

list

fit(X: DataFrame, y: Series) FeatureImportanceClassSelector[source]

Identifies the feature importances according to the Random Forest model.

Parameters:
  • X (pd.DataFrame) – The input feature data.

  • y (pd.Series) – The target variable.

Returns:

The fitted transformer instance.

Return type:

FeatureImportanceClassSelector

fit_transform(X: DataFrame, y: Series) DataFrame[source]

Fits and transforms the data by selecting the top k% most important features. Performs fit and transform in one step.

Parameters:
  • X (pd.DataFrame) – The feature data.

  • y (pd.Series) – The target variable.

to_tex() dict[source]

Returns a short description of the transformer in dictionary format.

transform(X: DataFrame, y: Series = None) DataFrame[source]

Selects the top k% of features most important according to the Random Forest model.

Parameters:
  • X (pd.DataFrame) – The feature data.

  • y (pd.Series, optional) – The target variable (to append to the result).

Returns:

The transformed data with only the selected top k% important features.

Return type:

pd.DataFrame

class auto_prep.preprocessing.feature_selecting.FeatureImportanceRegressSelector[source]

Bases: FeatureImportanceSelector, Numerical

Transformer to select the k% (rounded to a whole number) of features that are most important according to a Random Forest model for regression.

k

The percentage of top features to keep based on their importance.

Type:

float

selected_columns

List of selected columns based on feature importance.

Type:

list

fit(X: DataFrame, y: Series) FeatureImportanceRegressSelector[source]

Identifies the feature importances according to the Random Forest model.

Parameters:
  • X (pd.DataFrame) – The input feature data.

  • y (pd.Series) – The target variable.

Returns:

The fitted transformer instance.

Return type:

FeatureImportanceRegressSelector

fit_transform(X: DataFrame, y)[source]

Fits and transforms the data by selecting the top k% most important features. Performs fit and transform in one step.

Parameters:
  • X (pd.DataFrame) – The feature data.

  • y (pd.Series) – The target variable.

to_tex() dict[source]

Returns a short description of the transformer in dictionary format.

transform(X: DataFrame, y: Series = None) DataFrame[source]

Selects the top k% of features most important according to the Random Forest model.

Parameters:
  • X (pd.DataFrame) – The feature data.

  • y (pd.Series, optional) – The target variable (to append to the result).

Returns:

The transformed data with only the selected top k% important features.

Return type:

pd.DataFrame

auto_prep.preprocessing.handler module

class auto_prep.preprocessing.handler.PreprocessingHandler[source]

Bases: object

run(X_train: DataFrame, y_train: Series, X_valid: DataFrame, y_valid: Series, task: str)[source]

Performs dataset preprocessing and scoring.

Parameters:
  • X_train (pd.DataFrame) – Training feature dataset.

  • y_train (pd.Series) – Training target dataset.

  • X_valid (pd.DataFrame) – Validation feature dataset.

  • y_valid (pd.Series) – Validation target dataset.

  • task (str) – regression / classification

static score_pipeline_with_model(preprocessing_pipeline: Pipeline, model: BaseEstimator, score_func: callable, X_train: DataFrame, y_train: Series, X_valid: DataFrame, y_valid: Series) float[source]

Evaluates the performance of a given preprocessing pipeline with a model on validation data.

Parameters:
  • preprocessing_pipeline (Pipeline) – The preprocessing pipeline to be evaluated.

  • model (BaseEstimator) – The model to be used for scoring.

  • score_func (callable) – Scoring function applied to the model predictions and y_valid.

  • X_train (pd.DataFrame) – Training feature dataset.

  • y_train (pd.Series) – Training target dataset.

  • X_valid (pd.DataFrame) – Validation feature dataset.

  • y_valid (pd.Series) – Validation target dataset.

Returns:

The score of the pipeline on the validation data.

Return type:

float
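The evaluation flow described above (fit preprocessing and model on the training split, score on the validation split) might look like the sketch below. The function score_pipeline is a hypothetical stand-in, not the library's code, and the score_func(y_true, y_pred) argument order follows the scikit-learn metric convention, which is an assumption here.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def score_pipeline(preprocessing_pipeline, model, score_func,
                   X_train, y_train, X_valid, y_valid) -> float:
    """Fit preprocessing and model on the training split, score on the validation split."""
    pipeline = clone(preprocessing_pipeline)
    X_train_prepared = pipeline.fit_transform(X_train, y_train)
    X_valid_prepared = pipeline.transform(X_valid)  # no refitting on validation data
    fitted = clone(model).fit(X_train_prepared, y_train)
    # Assumed argument order: score_func(y_true, y_pred).
    return float(score_func(y_valid, fitted.predict(X_valid_prepared)))


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["x0", "x1", "x2"])
y = (X["x0"] + X["x1"] > 0).astype(int)
X_train, X_valid = X.iloc[:200], X.iloc[200:]
y_train, y_valid = y.iloc[:200], y.iloc[200:]

preprocessing = Pipeline([("scale", StandardScaler())])
score = score_pipeline(preprocessing, LogisticRegression(), accuracy_score,
                       X_train, y_train, X_valid, y_valid)
print(score)
```

Transforming the validation data with the already fitted pipeline, rather than refitting, keeps validation information from leaking into preprocessing.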

write_to_raport(raport)[source]

Writes the overview section to the given report.

auto_prep.preprocessing.imputing module

class auto_prep.preprocessing.imputing.NAImputer[source]

Bases: RequiredStep, NumericalCategorical

Base class for imputing missing values. Provides functionality to identify columns with missing values and determine the strategy to handle them (remove columns with >50% missing data).

numeric_features

A list of numeric feature names.

Type:

list

categorical_features

A list of categorical feature names.

Type:

list

fit(X: DataFrame, y: Series | None = None) NAImputer[source]

Identifies columns with more than 50% missing values and removes them from the dataset.

Parameters:

X (pd.DataFrame) – The input data with missing values.

Returns:

The fitted imputer instance.

Return type:

NAImputer

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits and transforms the input data by imputing missing values.

Parameters:

X (pd.DataFrame) – The input data.

Returns:

The transformed data with missing values imputed.

Return type:

pd.DataFrame

to_tex() dict[source]

Returns a description of the transformer in dictionary format.

transform(X: DataFrame) DataFrame[source]

Removes previously identified columns with >50% missing values.

Parameters:

X (pd.DataFrame) – The input data to transform.

Returns:

The transformed data.

Return type:

pd.DataFrame
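The column-removal rule (drop columns with more than 50% missing values) can be sketched in a few lines of pandas. The example data and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],  # 75% missing
    "half_missing": [1.0, np.nan, 3.0, np.nan],       # exactly 50% missing
    "complete": [1.0, 2.0, 3.0, 4.0],
})

missing_share = X.isna().mean()  # fraction of missing values per column
to_drop = missing_share[missing_share > 0.5].index.tolist()
X_clean = X.drop(columns=to_drop)
print(to_drop)
```

Because the comparison is strict, a column with exactly 50% missing values is kept in this sketch.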

auto_prep.preprocessing.outlier_detecting module

class auto_prep.preprocessing.outlier_detecting.OutlierDetector[source]

Bases: RequiredStep, Numerical

Performs outlier detection on numerical data.

fit(X: DataFrame, y: Series | None = None) OutlierDetector[source]

Identify feature types in the dataset.

Parameters:
  • X (pd.DataFrame) – Input features.

  • y – Ignored. Exists for scikit-learn compatibility.

Returns:

Fitted transformer.

Return type:

OutlierDetector

Raises:

ValueError – If X contains a non-numerical column.

fit_transform(X: DataFrame, y: Series = None)[source]

Fit and transform the data in one step.

Parameters:
  • X (pd.DataFrame) – Input data.

  • y (pd.Series) – Target data.

Returns:

Transformed data

Return type:

pd.DataFrame

to_tex() dict[source]

Returns a short description in the form of a dictionary. Keys are: name - transformer name, desc - short description, params - class parameters ({} if there are none).

transform(X: DataFrame, y: Series = None) DataFrame[source]

Applies cleaning and transformation operations to the input data.

Parameters:
  • X (pd.DataFrame) – The input DataFrame to be cleaned and transformed.

  • y (pd.Series) – The target data.

Returns:

The cleaned and transformed DataFrame.

Return type:

pd.DataFrame

auto_prep.preprocessing.redundancy_filtering module

class auto_prep.preprocessing.redundancy_filtering.UniqueFilter[source]

Bases: RequiredStep, Categorical

Transformer to remove categorical columns with 100% unique values.

dropped_columns

List of dropped columns.

Type:

list

fit(X: DataFrame, y: Series | None = None) UniqueFilter[source]

Identifies categorical columns with 100% unique values.

Parameters:

X (pd.DataFrame) – The input feature data.

Returns:

The fitted transformer instance.

Return type:

UniqueFilter

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits and transforms the data in one step.

Parameters:

X (pd.DataFrame) – The feature data.

Returns:

The transformed data without dropped columns.

Return type:

pd.DataFrame

to_tex() dict[source]

Returns a description of the transformer in dictionary format.

transform(X: DataFrame) DataFrame[source]

Drops the identified categorical columns with 100% unique values based on the fit method.

Parameters:

X (pd.DataFrame) – The feature data.

Returns:

The transformed data without dropped columns.

Return type:

pd.DataFrame
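A categorical column in which every value is distinct (an identifier, for example) can be detected as sketched below. Treating every non-numeric column as categorical is an assumption of this sketch.

```python
import pandas as pd

X = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],    # every value distinct
    "plan": ["free", "pro", "free", "pro"],
    "age": [31, 42, 27, 55],                # numeric, so not considered
})

# Assumption: non-numeric columns are the categorical ones.
categorical = X.select_dtypes(exclude="number").columns
to_drop = [c for c in categorical if X[c].nunique() == len(X)]
X_filtered = X.drop(columns=to_drop)
print(to_drop)
```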

auto_prep.preprocessing.scaling module

class auto_prep.preprocessing.scaling.ColumnScaler(method: str = 'standard')[source]

Bases: RequiredStep, Numerical

Scaler for all numerical features. This class applies the scaling technique chosen by the user to all numerical features.

Available scaling methods: MinMaxScaler, StandardScaler, RobustScaler from sklearn.

scaler

fitted scaler instance.

Type:

object

PARAMS_GRID = {'method': ['standard', 'minmax', 'robust']}
fit(X: DataFrame, y: Series | None = None) ColumnScaler[source]

Fits the chosen scaler to the numerical features in the data.

Parameters:

X (pd.DataFrame) – The feature data to fit the scaler to.

Returns:

The fitted scaler instance.

Return type:

ColumnScaler

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits and transforms the feature data using the chosen scaler.

Parameters:
  • X (pd.DataFrame) – The feature data to transform.

  • y (pd.Series, optional) – The target variable (to append to the result).

Returns:

The transformed feature data.

Return type:

pd.DataFrame

is_numerical() bool[source]
to_tex() dict[source]

Returns a short description of the scaler that was used, in the form of a dictionary.

transform(X: DataFrame, y: Series = None) DataFrame[source]

Transforms numeric feature data using the fitted scaler.

Parameters:
  • X (pd.DataFrame) – The feature data to transform.

  • y (pd.Series, optional) – The target variable.

Returns:

The transformed feature data.

Return type:

pd.DataFrame
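How the method argument could map onto the three scikit-learn scalers is sketched below. The SCALERS mapping and the helper scale_numeric are assumptions inferred from PARAMS_GRID and the list of available scaling methods, not the class's actual code.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Assumed mapping from the `method` values in PARAMS_GRID to sklearn scalers.
SCALERS = {"standard": StandardScaler, "minmax": MinMaxScaler, "robust": RobustScaler}


def scale_numeric(X: pd.DataFrame, method: str = "standard") -> pd.DataFrame:
    """Scale numeric columns with the chosen scaler; leave other columns untouched."""
    numeric = X.select_dtypes(include="number").columns
    scaled = X.copy()
    scaled[numeric] = SCALERS[method]().fit_transform(X[numeric])
    return scaled


X = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "label": ["a", "b", "a", "b"],
})
scaled = scale_numeric(X, method="minmax")
print(scaled)
```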

auto_prep.preprocessing.utils module

class auto_prep.preprocessing.utils.TolerantLabelEncoder(ignore_unknown=True, unknown_original_value='unknown', unknown_encoded_value=-1)[source]

Bases: LabelEncoder

inverse_transform(y)[source]

Transform labels back to original encoding.

Parameters:

y (ndarray of shape (n_samples,)) – Target values.

Returns:

y – Original encoding.

Return type:

ndarray of shape (n_samples,)

set_transform_request(*, column: bool | None | str = '$UNCHANGED$') TolerantLabelEncoder

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

column (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for column parameter in transform.

Returns:

self – The updated object.

Return type:

object

transform(y, column)[source]

Transform labels to normalized encoding.

Parameters:

y (array-like of shape (n_samples,)) – Target values.

Returns:

y – Labels as normalized encodings.

Return type:

array-like of shape (n_samples,)
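The constructor arguments suggest that labels unseen during fitting are mapped to a fallback code instead of raising an error. That behaviour is not documented here, so the sketch below is only an illustration of the idea: fit_classes and tolerant_transform are hypothetical helpers, not methods of TolerantLabelEncoder.

```python
def fit_classes(values) -> list:
    """Learn the sorted list of known labels, as a label encoder would."""
    return sorted(set(values))


def tolerant_transform(values, classes: list, unknown_encoded_value: int = -1) -> list:
    """Encode known labels by position in `classes`; map unseen labels to the fallback code."""
    lookup = {label: code for code, label in enumerate(classes)}
    return [lookup.get(v, unknown_encoded_value) for v in values]


classes = fit_classes(["cat", "dog", "bird"])
codes = tolerant_transform(["dog", "fish", "bird"], classes)
print(codes)
```

Here "fish" was never seen during fitting, so it receives the fallback code rather than causing a failure at transform time.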

auto_prep.preprocessing.variance_filtering module

class auto_prep.preprocessing.variance_filtering.VarianceFilter[source]

Bases: RequiredStep, Numerical

Transformer to remove numerical columns with zero variance.

dropped_columns

List of dropped columns.

Type:

list

fit(X: DataFrame, y: Series | None = None) VarianceFilter[source]

Identifies columns with zero variance and adds them to the dropped_columns list.

Parameters:

X (pd.DataFrame) – The input feature data.

Returns:

The fitted transformer instance.

Return type:

VarianceFilter

fit_transform(X: DataFrame, y: Series = None) DataFrame[source]

Fits and transforms the data in one step.

Parameters:

X (pd.DataFrame) – The feature data.

Returns:

The transformed data without dropped columns.

Return type:

pd.DataFrame

to_tex() dict[source]

Returns a description of the transformer in dictionary format.

transform(X: DataFrame, y: Series = None) DataFrame[source]

Drops the identified columns with zero variance based on the fit method.

Parameters:

X (pd.DataFrame) – The feature data.

Returns:

The transformed data without dropped columns.

Return type:

pd.DataFrame
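Detecting zero-variance numerical columns can be sketched with pandas. The example data and variable names are illustrative assumptions.

```python
import pandas as pd

X = pd.DataFrame({
    "constant": [7.0, 7.0, 7.0, 7.0],  # zero variance
    "varying": [1.0, 2.0, 3.0, 4.0],
    "label": ["a", "b", "a", "b"],     # non-numeric, so not considered
})

variances = X.select_dtypes(include="number").var()
to_drop = variances[variances == 0].index.tolist()
X_filtered = X.drop(columns=to_drop)
print(to_drop)
```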

Module contents