auto_prep.preprocessing package
Submodules
auto_prep.preprocessing.abstract module
- class auto_prep.preprocessing.abstract.DimentionReducer[source]
Bases:
NonRequiredStep, Numerical, ABC
Abstract class for dimensionality reduction techniques.
- abstract fit(X: DataFrame, y: Series | None = None) DimentionReducer [source]
- abstract fit_transform(X: DataFrame, y: Series = None) DataFrame [source]
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- class auto_prep.preprocessing.abstract.FeatureImportanceSelector(k: float = 10.0)[source]
Bases:
NonRequiredStep
Transformer to select k% (rounded to whole number) of features that are most important according to Random Forest model.
- k
The percentage of top features to keep based on their importance.
- Type:
float
- selected_columns
List of selected columns based on feature importance.
- Type:
list
- fit(X: DataFrame, y: Series | None = None) FeatureImportanceSelector [source]
Identifies the top k% (rounded to whole value) of features most important according to the model.
- Parameters:
X (pd.DataFrame) – The input feature data.
y (pd.Series) – The target variable.
- Returns:
The fitted transformer instance.
- Return type:
FeatureImportanceSelector
- fit_transform(X: DataFrame, y: Series) DataFrame [source]
Fits and transforms the data by selecting the top k% most important features. Performs fit and transform in one step.
- Parameters:
X (pd.DataFrame) – The feature data.
y (pd.Series) – The target variable.
- Returns:
The transformed data with selected features.
- Return type:
pd.DataFrame
- transform(X: DataFrame, y: Series = None) DataFrame [source]
Selects the top k% of features most important according to the model.
- Parameters:
X (pd.DataFrame) – The feature data.
y (pd.Series, optional) – The target variable (to append to the result).
- Returns:
The transformed data with only the selected top k% important features.
- Return type:
pd.DataFrame
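The selection described above can be sketched roughly as follows. This is not the auto_prep implementation: the helper name `top_k_percent_by_importance`, the choice of `RandomForestClassifier`, the forest settings, and the rule of keeping at least one feature are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def top_k_percent_by_importance(X: pd.DataFrame, y: pd.Series, k: float = 10.0) -> list:
    """Return the names of the top k% of features ranked by Random Forest importance."""
    # Forest settings are illustrative assumptions, not auto_prep defaults.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    # Round k% of the column count to a whole number; keep at least one (assumption).
    n_keep = max(1, round(len(X.columns) * k / 100))
    return importances.sort_values(ascending=False).head(n_keep).index.tolist()


# Toy data: "signal" determines the target, the other columns carry no information.
X = pd.DataFrame({
    "signal": [0, 0, 0, 0, 1, 1, 1, 1] * 5,
    "noise_a": [1, 2, 3, 4, 1, 2, 3, 4] * 5,
    "noise_b": [5, 5, 6, 6, 5, 5, 6, 6] * 5,
    "noise_c": [9, 8, 9, 8, 9, 8, 9, 8] * 5,
})
y = X["signal"]
selected = top_k_percent_by_importance(X, y, k=25.0)
```

With four columns and k=25.0, exactly one feature is kept.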
auto_prep.preprocessing.binning module
- class auto_prep.preprocessing.binning.BinningTransformer(binning_method: str = 'qcut')[source]
Bases:
NonRequiredStep, Numerical
Transformer that performs quantile-based binning (using qcut) or equal-width binning (using cut) on continuous variables and replaces the values with numeric labels. A column is binned only if its number of unique values exceeds 50% of the number of samples in the column.
- threshold
Fraction of unique values a column must exceed to qualify for binning. Default: 0.5.
- Type:
float
- should_bin
dictionary to track which columns should be binned.
- Type:
dict
- bin_edges
dictionary to store the bin edges for each column.
- Type:
dict
- PARAMS_GRID = {'binning_method': ['qcut', 'cut']}
- fit(X: DataFrame, y: Series | None = None) BinningTransformer [source]
Fits the transformer by calculating the bin edges for each continuous column if the number of unique values exceeds the threshold of 50%.
- Parameters:
X (pd.DataFrame) – The input feature data.
- Returns:
The fitted transformer instance.
- Return type:
BinningTransformer
- fit_transform(X: DataFrame, y: Series = None) DataFrame [source]
Fits and transforms the data in one step.
- Parameters:
X (pd.DataFrame) – The feature data.
- Returns:
The transformed data with bin labels.
- Return type:
pd.DataFrame
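As a minimal sketch of the two binning methods and the uniqueness threshold, assuming pandas' `qcut` and `cut` are what the method names refer to: the helper name `bin_column` and the bin count of 4 are hypothetical, since the number of bins is not documented here.

```python
import pandas as pd


def bin_column(values: pd.Series, method: str = "qcut", bins: int = 4,
               threshold: float = 0.5) -> pd.Series:
    """Replace values with numeric bin labels if the column is 'continuous enough'."""
    # Leave the column alone unless its share of unique values exceeds the threshold.
    if values.nunique() / len(values) <= threshold:
        return values
    if method == "qcut":
        # Quantile-based bins: roughly the same number of samples per bin.
        return pd.qcut(values, q=bins, labels=False, duplicates="drop")
    # Equal-width bins over the observed value range.
    return pd.cut(values, bins=bins, labels=False)


ages = pd.Series([18, 22, 25, 31, 38, 44, 52, 67])
quantile_labels = bin_column(ages, method="qcut")  # labels 0,0,1,1,2,2,3,3
width_labels = bin_column(ages, method="cut")      # labels 0,0,0,1,1,2,2,3

# Only 2 unique values in 8 samples (25%), so this column is returned unchanged.
flags = bin_column(pd.Series([1, 1, 1, 2, 2, 2, 1, 1]))
```

The quantile variant balances bin sizes, while the equal-width variant keeps bin boundaries evenly spaced, which is why the two label sequences differ on skewed data.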
auto_prep.preprocessing.correlation_filtering module
- class auto_prep.preprocessing.correlation_filtering.CorrelationFilter[source]
Bases:
RequiredStep, Numerical
Transformer to detect highly correlated features and drop one of each pair. Pearson's correlation is used. This is a required step in preprocessing.
- dropped_columns
List of columns that were dropped due to high correlation.
- Type:
list
- fit(X: DataFrame, y: Series | None = None) CorrelationFilter [source]
Identifies highly correlated features. Adds the second one from the pair to the list of columns to be dropped.
- Parameters:
X (pd.DataFrame) – The input feature data.
- Returns:
The fitted filter instance.
- Return type:
CorrelationFilter
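The pair-wise rule described above (drop the second column of each highly correlated pair) might look like the sketch below. The helper name `find_correlated_columns` and the 0.8 cut-off are assumptions; the threshold auto_prep actually uses is not stated in this reference.

```python
import pandas as pd


def find_correlated_columns(X: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return the second column of every pair whose |Pearson r| exceeds the threshold."""
    corr = X.corr(method="pearson").abs()
    columns = list(X.columns)
    to_drop = []
    for i, first in enumerate(columns):
        if first in to_drop:
            continue  # skip columns already marked for dropping
        for second in columns[i + 1:]:
            if second not in to_drop and corr.loc[first, second] > threshold:
                to_drop.append(second)
    return to_drop


X = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # nearly a rescaled copy of height_cm
    "shoe_size": [42, 38, 44, 39, 41],
})
dropped = find_correlated_columns(X)
X_filtered = X.drop(columns=dropped)
```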
auto_prep.preprocessing.dimention_reducing module
- class auto_prep.preprocessing.dimention_reducing.PCADimentionReducer[source]
Bases:
DimentionReducer
Combines data standardization and PCA with automatic selection of the number of components to preserve 95% of the variance.
- fit(X: DataFrame, y: Series | None = None) PCADimentionReducer [source]
Fits PCA to the data, determining the number of components to preserve 95% of the variance.
- Parameters:
X (pd.DataFrame or np.ndarray) – Input data.
y (optional) – Target values (ignored).
- Returns:
The fitted transformer.
- Return type:
PCADimentionReducer
- fit_transform(X: DataFrame, y: Series = None) DataFrame [source]
Fits the transformer to the data and then transforms it.
- Parameters:
X (pd.DataFrame or np.ndarray) – Input data.
y (optional) – Target values (ignored).
- Returns:
Transformed data.
- Return type:
np.ndarray
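Standardization followed by PCA with a 95% variance target can be approximated with plain scikit-learn, as in the sketch below. It assumes the class behaves like `StandardScaler` plus `PCA(n_components=0.95)`; the actual internals of PCADimentionReducer may differ, and the toy data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Four columns, but only two independent directions of variation.
X = pd.DataFrame({
    "a": base[:, 0],
    "b": base[:, 1],
    "a_copy": base[:, 0] + rng.normal(scale=0.01, size=200),
    "b_copy": base[:, 1] + rng.normal(scale=0.01, size=200),
})

# A float n_components in (0, 1) asks PCA to keep the smallest number of
# components whose cumulative explained variance exceeds that fraction.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)
n_components = reducer.named_steps["pca"].n_components_
```

Here two components are enough, so the four input columns collapse to two.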
- class auto_prep.preprocessing.dimention_reducing.UMAPDimentionReducer[source]
Bases:
DimentionReducer
Reduces the dimensionality of the data using UMAP.
- fit(X: DataFrame, y: Series | None = None) UMAPDimentionReducer [source]
Fits the UMAPDimentionReducer to the data.
- fit_transform(X: DataFrame, y: Series = None) DataFrame [source]
Fits the transformer to the data and then transforms it.
- class auto_prep.preprocessing.dimention_reducing.VIFDimentionReducer[source]
Bases:
DimentionReducer
Removes columns with high variance inflation factor (VIF > 10).
- fit(X: DataFrame, y: Series | None = None) VIFDimentionReducer [source]
Fits the VIFDimentionReducer to the data, identifying columns with high VIF.
- Parameters:
X (pd.DataFrame) – Input data.
y (optional) – Target values (ignored).
- Returns:
The fitted transformer.
- Return type:
VIFDimentionReducer
- fit_transform(X: DataFrame, y: Series = None) DataFrame [source]
Fits the VIFDimentionReducer to the data and then transforms it.
- Parameters:
X (pd.DataFrame) – Input data.
y (optional) – Target values (ignored).
- Returns:
Transformed data.
- Return type:
pd.DataFrame
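For reference, the variance inflation factor of column j is 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on all other columns. The sketch below computes it with NumPy and removes columns one at a time until every VIF is at most 10. The helper names and the one-at-a-time removal strategy are assumptions; this reference does not say how VIFDimentionReducer orders its removals.

```python
import numpy as np
import pandas as pd


def variance_inflation_factors(X: pd.DataFrame) -> pd.Series:
    """Compute VIF_j = 1 / (1 - R_j^2), regressing each column on all the others."""
    vifs = {}
    for column in X.columns:
        target = X[column].to_numpy(dtype=float)
        others = X.drop(columns=column).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])  # add an intercept term
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs[column] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return pd.Series(vifs)


def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Drop the highest-VIF column repeatedly until all VIFs are <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = variance_inflation_factors(X)
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=vifs.idxmax())
    return X


rng = np.random.default_rng(0)
x1, x2, x4 = rng.normal(size=(3, 300))
X = pd.DataFrame({"x1": x1, "x2": x2,
                  "x3": x1 + x2 + rng.normal(scale=0.01, size=300),  # near-collinear
                  "x4": x4})
X_reduced = drop_high_vif(X)
```

Recomputing the factors after each removal matters because dropping one collinear column usually lowers the VIF of the columns it was entangled with.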
auto_prep.preprocessing.encoding module
- class auto_prep.preprocessing.encoding.ColumnEncoder[source]
Bases:
RequiredStep, Categorical
Encoder for categorical features. This class applies one of two encoding techniques (one-hot encoding or label encoding) based on the number of unique values in each column.
For columns with fewer than 5 unique values, OneHotEncoder is used. For columns with 5 or more unique values, TolerantLabelEncoder is applied.
- encoders
A dictionary of fitted encoders for each column.
- Type:
dict
- columns
A list of columns that have been encoded.
- Type:
list
- fit(X: DataFrame, y: Series | None = None) ColumnEncoder [source]
Fits the encoder to the categorical features in the data.
- Parameters:
X (pd.DataFrame) – The feature data to fit the encoder to.
y (pd.Series, optional) – The target variable (to fit the encoder).
- Returns:
The fitted encoder instance.
- Return type:
ColumnEncoder
The encoder chooses between OneHotEncoder and TolerantLabelEncoder based on the number of unique values in each column. OneHotEncoder is used for columns with fewer than 5 unique values, and TolerantLabelEncoder is used for columns with 5 or more unique values.
- fit_transform(X: DataFrame, y: Series = None) DataFrame [source]
Fits and transforms the feature data using the encoder.
- Parameters:
X (pd.DataFrame) – The feature data to transform.
y (pd.Series, optional) – The target variable (to append to the result).
- Returns:
The transformed feature data, with encoded columns.
- Return type:
pd.DataFrame
This method combines the fit and transform steps in one operation.
- to_tex() dict [source]
Returns a short description in the form of a dictionary. Keys are: name - transformer name, desc - short description, params - class parameters (if None then {}).
- transform(X: DataFrame, y: Series = None) DataFrame [source]
Transforms the feature data using the fitted encoders.
- Parameters:
X (pd.DataFrame) – The feature data to transform.
y (pd.Series, optional) – The target variable (to append to the result).
- Returns:
The transformed feature data, with encoded columns.
- Return type:
pd.DataFrame
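The cardinality rule described for ColumnEncoder can be illustrated with pandas alone. In this sketch `pandas.get_dummies` and `pandas.factorize` are stand-ins for OneHotEncoder and TolerantLabelEncoder, and the helper name `encode_columns` is hypothetical, so output column names and dtypes will not necessarily match what auto_prep produces.

```python
import pandas as pd


def encode_columns(X: pd.DataFrame, onehot_below: int = 5) -> pd.DataFrame:
    """One-hot encode columns with fewer than `onehot_below` unique values; label-encode the rest."""
    encoded = []
    for column in X.columns:
        if X[column].nunique() < onehot_below:
            # Low cardinality: one indicator column per category.
            encoded.append(pd.get_dummies(X[column], prefix=column, dtype=int))
        else:
            # High cardinality: a single integer code per category (sorted order).
            codes, _ = pd.factorize(X[column], sort=True)
            encoded.append(pd.Series(codes, index=X.index, name=column))
    return pd.concat(encoded, axis=1)


X = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S", "L"],                        # 3 unique values
    "city": ["Oslo", "Lima", "Rome", "Kyiv", "Bern", "Cairo"],     # 6 unique values
})
encoded = encode_columns(X)
```

One-hot encoding avoids implying an order between categories, but it adds one column per category, which is why a cap on cardinality is a common compromise.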
auto_prep.preprocessing.feature_selecting module
- class auto_prep.preprocessing.feature_selecting.CorrelationSelector[source]
Bases:
NonRequiredStep, Numerical
Transformer to select correlation_percent% (rounded to whole number) of features that are most correlated with the target variable.
- selected_columns
List of selected columns based on correlation with the target.
- Type:
list
- fit(X: DataFrame, y: Series | None = None) CorrelationSelector [source]
Identifies the top correlation_percent% (rounded to whole value) of features most correlated with the target variable.
- Parameters:
X (pd.DataFrame) – The input feature data.
- Returns:
The fitted transformer instance.
- Return type:
CorrelationSelector
- fit_transform(X: DataFrame, y: Series) DataFrame [source]
Fits and transforms the data by selecting the top correlation_percent% most correlated features. Performs fit and transform in one step.
- Parameters:
X (pd.DataFrame) – The feature data.
y (pd.Series) – The target variable.
- Returns:
The transformed data with selected features.
- Return type:
pd.DataFrame
- transform(X: DataFrame, y: Series = None) DataFrame [source]
Selects the top correlation_percent% of features most correlated with the target variable.
- Parameters:
X (pd.DataFrame) – The feature data.
y (pd.Series, optional) – The target variable (to append to the result).
- Returns:
The transformed data with only the selected top correlation_percent% correlated features.
- Return type:
pd.DataFrame
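A rough sketch of ranking features by their correlation with the target is given below. It assumes absolute Pearson correlation is the ranking criterion and that at least one feature is always kept; the helper name `top_correlated_features` and the default of 50% are illustrative only.

```python
import pandas as pd


def top_correlated_features(X: pd.DataFrame, y: pd.Series,
                            correlation_percent: float = 50.0) -> list:
    """Return the top correlation_percent% of columns by |correlation with y|."""
    strength = X.corrwith(y).abs()
    # Round the share of columns to a whole number; keep at least one (assumption).
    n_keep = max(1, round(len(X.columns) * correlation_percent / 100))
    return strength.sort_values(ascending=False).head(n_keep).index.tolist()


y = pd.Series([1, 2, 3, 4, 5, 6])
X = pd.DataFrame({
    "strong": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],   # rises with y
    "inverse": [6, 5, 4, 3, 2, 0],              # falls as y rises
    "noise_a": [3, 1, 2, 3, 1, 2],
    "noise_b": [2, 2, 1, 1, 2, 2],
})
selected = top_correlated_features(X, y, correlation_percent=50.0)
```

Taking the absolute value keeps strongly negative relationships such as `inverse`, which are as informative as positive ones.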
- class auto_prep.preprocessing.feature_selecting.FeatureImportanceClassSelector[source]
Bases:
FeatureImportanceSelector, Categorical
Transformer to select k% (rounded to whole number) of features that are most important according to Random Forest model for classification.
- k
The percentage of top features to keep based on their importance.
- Type:
float
- selected_columns
List of selected columns based on feature importance.
- Type:
list
- fit(X: DataFrame, y: Series) FeatureImportanceClassSelector [source]
Identifies the feature importances according to the Random Forest model.
- Parameters:
X (pd.DataFrame) – The input feature data.
y (pd.Series) – The target variable.
- Returns:
The fitted transformer instance.
- Return type:
FeatureImportanceClassSelector
- fit_transform(X: DataFrame, y: Series) DataFrame [source]
Fits and transforms the data by selecting the top k% most important features. Performs fit and transform in one step.
- Parameters:
X (pd.DataFrame) – The feature data.
y (pd.Series) – The target variable.
- transform(X: DataFrame, y: Series = None) DataFrame [source]
Selects the top k% of features most important according to the Random Forest model.
- Parameters:
X (pd.DataFrame) – The feature data.
y (pd.Series, optional) – The target variable (to append to the result).
- Returns:
The transformed data with only the selected top k% important features.
- Return type:
pd.DataFrame
- class auto_prep.preprocessing.feature_selecting.FeatureImportanceRegressSelector[source]
Bases:
FeatureImportanceSelector, Numerical
Transformer to select k% (rounded to whole number) of features that are most important according to Random Forest model for regression.
- k
The percentage of top features to keep based on their importance.
- Type:
float
- selected_columns
List of selected columns based on feature importance.
- Type:
list
- fit(X: DataFrame, y: Series) FeatureImportanceRegressSelector [source]
Identifies the feature importances according to the Random Forest model.
- Parameters:
X (pd.DataFrame) – The input feature data.
y (pd.Series) – The target variable.
- Returns:
The fitted transformer instance.
- Return type:
FeatureImportanceRegressSelector
- fit_transform(X: DataFrame, y)[source]
Fits and transforms the data by selecting the top k% most important features. Performs fit and transform in one step.
- Parameters:
X (pd.DataFrame) – The feature data.
y (pd.Series) – The target variable.
- transform(X: DataFrame, y: Series = None) DataFrame [source]
Selects the top k% of features most important according to the Random Forest model.
- Parameters:
X (pd.DataFrame) – The feature data.
y (pd.Series, optional) – The target variable (to append to the result).
- Returns:
The transformed data with only the selected top k% important features.
- Return type:
pd.DataFrame
auto_prep.preprocessing.handler module
- class auto_prep.preprocessing.handler.PreprocessingHandler[source]
Bases:
object
- run(X_train: DataFrame, y_train: Series, X_valid: DataFrame, y_valid: Series, task: str)[source]
Performs dataset preprocessing and scoring.
- Parameters:
X_train (pd.DataFrame) – Training feature dataset.
y_train (pd.Series) – Training target dataset.
X_valid (pd.DataFrame) – Validation feature dataset.
y_valid (pd.Series) – Validation target dataset.
task (str) – Task type: regression or classification.
- static score_pipeline_with_model(preprocessing_pipeline: Pipeline, model: BaseEstimator, score_func: callable, X_train: DataFrame, y_train: Series, X_valid: DataFrame, y_valid: Series) float [source]
Evaluates the performance of a given preprocessing pipeline with a model on validation data.
- Parameters:
preprocessing_pipeline (Pipeline) – The preprocessing pipeline to be evaluated.
model (BaseEstimator) – The model to be used for scoring.
score_func (callable) – Scoring function applied to the model predictions and y_valid.
X_train (pd.DataFrame) – Training feature dataset.
y_train (pd.Series) – Training target dataset.
X_valid (pd.DataFrame) – Validation feature dataset.
y_valid (pd.Series) – Validation target dataset.
- Returns:
The score of the pipeline on the validation data.
- Return type:
float
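The evaluation flow described for score_pipeline_with_model can be sketched as below. This is an illustration under assumptions, not the library's code: the function name `score_pipeline` is hypothetical, and `score_func` is assumed to follow the scikit-learn metric convention `score_func(y_true, y_pred)`.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def score_pipeline(preprocessing_pipeline, model, score_func,
                   X_train, y_train, X_valid, y_valid) -> float:
    """Fit preprocessing and model on the training split, then score on validation."""
    # Fit preprocessing on training data only, then reuse it on validation data
    # so that no validation statistics leak into the fitted steps.
    X_train_prepared = preprocessing_pipeline.fit_transform(X_train, y_train)
    X_valid_prepared = preprocessing_pipeline.transform(X_valid)
    fitted_model = clone(model).fit(X_train_prepared, y_train)
    # Assumed argument order: (y_true, y_pred), as in sklearn.metrics.
    return float(score_func(y_valid, fitted_model.predict(X_valid_prepared)))


X_train = pd.DataFrame({"x": range(20)})
y_train = 3 * X_train["x"] + 1
X_valid = pd.DataFrame({"x": range(20, 25)})
y_valid = 3 * X_valid["x"] + 1

score = score_pipeline(Pipeline([("scale", StandardScaler())]), LinearRegression(),
                       r2_score, X_train, y_train, X_valid, y_valid)
```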
auto_prep.preprocessing.imputing module
- class auto_prep.preprocessing.imputing.NAImputer[source]
Bases:
RequiredStep, NumericalCategorical
Base class for imputing missing values. Provides functionality to identify columns with missing values and determine the strategy to handle them (remove columns with >50% missing data).
- numeric_features
A list of numeric feature names.
- Type:
list
- categorical_features
A list of categorical feature names.
- Type:
list
- fit(X: DataFrame, y: Series | None = None) NAImputer [source]
Identifies columns with more than 50% missing values and removes them from the dataset.
- Parameters:
X (pd.DataFrame) – The input data with missing values.
- Returns:
The fitted imputer instance.
- Return type:
NAImputer
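The 50% rule described above can be sketched with pandas as follows. The helper name `columns_mostly_missing` is hypothetical, and the sketch covers only the column-removal part, not any imputation of the remaining values.

```python
import numpy as np
import pandas as pd


def columns_mostly_missing(X: pd.DataFrame, max_missing_fraction: float = 0.5) -> list:
    """Return the columns whose share of missing values is strictly above the limit."""
    missing_fraction = X.isna().mean()
    return missing_fraction[missing_fraction > max_missing_fraction].index.tolist()


X = pd.DataFrame({
    "age": [25, np.nan, 31, 40],            # 25% missing
    "nickname": [None, None, None, "Al"],   # 75% missing
    "city": ["A", None, None, "B"],         # exactly 50% missing, so it is kept
})
to_remove = columns_mostly_missing(X)
X_clean = X.drop(columns=to_remove)
```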
auto_prep.preprocessing.outlier_detecting module
- class auto_prep.preprocessing.outlier_detecting.OutlierDetector[source]
Bases:
RequiredStep, Numerical
Performs outlier detection on numerical data.
- fit(X: DataFrame, y: Series | None = None) OutlierDetector [source]
Identify feature types in the dataset.
- Parameters:
X (pd.DataFrame) – Input features.
y – Ignored. Exists for scikit-learn compatibility.
- Returns:
Fitted transformer.
- Return type:
OutlierDetector
- Raises:
ValueError – If a non-numerical column is included in X.
- fit_transform(X: DataFrame, y: Series = None)[source]
Fit and transform the data in one step.
- Parameters:
X (pd.DataFrame) – Input data.
y (pd.Series) – Target data.
- Returns:
Transformed data
- Return type:
pd.DataFrame
- to_tex() dict [source]
Returns a short description in the form of a dictionary. Keys are: name - transformer name, desc - short description, params - class parameters (if None then {}).
- transform(X: DataFrame, y: Series = None) DataFrame [source]
Applies cleaning and transformation operations to the input data.
- Parameters:
X (pd.DataFrame) – The input DataFrame to be cleaned and transformed.
y (pd.Series) – The target data.
- Returns:
The cleaned and transformed DataFrame.
- Return type:
pd.DataFrame
auto_prep.preprocessing.redundancy_filtering module
- class auto_prep.preprocessing.redundancy_filtering.UniqueFilter[source]
Bases:
RequiredStep, Categorical
Transformer to remove categorical columns with 100% unique values.
- dropped_columns
List of dropped columns.
- Type:
list
- fit(X: DataFrame, y: Series | None = None) UniqueFilter [source]
Identifies categorical columns with 100% unique values.
- Parameters:
X (pd.DataFrame) – The input feature data.
- Returns:
The fitted transformer instance.
- Return type:
UniqueFilter
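A minimal sketch of the uniqueness check: a categorical column in which every row has a different value (an identifier, for example) carries no reusable signal. The helper name `fully_unique_categoricals` is hypothetical, and treating every non-numeric column as categorical is an assumption.

```python
import pandas as pd


def fully_unique_categoricals(X: pd.DataFrame) -> list:
    """Return non-numeric columns in which every value is distinct."""
    # Assumption: every non-numeric column counts as categorical.
    categorical = X.select_dtypes(exclude="number")
    return [c for c in categorical.columns if categorical[c].nunique() == len(X)]


X = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],     # 100% unique
    "plan": ["free", "pro", "free", "pro"],
    "visits": [1, 2, 3, 4],                  # unique but numeric, so it is ignored
})
dropped = fully_unique_categoricals(X)
```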
auto_prep.preprocessing.scaling module
- class auto_prep.preprocessing.scaling.ColumnScaler(method: str = 'standard')[source]
Bases:
RequiredStep, Numerical
Scaler for all numerical features. This class applies the scaling technique chosen by the user to all numerical features.
Available scaling methods: MinMaxScaler, StandardScaler, RobustScaler from sklearn.
- scaler
fitted scaler instance.
- Type:
object
- PARAMS_GRID = {'method': ['standard', 'minmax', 'robust']}
- fit(X: DataFrame, y: Series | None = None) ColumnScaler [source]
Fits the chosen scaler to the numerical features in the data.
- Parameters:
X (pd.DataFrame) – The feature data to fit the scaler to.
- Returns:
The fitted scaler instance.
- Return type:
ColumnScaler
- fit_transform(X: DataFrame, y: Series = None) DataFrame [source]
Fits and transforms the feature data using the chosen scaler.
- Parameters:
X (pd.DataFrame) – The feature data to transform.
y (pd.Series, optional) – The target variable (to append to the result).
- Returns:
The transformed feature data.
- Return type:
pd.DataFrame
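One plausible way to map the `method` string to a scikit-learn scaler is shown below. The dictionary `SCALERS` and the helper `scale_columns` are hypothetical names; only the three method strings and the three scaler classes come from this reference.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Assumed mapping from the documented method names to sklearn scaler classes.
SCALERS = {"standard": StandardScaler, "minmax": MinMaxScaler, "robust": RobustScaler}


def scale_columns(X: pd.DataFrame, method: str = "standard") -> pd.DataFrame:
    """Scale every column with the chosen scaler and return a DataFrame."""
    scaler = SCALERS[method]()
    # fit_transform returns a NumPy array, so restore column names and index.
    return pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)


X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [10.0, 20.0, 30.0, 40.0, 50.0]})
minmax_scaled = scale_columns(X, method="minmax")
standard_scaled = scale_columns(X, method="standard")
```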
auto_prep.preprocessing.utils module
- class auto_prep.preprocessing.utils.TolerantLabelEncoder(ignore_unknown=True, unknown_original_value='unknown', unknown_encoded_value=-1)[source]
Bases:
LabelEncoder
- inverse_transform(y)[source]
Transform labels back to original encoding.
- Parameters:
y (ndarray of shape (n_samples,)) – Target values.
- Returns:
y – Original encoding.
- Return type:
ndarray of shape (n_samples,)
- set_transform_request(*, column: bool | None | str = '$UNCHANGED$') TolerantLabelEncoder
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
column (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for column parameter in transform.
- Returns:
self – The updated object.
- Return type:
object
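The constructor arguments of TolerantLabelEncoder suggest that labels not seen during fitting are mapped to a placeholder code, and that the placeholder code maps back to a placeholder label. That behaviour is inferred from the parameter names only and is not confirmed by this reference. The pure-Python class below, `TolerantEncoderSketch`, is a hypothetical illustration of the idea and does not reproduce the real class.

```python
class TolerantEncoderSketch:
    """Label encoder that tolerates unseen labels (illustrative sketch only)."""

    def __init__(self, unknown_original_value="unknown", unknown_encoded_value=-1):
        self.unknown_original_value = unknown_original_value
        self.unknown_encoded_value = unknown_encoded_value

    def fit(self, labels):
        # Sorted unique labels get consecutive integer codes starting at 0.
        self.classes_ = sorted(set(labels))
        self._codes = {label: code for code, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        # Unseen labels fall back to the placeholder code instead of raising.
        return [self._codes.get(label, self.unknown_encoded_value) for label in labels]

    def inverse_transform(self, codes):
        # Codes outside the fitted range map back to the placeholder label.
        return [
            self.classes_[code] if 0 <= code < len(self.classes_)
            else self.unknown_original_value
            for code in codes
        ]


encoder = TolerantEncoderSketch().fit(["red", "green", "blue"])
codes = encoder.transform(["green", "purple"])   # "purple" was never seen
labels = encoder.inverse_transform([2, -1])
```

A plain scikit-learn `LabelEncoder` raises an error on unseen labels, which is awkward when validation data contains categories absent from the training split.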
auto_prep.preprocessing.variance_filtering module
- class auto_prep.preprocessing.variance_filtering.VarianceFilter[source]
Bases:
RequiredStep, Numerical
Transformer to remove numerical columns with zero variance.
- dropped_columns
List of dropped columns.
- Type:
list
- fit(X: DataFrame, y: Series | None = None) VarianceFilter [source]
Identifies columns with zero variance and adds them to the dropped_columns list.
- Parameters:
X (pd.DataFrame) – The input feature data.
- Returns:
The fitted transformer instance.
- Return type:
VarianceFilter
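The zero-variance check can be sketched with pandas as follows. The helper name `zero_variance_columns` is hypothetical, and the sketch is a simplified illustration rather than the library's code.

```python
import pandas as pd


def zero_variance_columns(X: pd.DataFrame) -> list:
    """Return the numeric columns whose variance is exactly zero."""
    variances = X.select_dtypes(include="number").var(ddof=0)
    return variances[variances == 0].index.tolist()


X = pd.DataFrame({
    "constant": [7.0, 7.0, 7.0, 7.0],   # same value in every row
    "varying": [1.0, 2.0, 3.0, 4.0],
    "label": ["a", "b", "a", "b"],      # non-numeric, so it is not considered
})
dropped_columns = zero_variance_columns(X)
X_filtered = X.drop(columns=dropped_columns)
```

A constant column cannot help a model distinguish samples, so removing it reduces dimensionality at no cost.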