auto_prep.utils package

Submodules

auto_prep.utils.abstract module

class auto_prep.utils.abstract.Categorical[source]

Bases: ABC

Abstract interface to indicate a categorical step.

class auto_prep.utils.abstract.Classifier[source]

Bases: ABC

Abstract interface to indicate classification problem.

class auto_prep.utils.abstract.ModulesHandler[source]

Bases: ABC

static construct_pipelines_steps(step_name: str, module_name: str, called_from: str, pipelines: List[List[Step]] = [], required_only_: bool = False) → List[List[Step]][source]

Constructs new pipelines (list of steps) by adding steps from the provided module. The method dynamically loads and groups classes from the module, and then extends existing pipelines by adding required and/or non-required steps.

The method starts by loading and grouping classes from the module. It then explodes the existing pipelines by adding required steps. If the required_only_ flag is False, non-required steps are also added to the pipelines.

Parameters:

step_name (str) – The name of the step, used for logging purposes.
module_name (str) – The name of the module from which to load and group classes.
called_from (str) – The calling context, used to resolve relative imports.
pipelines (List[List[Step]]) – A list of existing pipelines to be extended.
required_only_ (bool, optional) – If True, only required steps are added to the pipelines. If False, both required and non-required steps are added. Defaults to False.

Returns:

A list of new pipelines steps created by adding the corresponding required and non-required steps to the original pipelines.

Return type:

List[List[Step]]
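The extension described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name extend_pipelines is hypothetical, steps are represented as plain strings, and the exact way the real method combines required and non-required steps is not stated in this reference.

```python
from typing import List, Sequence


def extend_pipelines(
    pipelines: List[List[str]],
    required: Sequence[str],
    non_required: Sequence[str],
    required_only: bool = False,
) -> List[List[str]]:
    """Hypothetical stand-in; step names are plain strings here."""
    # Start from a single empty pipeline when none exist yet.
    base = pipelines or [[]]
    # Assumption: every pipeline receives all required steps.
    with_required = [pipeline + list(required) for pipeline in base]
    if required_only:
        return with_required
    # Assumption: each non-required step adds one extra variant per pipeline.
    variants = list(with_required)
    for step in non_required:
        variants.extend(pipeline + [step] for pipeline in with_required)
    return variants


result = extend_pipelines([["load"]], ["impute"], ["scale", "bin"])
print(result)
```

With required_only set to True, the same call would return only the single pipeline that ends with the required step.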

static construct_pipelines_steps_helper(step_name: str, package_name: str, called_from: str, pipelines: List[List[Step]], required_only_: bool = False) → List[List[Step]][source]

A helper method to construct and extend pipelines steps by incorporating modules dynamically from a specified package.

This method uses ModulesHandler.construct_pipelines_steps to add modules to existing pipelines based on the package’s name and the current file context. It logs the start and end of the operation.

Parameters:

step_name (str) – The name of the step, used for logging purposes.
package_name (str) – The name of the package containing the modules to be dynamically added to the pipelines.
called_from (str) – The calling context, used to resolve relative imports.
pipelines (List[List[Step]]) – A list of existing pipelines to which new modules will be added.
required_only_ (bool, optional) – If True, only the required modules (determined by the package) will be added. If False, both required and non-required modules will be included. Defaults to False.

Returns:

The updated list of pipelines steps after incorporating the modules from the specified package.

Return type:

List[List[Step]]

static get_subpackage(__file__)[source]

Returns the name of the package (directory) containing the given file, as a relative auto_prep subpackage.

Parameters: __file__ (str) – The absolute or relative path to the current file.
Returns: The name of the directory containing the file, which is treated as the package name.
Return type: str
Raises: ValueError – If the module cannot be found.
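A minimal sketch of this kind of lookup is shown below. The helper name subpackage_of, the root argument, and the dotted return format (".preprocessing") are assumptions made for illustration; the reference does not specify the exact string the real method returns.

```python
from pathlib import PurePosixPath


def subpackage_of(file_path: str, root: str = "auto_prep") -> str:
    """Hypothetical helper: dotted subpackage path relative to `root`."""
    parts = PurePosixPath(file_path).parent.parts
    if root not in parts:
        raise ValueError(f"{file_path!r} is not located inside {root!r}")
    # Use the last occurrence of the root directory name in the path.
    idx = len(parts) - 1 - parts[::-1].index(root)
    return "." + ".".join(parts[idx + 1:])


print(subpackage_of("/home/user/auto_prep/preprocessing/imputing.py"))
```

A path outside the root package raises ValueError, mirroring the documented failure mode.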

static load_classes(module_name: str, package: str) → List[object][source]

supported_combinations: List[List[object]] = [('NumericalRequired', (<class 'auto_prep.utils.abstract.Numerical'>, <class 'auto_prep.utils.abstract.RequiredStep'>)), ('NumericalNonRequired', (<class 'auto_prep.utils.abstract.Numerical'>, <class 'auto_prep.utils.abstract.NonRequiredStep'>)), ('CategoricalRequired', (<class 'auto_prep.utils.abstract.Categorical'>, <class 'auto_prep.utils.abstract.RequiredStep'>)), ('CategoricalNonRequired', (<class 'auto_prep.utils.abstract.Categorical'>, <class 'auto_prep.utils.abstract.NonRequiredStep'>)), ('NumericalCategoricalRequired', (<class 'auto_prep.utils.abstract.NumericalCategorical'>, <class 'auto_prep.utils.abstract.RequiredStep'>)), ('NumericalCategoricalNonRequired', (<class 'auto_prep.utils.abstract.NumericalCategorical'>, <class 'auto_prep.utils.abstract.NonRequiredStep'>))]

supported_interfaces: List[object] = [<class 'auto_prep.utils.abstract.Numerical'>, <class 'auto_prep.utils.abstract.Categorical'>, <class 'auto_prep.utils.abstract.NumericalCategorical'>, <class 'auto_prep.utils.abstract.RequiredStep'>, <class 'auto_prep.utils.abstract.NonRequiredStep'>]
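The combinations listed above pair a data-type interface with a requirement level. One way such a grouping could work is sketched below using issubclass checks; the stand-in classes, the group_classes helper, and the MeanImputer and Binner examples are hypothetical and only mirror the names in this reference, not the library's actual loading logic.

```python
from abc import ABC


# Local stand-ins that mirror the documented class names.
class Step(ABC): ...
class RequiredStep(Step): ...
class NonRequiredStep(Step): ...
class Numerical(ABC): ...
class Categorical(ABC): ...


COMBINATIONS = [
    ("NumericalRequired", (Numerical, RequiredStep)),
    ("NumericalNonRequired", (Numerical, NonRequiredStep)),
    ("CategoricalRequired", (Categorical, RequiredStep)),
    ("CategoricalNonRequired", (Categorical, NonRequiredStep)),
]


def group_classes(classes):
    """Assign each class to every combination whose interfaces it implements."""
    groups = {name: [] for name, _ in COMBINATIONS}
    for cls in classes:
        for name, interfaces in COMBINATIONS:
            if all(issubclass(cls, interface) for interface in interfaces):
                groups[name].append(cls)
    return groups


class MeanImputer(RequiredStep, Numerical): ...   # hypothetical step
class Binner(NonRequiredStep, Numerical): ...     # hypothetical step


groups = group_classes([MeanImputer, Binner])
print({name: [c.__name__ for c in found] for name, found in groups.items()})
```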

class auto_prep.utils.abstract.NonRequiredStep[source]

Bases: Step

Non-required step that will only be considered for preprocessing.

class auto_prep.utils.abstract.Numerical[source]

Bases: ABC

Abstract interface to indicate a numerical step.

class auto_prep.utils.abstract.NumericalCategorical[source]

Bases: ABC

Abstract interface to indicate a step that is both categorical and numerical.

class auto_prep.utils.abstract.Regressor[source]

Bases: ABC

Abstract interface to indicate regression problem.

class auto_prep.utils.abstract.RequiredStep[source]

Bases: Step

Required step that will always be considered in preprocessing.

class auto_prep.utils.abstract.Step[source]

Bases: ABC, BaseEstimator, TransformerMixin

Abstract class to be subclassed when implementing custom preprocessing steps. If a step is parametrizable, it should define a “param_grid” with all possible values for each parameter.

abstract to_tex() → dict[source]: Returns a short description as a dictionary with the keys: name (transformer name), desc (short description), and params (class parameters; {} if there are none).
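A custom step following this contract might look like the sketch below. To keep the example self-contained, StepSketch is a local stand-in for Step (without the scikit-learn base classes), and ClipValues with its parameters is a hypothetical transformer, not part of the library.

```python
from abc import ABC, abstractmethod


class StepSketch(ABC):
    """Local stand-in for the documented Step class."""

    @abstractmethod
    def to_tex(self) -> dict:
        ...


class ClipValues(StepSketch):
    # All candidate values for each tunable parameter.
    param_grid = {"lower": [0.0, 0.05], "upper": [0.95, 1.0]}

    def __init__(self, lower: float = 0.0, upper: float = 1.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Clamp every value into the [lower, upper] range.
        return [min(max(x, self.lower), self.upper) for x in X]

    def to_tex(self) -> dict:
        return {
            "name": type(self).__name__,
            "desc": "Clips values to a fixed range.",
            "params": {"lower": self.lower, "upper": self.upper},
        }


step = ClipValues(lower=0.0, upper=1.0)
print(step.fit([0.5]).transform([-1.0, 0.5, 2.0]))
print(step.to_tex())
```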

auto_prep.utils.config module

class auto_prep.utils.config.GlobalConfig(*args, **kwargs)[source]

Bases: object

Global config class.

prepare_dir()[source]: Clears and creates all necessary directories.

set(raport_name: str = 'raport', raport_title: str = 'ML Raport', raport_author: str = 'AutoPrep', raport_abstract: str = NoEscape( \begin{abstract} This raport has been generated with AutoPrep. \end{abstract} ), root_dir: str = 'raport', return_tex_: bool = True, logger_colors_map: dict = {'CRITICAL': '\x1b[41m', 'DEBUG': '\x1b[36m', 'ERROR': '\x1b[31m', 'INFO': '\x1b[32m', 'RESET': '\x1b[0m', 'WARNING': '\x1b[33m'}, log_format: str = '%(asctime)s %(levelname)s %(name)s: %(message)s', log_date_format: str = '%Y-%m-%d %H:%M:%S', log_level: str = 50, log_dir: str | None = None, max_log_file_size_in_mb: int = 5, tex_geomatry: dict = {'bmargin': '0.5in', 'footskip': '0.2in', 'headheight': '10pt', 'margin': '0.5in', 'tmargin': '0.5in'}, train_size: float = 0.8, test_size: float = 0.1, valid_size: float = 0.1, random_state: int = 42, max_datasets_after_preprocessing: int = 3, perform_only_required_: bool = False, raport_decimal_precision: int = 4, chart_settings: dict = {'heatmap_cmap': 'coolwarm', 'heatmap_fmt': '.2f', 'palette': 'pastel', 'plot_height_per_row': 8, 'plot_width': 20, 'theme': 'white', 'tick_label_rotation': 45, 'title_fontsize': 18, 'title_fontweight': 'bold', 'xlabel_fontsize': 15, 'ylabel_fontsize': 15}, correlation_selectors_settings: dict = {'k': 10, 'threshold': 0.8}, outlier_detector_settings: dict = {'cook_threshold': 1, 'isol_forest_n_estimators': 100, 'zscore_threshold': 3}, imputer_settings: dict = {'categorical_strategy': 'most_frequent', 'n_iter': 10, 'numerical_strategy': 'mean'}, umap_components: int = 50, correlation_threshold: float = 0.8, correlation_percent: float = 0.7, n_bins: int = 4, outlier_detector_method: str = 'zscore', max_unique_values_classification: int = 20, regression_pipeline_scoring_model: sklearn.base.BaseEstimator = RandomForestRegressor(max_depth=5, n_jobs=-1, random_state=42, warm_start=True), classification_pipeline_scoring_model: sklearn.base.BaseEstimator = RandomForestClassifier(max_depth=5, n_jobs=-1, random_state=42, warm_start=True), regression_pipeline_scoring_func: callable | str = (<function mean_squared_error>, 'min'), classification_pipeline_scoring_func_bin: callable | str = (<function roc_auc_score>, 'max'), classification_pipeline_scoring_func_multi: callable | str = (<function accuracy_score>, 'max'), max_workers: int | None = None, tuning_params: dict = {'cv': 3, 'n_iter': 10, 'n_jobs': -1, 'random_state': 42, 'verbose': 0}, max_models: int = 3)[source]

Parameters:

raport_name (str)
raport_title (str)
raport_author (str)
raport_abstract (str) – Defaults to DEFAULT_ABSTRACT.
root_dir (str) – Directory where the raport and all cache are stored. Defaults to “raport”.
return_tex_ (bool) – Whether to return the tex source alongside the pdf. Defaults to True.
logger_colors_map (dict) – Defaults to COLORS.
log_format (str) – Defaults to LOG_FORMAT.
log_date_format (str) – Defaults to LOG_DATE_FORMAT.
log_level (str) – Defaults to LOG_LEVEL.
log_dir (str) – If None, defaults to “logs” in the directory from which the program was called. -1 means no logging to file.
max_log_file_size_in_mb (int) – Maximum log file size in MB for each logger. Defaults to 5.
tex_geomatry (dict) – Defaults to DEFAULT_TEX_GEOMETRY.
train_size (float)
test_size (float)
valid_size (float)
random_state (int)
max_datasets_after_preprocessing (int) – Maximum number of datasets kept after the preprocessing steps; further models are trained on them. Strongly affects performance.
perform_only_required_ (bool) – Affects the entire process.
raport_decimal_precision (int) – Uses standard Python rounding.
chart_settings (dict) – Settings for customizing chart appearance. Defaults to None, which initializes default settings.
correlation_selectors_settings (dict) – Settings for correlation selectors.
outlier_detector_settings (dict) – Settings for outlier detectors.
imputer_settings (dict) – Settings for imputers.
umap_components (int) – Number of components for UMAP.
max_unique_values_classification (int) – In task “auto”, the number of unique values is calculated; if it is lower than this value, classification is performed.
regression_pipeline_scoring_model (BaseEstimator) – Scoring model used in regression tasks.
classification_pipeline_scoring_model (BaseEstimator) – Scoring model used in classification tasks.
regression_pipeline_scoring_func (callable | str) – Scoring pair; the default is (mean_squared_error, 'min').
classification_pipeline_scoring_func_bin (callable | str) – Scoring pair; the default is (roc_auc_score, 'max').
classification_pipeline_scoring_func_multi (callable | str) – Scoring pair; the default is (accuracy_score, 'max').
raport_chart_color_pallete (List[str])
correlation_threshold (float)
correlation_percent (float)
n_bins (int)
outlier_detector_method (str)
max_workers (int)
tuning_params (dict)
max_models (int)

update(**kwargs)[source]: Updates config’s data with kwargs.
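As a rough illustration of a keyword-based update, the sketch below uses a local ConfigSketch class holding a few of the documented defaults. It is an assumption that update simply assigns each keyword as an attribute; whether the real class validates keys or values is not stated here, and splits_are_consistent is a hypothetical helper.

```python
import math


class ConfigSketch:
    """Local stand-in holding a few of the documented defaults."""

    def __init__(self):
        self.train_size = 0.8
        self.test_size = 0.1
        self.valid_size = 0.1
        self.random_state = 42

    def update(self, **kwargs):
        # Assumption: each keyword overwrites the attribute of the same name.
        for key, value in kwargs.items():
            setattr(self, key, value)


def splits_are_consistent(cfg: ConfigSketch) -> bool:
    """Hypothetical check that the three split sizes cover the whole dataset."""
    return math.isclose(cfg.train_size + cfg.test_size + cfg.valid_size, 1.0)


cfg = ConfigSketch()
cfg.update(train_size=0.7, test_size=0.2)
print(cfg.train_size, cfg.test_size, splits_are_consistent(cfg))
```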

auto_prep.utils.logging_config module

class auto_prep.utils.logging_config.ColoredFormatter(fmt=None, datefmt=None, style='%', validate=True, *, defaults=None)[source]

Bases: Formatter

Custom formatter adding colors to levelname and timing information.

format(record: LogRecord) → str[source]

Formats a logging record into a string.

Parameters: record (logging.LogRecord) – The logging record to be formatted.
Returns: The formatted string representation of the logging record.
Return type: str
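A formatter of this kind could be written along the following lines. This is a sketch, not the library's implementation: ColorSketchFormatter is a hypothetical name, the color codes are copied from the documented logger_colors_map default, and only the level name is colored (the timing information mentioned above is left out).

```python
import logging

# Color codes taken from the documented logger_colors_map default.
COLORS = {
    "CRITICAL": "\x1b[41m",
    "DEBUG": "\x1b[36m",
    "ERROR": "\x1b[31m",
    "INFO": "\x1b[32m",
    "RESET": "\x1b[0m",
    "WARNING": "\x1b[33m",
}


class ColorSketchFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        original = record.levelname
        color = COLORS.get(original, "")
        # Wrap the level name in the color code and a reset sequence.
        record.levelname = f"{color}{original}{COLORS['RESET']}"
        try:
            return super().format(record)
        finally:
            # Restore the record so other handlers see the plain level name.
            record.levelname = original


formatter = ColorSketchFormatter("%(levelname)s %(message)s")
record = logging.LogRecord("demo", logging.ERROR, "demo.py", 1, "boom", None, None)
formatted = formatter.format(record)
print(repr(formatted))
```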

class auto_prep.utils.logging_config.TimedLogger(name: str, level: int = 0)[source]

Bases: Logger

Logger subclass that tracks operation timing.

end_operation() → None[source]: End timing the current operation and log the elapsed time.

start_operation(operation: str) → None[source]

Initiates the timing of a specific operation.

This method marks the beginning of an operation and starts the timer. It is used to track the duration of a specific operation or task.

Parameters: operation (str) – The name or description of the operation being timed.
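The start/end pairing can be illustrated with a small Logger subclass. The class name TimedLoggerSketch, the message wording, and the use of time.perf_counter are assumptions; only the two method names come from this reference.

```python
import logging
import time


class TimedLoggerSketch(logging.Logger):
    def __init__(self, name: str, level: int = logging.NOTSET):
        super().__init__(name, level)
        self._operation = None
        self._started = None

    def start_operation(self, operation: str) -> None:
        # Remember what is being timed and when it started.
        self._operation = operation
        self._started = time.perf_counter()
        self.info("Starting %s", operation)

    def end_operation(self) -> None:
        if self._started is None:
            return  # Nothing is being timed.
        elapsed = time.perf_counter() - self._started
        self.info("Finished %s in %.3fs", self._operation, elapsed)
        self._operation = None
        self._started = None


class ListHandler(logging.Handler):
    """Collects formatted messages so the example can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
log = TimedLoggerSketch("timed-demo")
log.addHandler(handler)
log.start_operation("fit")
log.end_operation()
print(handler.messages)
```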

auto_prep.utils.logging_config.setup_logger(name: str) → TimedLogger[source]

Sets up a logger with both file and console handlers.

Parameters:

name (str) – The name of the logger.

Returns:

An instance of the TimedLogger class, which is a subclass of the standard Python logger that adds timing functionality.

Return type:

TimedLogger
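A logger with both a console and a file handler could be assembled as sketched below. The function name setup_logger_sketch, its log_dir and max_mb arguments, and the choice of RotatingFileHandler are assumptions for illustration; the format strings reuse the documented log_format and log_date_format defaults.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


def setup_logger_sketch(name: str, log_dir: str, max_mb: int = 5) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if logger.handlers:
        return logger  # Already configured; avoid duplicate handlers.

    formatter = logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s",
        "%Y-%m-%d %H:%M:%S",
    )

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    # Rotate the file once it reaches the size limit (assumed behavior).
    os.makedirs(log_dir, exist_ok=True)
    file_handler = RotatingFileHandler(
        os.path.join(log_dir, f"{name}.log"),
        maxBytes=max_mb * 1024 * 1024,
        backupCount=1,
    )
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    return logger
```

Calling the function twice with the same name returns the already configured logger instead of attaching a second pair of handlers.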

auto_prep.utils.other module

auto_prep.utils.other.get_scoring(task: str, y_train: Series) → callable | str[source]

Retrieve proper scoring function from config.

Parameters:

y_train (pd.Series) – Training target dataset.
task (str) – regression / classification
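The selection could work roughly as in the sketch below. It assumes that binary and multiclass targets are distinguished by the number of unique values in y_train, which matches the separate _bin and _multi settings in the config but is not confirmed by this reference. The scorers are represented by their names as strings, and pick_scoring is a hypothetical helper that accepts any sequence instead of a pandas Series.

```python
# Stand-ins for the documented config defaults: (scorer name, direction).
SCORING = {
    "regression": ("mean_squared_error", "min"),
    "binary": ("roc_auc_score", "max"),
    "multiclass": ("accuracy_score", "max"),
}


def pick_scoring(task: str, y_train):
    """Hypothetical selection of a scoring pair for the given task."""
    if task == "regression":
        return SCORING["regression"]
    if task == "classification":
        # Assumption: two or fewer unique target values means binary.
        n_classes = len(set(y_train))
        return SCORING["binary"] if n_classes <= 2 else SCORING["multiclass"]
    raise ValueError(f"Unknown task: {task!r}")


print(pick_scoring("classification", [0, 1, 1, 0]))
print(pick_scoring("classification", ["a", "b", "c"]))
print(pick_scoring("regression", [0.1, 2.5]))
```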

auto_prep.utils.other.save_chart(name: str, *args, **kwargs) → str[source]

Saves chart to directory specified in config.

Parameters: name (str)

Returns: path (str) – Path where the chart has been saved.

auto_prep.utils.other.save_json(name: str, obj: Any) → str[source]

Saves json-like object to directory specified in config.

Parameters: name (str)

Returns: path (str) – Path where the object has been saved.
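A helper with this shape might be written as follows. The explicit root_dir argument and the automatic “.json” suffix are assumptions for this sketch; the documented function takes its target directory from the config instead.

```python
import json
import os


def save_json_sketch(name: str, obj, root_dir: str) -> str:
    """Hypothetical helper: write `obj` as JSON and return the file path."""
    os.makedirs(root_dir, exist_ok=True)
    filename = name if name.endswith(".json") else f"{name}.json"
    path = os.path.join(root_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(obj, handle, indent=2)
    return path
```

The returned path can then be passed on to whatever builds the raport.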

auto_prep.utils.other.save_model(name: str, model: BaseEstimator, *args, **kwargs) → str[source]

Saves model to directory specified in config.

Parameters:

name (str)
model (BaseEstimator)

Returns: path (str) – Path where the model has been saved.

auto_prep.utils.system module

auto_prep.utils.system.get_system_info()[source]: Collects system stats.
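The reference does not say which stats are collected or in what form. As a purely illustrative sketch, the hypothetical system_info_sketch below gathers a few values from the standard library; the keys are assumptions, not the fields returned by get_system_info.

```python
import os
import platform


def system_info_sketch() -> dict:
    """Hypothetical collection of basic system facts from the standard library."""
    return {
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "cpu_count": os.cpu_count(),
    }


info = system_info_sketch()
print(info)
```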
