Additional Notes#

Implementation Details#

Class Imbalance Handling:

The ET-COME method addresses class imbalance through three integrated modules:

  1. Module A (Epistemic Admissibility): Identifies learnable uncertainty regions using information-theoretic decomposition (a sketch of this decomposition follows the list)

  2. Module B (Risk-Targeted Transport): Uses optimal transport to move synthetic mass toward regions where the ensemble is genuinely confused

  3. Module C (Conformal Screening): Filters synthetic points using OOB-based conformal prediction thresholds
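The kind of decomposition Module A relies on can be sketched with NumPy as below: the ensemble's total predictive entropy splits into an aleatoric part (the mean of per-member entropies) and an epistemic part (the mutual-information remainder). The function name, array layout, and smoothing constant are illustrative assumptions, not ET-COME's exact implementation.

```python
import numpy as np

def decompose_entropy(member_probs, eps=1e-12):
    """Split ensemble predictive entropy into aleatoric and epistemic parts.

    member_probs: (n_members, n_samples, n_classes) class probabilities from
    each ensemble member (e.g. per-tree OOB estimates). Illustrative only.
    """
    mean_probs = member_probs.mean(axis=0)                          # (n_samples, n_classes)
    total = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)   # entropy of the mean
    per_member = -(member_probs * np.log(member_probs + eps)).sum(axis=-1)
    aleatoric = per_member.mean(axis=0)                             # mean member entropy
    epistemic = total - aleatoric                                   # mutual-information term
    return total, aleatoric, epistemic
```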

Algorithm Steps:

  1. Clean training data with ENN, build HNSW graph, train initial ensemble E⁰

  2. For each iteration:

     - Decompose OOB entropy into epistemic and aleatoric components (Module A)

     - Score admissible minority nodes by F1 gradient (Module B)

     - Solve the entropy-regularized optimal transport problem (Module B); see the Sinkhorn sketch after this list

     - Screen candidates against the E⁰ OOB consistency threshold (Module C)

     - Add accepted synthetic points to the training set

     - Increment the ensemble with new trees trained on the augmented data

  3. Stop when transport plan and OOB interval width both converge
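For the transport step, a generic entropy-regularized optimal transport solver (Sinkhorn iterations) looks like the sketch below. The cost matrix, marginals, and regularization strength that ET-COME derives from F1-gradient scores and admissible nodes are left as plain inputs here, so this illustrates the underlying technique rather than the method's own code.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Generic Sinkhorn solver for entropy-regularized optimal transport.

    cost: (n, m) cost matrix, a: (n,) source marginal, b: (m,) target marginal.
    Returns the (n, m) transport plan.
    """
    K = np.exp(-cost / reg)                  # Gibbs kernel
    u = np.ones_like(a, dtype=float)
    for _ in range(n_iters):
        v = b / (K.T @ u)                    # scale columns to match the target marginal
        u = a / (K @ v)                      # scale rows to match the source marginal
    return u[:, None] * K * v[None, :]       # transport plan
```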

Performance Considerations:

  • Training time scales linearly with number of iterations

  • Memory usage depends on oversampling ratio and dataset size

  • Recommendation: Start with 3-5 iterations, tune based on validation performance

Hyperparameter Tuning:

Key hyperparameters to tune (a grid-search sketch follows this list):

  • iterations: Number of refinement iterations (default: 5)

  • n_neighbors: Neighborhood size for SMOTE (default: 5)

  • sampling_strategy: Oversampling ratio (default: 'auto')

  • classifier: Base classifier and its parameters
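A minimal grid-search sketch over these hyperparameters, assuming the estimator follows the scikit-learn API. `ETCOME` and its import path are stand-ins for the actual class, so adjust them to the package's real layout before running.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

from etcome import ETCOME  # hypothetical import path; adjust to the actual package

param_grid = {
    "iterations": [3, 5, 7],
    "n_neighbors": [3, 5, 7],
    "sampling_strategy": ["auto", 0.5, 1.0],
}

search = GridSearchCV(
    estimator=ETCOME(classifier=RandomForestClassifier(n_jobs=-1, random_state=0)),
    param_grid=param_grid,
    scoring="f1",                                        # not accuracy, per the guidance below
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=-1,
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```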

Validation and Evaluation#

Metrics:

Due to class imbalance, standard accuracy is not recommended. Instead, use the following metrics (computed in the sketch after this list):

  • F1-score

  • Balanced Accuracy

  • Precision-Recall AUC

  • G-Mean
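These can be computed with scikit-learn as sketched below. Since scikit-learn has no built-in G-mean, it is derived here as the geometric mean of per-class recalls; imbalanced-learn's geometric_mean_score is an alternative if that package is installed.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    balanced_accuracy_score,
    f1_score,
    recall_score,
)

def imbalanced_report(y_true, y_pred, y_score):
    """Metrics suited to imbalanced data; y_score is the positive-class probability."""
    recalls = recall_score(y_true, y_pred, average=None)      # per-class recall
    return {
        "f1": f1_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "pr_auc": average_precision_score(y_true, y_score),
        "g_mean": float(np.prod(recalls) ** (1.0 / len(recalls))),
    }
```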

Cross-Validation:

Always use stratified k-fold cross-validation to maintain class distribution in folds.
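For example, with scikit-learn (the fold count, scorer, and baseline model are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_jobs=-1, random_state=0)   # or the tuned ET-COME pipeline
# scores = cross_val_score(model, X, y, scoring="f1", cv=cv, n_jobs=-1)
# print(scores.mean(), scores.std())
```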

Warning

Do not use standard accuracy as the primary evaluation metric on imbalanced datasets.

Memory and Runtime#

Memory Usage:

  • Oversampling ratio significantly impacts memory

  • Consider dataset size and available memory when setting sampling_strategy

  • Use verbose=False to reduce intermediate data storage

Runtime Optimization:

  • Use n_jobs=-1 with scikit-learn classifiers for parallel processing

  • Tune hyperparameters on a smaller subset before running full experiments (see the subsampling sketch after this list)

  • Pre-allocate arrays where possible in custom implementations
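One way to keep tuning runs manageable is to search hyperparameters on a stratified subsample and retrain the chosen configuration on the full data. A sketch with toy data standing in for the real training set (the 20% fraction is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced data stands in for the real training set.
X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=0)

# Stratified 20% subsample: run hyperparameter search on (X_small, y_small),
# then retrain the best configuration on the full data.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0
)
```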

Limitations#

  1. Binary Classification: Currently optimized for binary classification

  2. Feature Types: Works best with numerical features

  3. Small Datasets: May require more careful hyperparameter tuning

  4. Categorical Features: Encode categorical features before use (see the encoding sketch below)
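A typical pre-encoding step uses scikit-learn's ColumnTransformer with OneHotEncoder; the column names below are placeholders for your dataset's categorical columns.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Placeholder column names; replace with your dataset's categorical columns.
categorical_cols = ["colour", "region"]

encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",   # numerical columns pass through unchanged
)
# X_encoded = encoder.fit_transform(X_df)
```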

Future Work#

  • Multiclass classification support

  • Automated hyperparameter tuning

  • Production-ready optimization

  • GPU acceleration for large datasets

  • Support for categorical features with native encoding

Troubleshooting#

Issue: Poor performance compared to baseline

  • Check that cross-validation is stratified

  • Verify that appropriate metrics are used (not accuracy)

  • Try increasing number of iterations

  • Ensure base classifier is appropriate for your data

Issue: Training time very long

  • Reduce number of iterations

  • Use smaller validation set

  • Consider reducing dataset size for tuning

  • Use parallel processing (n_jobs=-1)

Issue: Memory errors

  • Reduce oversampling ratio

  • Use smaller batch sizes

  • Consider sampling the training data

  • Monitor memory usage during iterations (a tracemalloc sketch follows this list)
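For the last point, Python's standard-library tracemalloc gives a quick peak-memory readout around a training step; it tracks allocations made through Python's allocator, so treat the numbers as indicative.

```python
import tracemalloc

tracemalloc.start()
# model.fit(X_train, y_train)   # run the training step you want to profile
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()
```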