Additional Notes#

Implementation Details#

Class Imbalance Handling:

The ET-COME method addresses class imbalance through three integrated modules:

  1. Module A (Epistemic Admissibility): Identifies learnable uncertainty regions using information-theoretic decomposition (a sketch of this decomposition follows the list)

  2. Module B (Risk-Targeted Transport): Uses optimal transport to move synthetic mass toward regions where the ensemble is genuinely confused

  3. Module C (Conformal Screening): Filters synthetic points using OOB-based conformal prediction thresholds
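The kind of decomposition Module A relies on can be sketched with NumPy as below: the ensemble's total predictive entropy splits into an aleatoric part (the mean of per-member entropies) and an epistemic part (the mutual-information remainder). The function name, array layout, and smoothing constant are illustrative assumptions, not ET-COME's exact implementation.

```python
import numpy as np

def decompose_entropy(member_probs, eps=1e-12):
    """Split ensemble predictive entropy into aleatoric and epistemic parts.

    member_probs: (n_members, n_samples, n_classes) class probabilities from
    each ensemble member (e.g. per-tree OOB estimates). Illustrative only.
    """
    mean_probs = member_probs.mean(axis=0)                          # (n_samples, n_classes)
    total = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)   # entropy of the mean
    per_member = -(member_probs * np.log(member_probs + eps)).sum(axis=-1)
    aleatoric = per_member.mean(axis=0)                             # mean member entropy
    epistemic = total - aleatoric                                   # mutual-information term
    return total, aleatoric, epistemic
```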

Algorithm Steps:

  1. Clean training data with ENN, build HNSW graph, train initial ensemble E⁰

  2. For each iteration:

     - Decompose OOB entropy into epistemic and aleatoric components (Module A)

     - Score admissible minority nodes by F1 gradient (Module B)

     - Solve the entropy-regularized optimal transport problem (Module B); see the Sinkhorn sketch after this list

     - Screen candidates against the E⁰ OOB consistency threshold (Module C)

     - Add accepted synthetic points to the training set

     - Increment the ensemble with new trees trained on the augmented data

  3. Stop when transport plan and OOB interval width both converge
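For the transport step, a generic entropy-regularized optimal transport solver (Sinkhorn iterations) looks like the sketch below. The cost matrix, marginals, and regularization strength that ET-COME derives from F1-gradient scores and admissible nodes are left as plain inputs here, so this illustrates the underlying technique rather than the method's own code.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Generic Sinkhorn solver for entropy-regularized optimal transport.

    cost: (n, m) cost matrix, a: (n,) source marginal, b: (m,) target marginal.
    Returns the (n, m) transport plan.
    """
    K = np.exp(-cost / reg)                  # Gibbs kernel
    u = np.ones_like(a, dtype=float)
    for _ in range(n_iters):
        v = b / (K.T @ u)                    # scale columns to match the target marginal
        u = a / (K @ v)                      # scale rows to match the source marginal
    return u[:, None] * K * v[None, :]       # transport plan
```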

Performance Considerations:

  • Training time scales linearly with number of iterations

  • Memory usage depends on oversampling ratio and dataset size

  • Recommendation: Start with 3-5 iterations, tune based on validation performance

Hyperparameter Tuning:

Key hyperparameters to tune (a grid-search sketch follows this list):

  • iterations: Number of refinement iterations (default: 5)

  • n_neighbors: Neighborhood size for SMOTE (default: 5)

  • sampling_strategy: Oversampling ratio (default: 'auto')

  • classifier: Base classifier and its parameters
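A minimal grid-search sketch over these hyperparameters, assuming the estimator follows the scikit-learn API. `ETCOME` and its import path are stand-ins for the actual class, so adjust them to the package's real layout before running.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

from etcome import ETCOME  # hypothetical import path; adjust to the actual package

param_grid = {
    "iterations": [3, 5, 7],
    "n_neighbors": [3, 5, 7],
    "sampling_strategy": ["auto", 0.5, 1.0],
}

search = GridSearchCV(
    estimator=ETCOME(classifier=RandomForestClassifier(n_jobs=-1, random_state=0)),
    param_grid=param_grid,
    scoring="f1",                                        # not accuracy, per the guidance below
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=-1,
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```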

Validation and Evaluation#

Metrics:

Due to class imbalance, standard accuracy is not recommended. Instead, use the following metrics (computed in the sketch after this list):

  • F1-score

  • Balanced Accuracy

  • Precision-Recall AUC

  • G-Mean
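These can be computed with scikit-learn as sketched below. Since scikit-learn has no built-in G-mean, it is derived here as the geometric mean of per-class recalls; imbalanced-learn's geometric_mean_score is an alternative if that package is installed.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    balanced_accuracy_score,
    f1_score,
    recall_score,
)

def imbalanced_report(y_true, y_pred, y_score):
    """Metrics suited to imbalanced data; y_score is the positive-class probability."""
    recalls = recall_score(y_true, y_pred, average=None)      # per-class recall
    return {
        "f1": f1_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "pr_auc": average_precision_score(y_true, y_score),
        "g_mean": float(np.prod(recalls) ** (1.0 / len(recalls))),
    }
```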

Cross-Validation:

Always use stratified k-fold cross-validation to maintain class distribution in folds.
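For example, with scikit-learn (the fold count, scorer, and baseline model are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_jobs=-1, random_state=0)   # or the tuned ET-COME pipeline
# scores = cross_val_score(model, X, y, scoring="f1", cv=cv, n_jobs=-1)
# print(scores.mean(), scores.std())
```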

Warning

Do not use standard accuracy as the primary evaluation metric on imbalanced datasets.

Memory and Runtime#

Memory Usage:

  • Oversampling ratio significantly impacts memory

  • Consider dataset size and available memory when setting sampling_strategy

  • Use verbose=False to reduce intermediate data storage

Runtime Optimization:

  • Use n_jobs=-1 with scikit-learn classifiers for parallel processing

  • Tune hyperparameters on a smaller subset before running full experiments (see the subsampling sketch after this list)

  • Pre-allocate arrays where possible in custom implementations
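One way to keep tuning runs manageable is to search hyperparameters on a stratified subsample and retrain the chosen configuration on the full data. A sketch with toy data standing in for the real training set (the 20% fraction is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced data stands in for the real training set.
X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=0)

# Stratified 20% subsample: run hyperparameter search on (X_small, y_small),
# then retrain the best configuration on the full data.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0
)
```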

Limitations#

  1. Binary Classification: Currently optimized for binary classification

  2. Feature Types: Works best with numerical features

  3. Small Datasets: May require more careful hyperparameter tuning

  4. Categorical Features: Encode categorical features before use (see the encoding sketch below)
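A typical pre-encoding step uses scikit-learn's ColumnTransformer with OneHotEncoder; the column names below are placeholders for your dataset's categorical columns.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Placeholder column names; replace with your dataset's categorical columns.
categorical_cols = ["colour", "region"]

encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",   # numerical columns pass through unchanged
)
# X_encoded = encoder.fit_transform(X_df)
```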

Future Work#

  • Multiclass classification support

  • Automated hyperparameter tuning

  • Production-ready optimization

  • GPU acceleration for large datasets

  • Support for categorical features with native encoding

Troubleshooting#

Issue: Poor performance compared to baseline

  • Check that cross-validation is stratified

  • Verify that appropriate metrics are used (not accuracy)

  • Try increasing number of iterations

  • Ensure base classifier is appropriate for your data

Issue: Training time very long

  • Reduce number of iterations

  • Use smaller validation set

  • Consider reducing dataset size for tuning

  • Use parallel processing (n_jobs=-1)

Issue: Memory errors

  • Reduce oversampling ratio

  • Use smaller batch sizes

  • Consider sampling the training data

  • Monitor memory usage during iterations (a tracemalloc sketch follows this list)
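For the last point, Python's standard-library tracemalloc gives a quick peak-memory readout around a training step; it tracks allocations made through Python's allocator, so treat the numbers as indicative.

```python
import tracemalloc

tracemalloc.start()
# model.fit(X_train, y_train)   # run the training step you want to profile
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()
```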