Additional Notes#
Implementation Details#
Class Imbalance Handling:
The ET-COME method addresses class imbalance through three integrated modules:
Module A (Epistemic Admissibility): Identifies learnable uncertainty regions using information-theoretic decomposition (a sketch of one such decomposition follows this list)
Module B (Risk-Targeted Transport): Uses optimal transport to move synthetic mass toward regions where the ensemble is genuinely confused
Module C (Conformal Screening): Filters synthetic points using OOB-based conformal prediction thresholds
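For orientation, the sketch below shows one standard way to split ensemble predictive entropy into aleatoric and epistemic parts using per-tree probabilities from a scikit-learn random forest. It only illustrates the kind of decomposition Module A performs; the ET-COME implementation works on OOB predictions and may decompose differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative decomposition: total entropy = aleatoric + epistemic,
# where the epistemic part reflects disagreement between trees.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Per-tree class probabilities: shape (n_trees, n_samples, n_classes)
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
mean_p = per_tree.mean(axis=0)

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

total = entropy(mean_p)                      # entropy of the averaged prediction
aleatoric = entropy(per_tree).mean(axis=0)   # average per-tree entropy
epistemic = total - aleatoric                # Jensen gap: tree disagreement
```

The epistemic term here is the gap between the entropy of the averaged prediction and the average per-tree entropy, which is non-negative by concavity of entropy.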
Algorithm Steps:
1. Clean the training data with ENN, build the HNSW graph, and train the initial ensemble E⁰.
2. For each iteration:
   - Decompose OOB entropy into epistemic and aleatoric components (Module A)
   - Score admissible minority nodes by F1 gradient (Module B)
   - Solve the entropy-regularized optimal transport problem (Module B)
   - Screen candidates against the E⁰ OOB consistency threshold (Module C)
   - Add accepted synthetic points to the training set
   - Increment the ensemble with new trees on the augmented data
3. Stop when the transport plan and the OOB interval width both converge.
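The skeleton below mirrors the shape of this loop using deliberately simplified stand-ins: plain SMOTE replaces the admissibility and transport steps (Modules A and B), and a crude confidence filter against the initial ensemble replaces conformal screening (Module C). It shows only the iterate / generate / screen / augment / refit structure; it is not the ET-COME algorithm itself.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

def simplified_refinement_loop(X, y, iterations=3, min_conf=0.3, seed=0):
    """Loop skeleton only: SMOTE and a confidence filter stand in for
    Modules A-C. Not the ET-COME algorithm."""
    X_aug, y_aug = np.asarray(X, dtype=float), np.asarray(y)
    ensemble0 = RandomForestClassifier(n_estimators=200,
                                       random_state=seed).fit(X_aug, y_aug)
    ensemble = ensemble0
    for it in range(iterations):
        classes, counts = np.unique(y_aug, return_counts=True)
        if counts.min() == counts.max():
            break  # already balanced; nothing left to generate
        minority = classes[np.argmin(counts)]
        # Generate candidate minority points (stand-in for Modules A and B).
        X_res, y_res = SMOTE(random_state=seed + it).fit_resample(X_aug, y_aug)
        X_cand, y_cand = X_res[len(X_aug):], y_res[len(y_aug):]
        # Screen candidates against the *initial* ensemble (stand-in for
        # Module C): drop points it confidently assigns to the majority class.
        col = list(ensemble0.classes_).index(minority)
        keep = ensemble0.predict_proba(X_cand)[:, col] >= min_conf
        X_aug = np.vstack([X_aug, X_cand[keep]])
        y_aug = np.concatenate([y_aug, y_cand[keep]])
        # Refit on the augmented data (ET-COME instead adds trees incrementally).
        ensemble = RandomForestClassifier(n_estimators=200,
                                          random_state=seed).fit(X_aug, y_aug)
    return ensemble, X_aug, y_aug
```

In the real method the transport and conformal components replace SMOTE and the confidence filter, and the ensemble is grown incrementally rather than refit from scratch.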
Performance Considerations:
Training time scales linearly with the number of iterations
Memory usage depends on oversampling ratio and dataset size
Recommendation: Start with 3-5 iterations and tune based on validation performance
Hyperparameter Tuning:
Key hyperparameters to tune:
- iterations: Number of refinement iterations (default: 5)
- n_neighbors: Neighborhood size for SMOTE (default: 5)
- sampling_strategy: Oversampling ratio (default: 'auto')
- classifier: Base classifier and its parameters
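A usage sketch with these hyperparameters is shown below. The import path and class name (ETCOME) are placeholders, and the fit_resample interface is assumed to follow the imbalanced-learn sampler convention; adjust both to match the actual package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from etcome import ETCOME  # placeholder import path; adjust to your installation

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

sampler = ETCOME(
    iterations=5,              # number of refinement iterations
    n_neighbors=5,             # SMOTE neighbourhood size
    sampling_strategy="auto",  # oversampling ratio
    classifier=RandomForestClassifier(n_estimators=300, n_jobs=-1),
)
# Assumes an imbalanced-learn style sampler interface.
X_res, y_res = sampler.fit_resample(X, y)
```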
Validation and Evaluation#
Metrics:
Due to class imbalance, standard accuracy is not recommended. Instead, use:
F1-score
Balanced Accuracy
Precision-Recall AUC
G-Mean
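For example, with scikit-learn and imbalanced-learn (which provides the geometric mean score), and a plain random forest standing in for the full pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (f1_score, balanced_accuracy_score,
                             average_precision_score)
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=300, n_jobs=-1,
                             random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]  # probability of the minority class (label 1)

print("F1:               ", f1_score(y_te, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_te, y_pred))
print("PR-AUC:           ", average_precision_score(y_te, y_score))
print("G-Mean:           ", geometric_mean_score(y_te, y_pred))
```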
Cross-Validation:
Always use stratified k-fold cross-validation to maintain class distribution in folds.
Warning
Do not use standard accuracy as the primary evaluation metric on imbalanced datasets.
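Along the same lines, a stratified cross-validation sketch that scores with imbalance-aware metrics instead of accuracy (again with a plain random forest as a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Stratified folds preserve the class distribution in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0),
    X, y, cv=cv,
    scoring=["f1", "balanced_accuracy", "average_precision"],
)
for name in ("test_f1", "test_balanced_accuracy", "test_average_precision"):
    print(name, scores[name].mean())
```

When oversampling is part of the pipeline, apply it inside each training fold (for example via imblearn.pipeline.Pipeline) so that synthetic points never leak into the validation folds.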
Memory and Runtime#
Memory Usage:
- Oversampling ratio significantly impacts memory
- Consider dataset size and available memory when setting sampling_strategy
- Set verbose=False to avoid storing intermediate diagnostic output
Runtime Optimization:
- Use n_jobs=-1 with scikit-learn classifiers for parallel processing
- Tune hyperparameters on a smaller subsample before running full experiments
- Pre-allocate arrays where possible in custom implementations
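A configuration sketch touching both points; the estimator name and constructor arguments are placeholders mirroring the hyperparameters listed earlier. A float sampling_strategy (imbalanced-learn's convention for binary problems: the desired minority-to-majority ratio) bounds the number of synthetic points and hence memory use, while n_jobs=-1 on the base classifier parallelises tree construction.

```python
from sklearn.ensemble import RandomForestClassifier
from etcome import ETCOME  # placeholder import path; adjust to your installation

sampler = ETCOME(
    iterations=3,           # fewer iterations => shorter runtime
    sampling_strategy=0.5,  # target minority/majority ratio of 0.5 instead of full balance
    classifier=RandomForestClassifier(n_estimators=200, n_jobs=-1),  # parallel trees
    verbose=False,          # skip intermediate diagnostics
)
```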
Limitations#
Binary Classification: Currently optimized for binary classification
Feature Types: Works best with numerical features
Small Datasets: May require more careful hyperparameter tuning
Categorical Features: Recommend encoding before use (see the encoding sketch below)
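For example, categorical columns can be one-hot encoded up front with scikit-learn before the data reaches the sampler (the column names here are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative frame with a mix of numerical and categorical columns.
df = pd.DataFrame({
    "amount": [10.0, 250.0, 32.5, 7.8],
    "channel": ["web", "store", "web", "app"],
    "country": ["DE", "US", "US", "FR"],
})

encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["channel", "country"])],
    remainder="passthrough",  # keep numerical columns as they are
)
X_encoded = encode.fit_transform(df)  # purely numerical matrix for the sampler
```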
Future Work#
Multiclass classification support
Automated hyperparameter tuning
Production-ready optimization
GPU acceleration for large datasets
Support for categorical features with native encoding
Troubleshooting#
Issue: Poor performance compared to baseline
Check that cross-validation is stratified
Verify that appropriate metrics are used (not accuracy)
Try increasing the number of iterations
Ensure base classifier is appropriate for your data
Issue: Training time very long
Reduce number of iterations
Use smaller validation set
Consider reducing dataset size for tuning
Use parallel processing (n_jobs=-1)
Issue: Memory errors
Reduce oversampling ratio
Use smaller batch sizes
Consider sampling the training data
Monitor memory usage during iterations