How to handle class-unbalanced data?

class imbalance

Data science

The majority class dominates while the minority class is underrepresented, leading models to bias their predictions toward the majority class.

Author

Kundan Kumar

Published

September 27, 2025

Class imbalance is a technique to handle the unbalanced data in the data-sets to build a reliable ML model and avoid the model for poor generalization.

Start with class weighting + resampling + proper evaluation metrics. Then move to more advanced techniques like SMOTE, focal loss, or anomaly detection if imbalance is extreme generalization.

Data-Level Approaches (Resampling Techniques)

Oversampling the minority class

Random Oversampling: duplicate minority class examples until balance is reached.
SMOTE (Synthetic Minority Over-sampling Technique): generates synthetic data points for the minority class by interpolating between nearest neighbors.
ADASYN: similar to SMOTE but focuses on generating harder-to-learn examples.

Undersampling the majority class

Random Undersampling: randomly remove majority class examples.
Cluster Centroids / Tomek Links / NearMiss: more informed undersampling to preserve useful structure.

Hybrid methods

Combine oversampling and undersampling to avoid overfitting or losing too much information.

Algorithm-Level Approaches

Class Weights / Cost-Sensitive Learning

Assign higher misclassification cost to minority class (many libraries like scikit-learn allow class_weight=‘balanced’).

Anomaly Detection / One-Class Models

Treat minority class as “rare events” and use anomaly detection approaches.

Ensemble Techniques

Use Bagging/Boosting with imbalance-aware modifications (e.g., Balanced Random Forest, EasyEnsemble, RUSBoost).

Evaluation-Level Approaches Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned). Precision: A measure of a classifiers exactness. Recall: A measure of a classifiers completeness F1 Score (or F-score): A weighted average of precision and recall. Avoid accuracy as the metric (it will be misleading).

Prefer metrics that account for imbalance:

Precision, Recall, F1-score

ROC-AUC, PR-AUC (especially useful with rare positives)

Matthews Correlation Coefficient (MCC)

Balanced Accuracy

. Data Collection & Domain Knowledge

Gather more examples of the minority class if possible (often the best long-term solution).

Use domain knowledge to engineer features that help separate classes better.

Active learning: selectively label more examples from uncertain regions.

Advanced / Modern Techniques

Generative Models (GANs, VAEs): to synthesize minority samples.

Semi-supervised or Self-supervised Learning: leverage unlabeled data to improve representation of minority class.

Focal Loss (in deep learning): gives higher weight to hard-to-classify (often minority) samples.

Course link

InnerSource and OpenSource

InnerSource: use of open source software development practices and open source-like culture (even though the software developed might still be proprietary)

When to OpenSource? Which license to use?

MIT license is the most permissive one

R package development

R packages are useful for

Reusability by other users, developers, “future me”
Robustness: well-tested, maintained for a longer period of time
Encapsulation: hidden complexitty inside the package (internal functions), stable interface exposed to the users (exported functions)

What can ben encapsulated in an R package?

R functions (internal, exported)
Tests (unit tests, testthat package)
Data (raw, processed)
Analytical code (reproducible analysis and reporting)
Text (literate programming): markdown, Rmarkdown, Quarto, Vignettes
Interactive applications (shiny)
Web APIs (with plumber)

Types of R packages: tool based, methods, analytical, web data project

Principles and tools

Reproducibility: Git (code versioning), dependencies (renv for r package dependencies, Docker for system dependencies)

Clean code

Code comments: not recommended! Better to write code in a way that does not need additional comments.

DRY: don’t repeat yourself (principle of software development), avoid copy and paste everywhere.

SRP: single-responsibility prinicple, a function should do one thing: either plot a chart, saves a file, changes variables etc, but not all.

Naming conventions

Reserve dots (.) for S3 methods (print.patient)
Reserve CamelCase for R6 classes or package names (OurPatients)
Use snake cases (all_patients) for function names and arguments, use verb noun pattern (plot_this())

Code smells

A function might be too large: break into smaller ones (e.g. could fit in one screen)

A function violates SRP: break into smaller ones, and be explicit in what result it is expected to return

A function with multiple arguments: the scenarios to be tested increase rapidly. Recommended to minimize number of critical function arguments, and break the function into smaller ones.

Bad comments in the code: drop the unnecessary, unclear, outdated comments, write code that are self-explanatory.

Development workflow

Code refactoring: change existing code without its functionality

TDD: Test-Driven Development

start with writing a new (failing) test
write code thtat passes the nenw tetst
refactor the code
and repeat

Benefits: your code is covered by tests; you think of testing scenarios first; “fail fast” - can immediately repair the code; more freedom to refactor (improve) the code.

How to test

automatically: CI/CD, after pushing Git commits
manually:
- run all unit tests in the package (Build / Test package)
- run tests in a selected test file (Run Tests)
- run a single test in Rstudio console

How to check

R CMD CHECK

Writing robust statistical software

Implement complext statistical methods such that the software is reliable, and includes appropriate testing to ensure high quality and validity and ultimately credibility of statistical analysis results.

choose the right method and understand them
solve the core implementation problem with prototype code

Need to try a few different solutions, compare and select the best one. Might also need to involve domain experts.

spend enough time on planning the design of the R package

Don’t write the package right away; instead define the scope, discuss with users, and design the package.

Start to draw a flow diagram, align names, arguments and classes; write prototype code.

assume the package will evolve over time

Packages you depend on will change; users will require new features

Write tests

unit tests
integration tests

Make the package extensible

consider object oriented package designs
combine functions in pipelines

Keep it manageable

avoid too many arguments
avoid too large functions

CI/CD for R packages

Continuous Integration: tests code changes to maintain the integrity of the codebase

Continuous Delivery: deploy artifacts (such as an R package) to target systems

Key components

Dependency management

Install dependencies (system/OS level; R packages)

Set repos (can be specified in options()) to e.g. CRAN, BioConductor
renv
container with dependencies pre-installed

Static code analysis

Linting (for programmatic and syntax errors) via lintr package
Code style enforcement via styler package
Spell checks identifies misspelled words in vignettes, docs and R code via spelling package

Testing

R CMD build builds R packages as a installable artifact
R CMD check runs 20+ checks including unit tests, reports errors, warnigns and notes
Test coverage reports with covr, checks how many lines of code are covered with tests
R CMD INSTALL tests R package installation

Documentation

Auto-generated docs via Roxygen and pkgdown

Release and deployments

Release artifacts and deployments to target systems

Changelog (features, bug fixes) in the NEWS.md
Release: create the package with R CMD build. Validation report with thevalidatoR
Publishing: CRAN, BioConductor