Practical Issues In Data Science Part 1: Imbalance Issue

Janice Khor
11 min read · Mar 9, 2021

This article mainly summarizes some of the points from the paper “Learning from Imbalanced Data” that are currently important to me. For those who are interested, the original paper is highly recommended: it provides a very good starting point for learning about the imbalance issue and covers many more topics and details than are included in this article.

In this article, three main topics will be discussed:

  1. Nature of the problem.
  2. Techniques.
  3. Assessment metrics.
Figure 1: The main contents in this article

Let’s dive in without further ado!

1. Nature of the problem

In this topic, a few important concepts are:

  1. Difference between intrinsic imbalance and extrinsic imbalance:
  • Intrinsic: The imbalance is a direct result of the nature of the dataspace, for example, in biomedical applications.
  • Extrinsic: The imbalance results from external factors such as time and storage.

2. Difference between relative imbalance and imbalance due to rare instances:

  • Relative imbalance: The sample size of the minority class is not necessarily small in its own right but is small relative to the majority class.
  • Imbalance due to rare instances: The examples of the minority class are very limited, resulting in a lack of representative data.

3. Dataset complexity is the primary determining factor of performance deterioration. However, the situation would be amplified by the addition of a relative imbalance.

4. Difference between between-class imbalance and within-class imbalance:

  • Between-class imbalance: One class severely out-represents another.
  • Within-class imbalance: The samples from a single class may exhibit different concepts/properties. Within-class imbalance happens when the examples of a subconcept are very limited.
Figure 2: Examples of different types of imbalance. Reference: https://ieeexplore.ieee.org/document/5128907

In Figure 2, between-class imbalance happens between:

  • (A+D) and (B+C) where (A+D) is the majority class with different concepts and (B+C) is the minority class with different concepts.

whereas within-class imbalance happens between:

  • A and D
  • B and C

Also, imbalance due to rare instances happens in C.

2. Techniques

2.1. Sampling methods

These techniques modify the imbalanced dataset through some mechanism in order to provide a balanced distribution.

2.1.1. Random

2.1.1.1. Random oversampling

Add a set of samples to the minority class by randomly selecting minority examples and replicating them.

Problem: Multiple instances of certain examples become ‘tied’, leading to overfitting.

2.1.1.2. Random undersampling

Randomly remove a set of majority examples.

Problem: May cause the classifier to miss important concepts pertaining to the majority class.
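
For illustration, a minimal sketch of both random sampling approaches, assuming the imbalanced-learn library and a synthetic dataset (neither is part of the original paper):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced dataset: roughly 90% majority, 10% minority
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# Random oversampling: replicate randomly selected minority examples
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_over))  # both classes now have the majority-class size

# Random undersampling: randomly remove majority examples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))  # both classes now have the minority-class size
```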

2.1.2. Informed undersampling

Objective: to overcome the information loss introduced by the traditional random undersampling method.

2.1.2.1. EasyEnsemble

Independently sample several subsets (with replacement) from the majority class. Then, multiple classifiers are developed, each based on the combination of one majority subset with the minority class data.
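
A minimal sketch, assuming imbalanced-learn's EasyEnsembleClassifier, which trains one AdaBoost learner per balanced subset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.ensemble import EasyEnsembleClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Each of the 10 estimators is trained on the minority class plus a
# randomly undersampled subset of the majority class
clf = EasyEnsembleClassifier(n_estimators=10, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```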

2.1.2.2. BalanceCascade

Systematically select which majority class examples to undersample: the majority class examples that were correctly classified in the previous step are removed from the pool so that they are not selected in the current and subsequent steps.

2.1.2.3. KNN based

2.1.2.3.1. NearMiss-1

Select the majority examples that have the smallest average distance to the 3 closest minority class examples.

2.1.2.3.2. NearMiss-2

Select the majority examples that have the smallest average distance to the 3 farthest minority class examples.

2.1.2.3.3. NearMiss-3

Select a given number of closest majority examples for each minority example to guarantee that every minority example is surrounded by some majority examples.

2.1.2.3.4. ‘Most distant’

Select the majority class examples that have the largest average distance to the 3 closest minority class examples.
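
The first three variants are implemented in imbalanced-learn's NearMiss class (the ‘most distant’ method is not); a minimal sketch, assuming that library:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

for version in (1, 2, 3):
    # n_neighbors=3 mirrors the "3 closest/farthest minority examples" above
    X_res, y_res = NearMiss(version=version, n_neighbors=3).fit_resample(X, y)
    print(version, Counter(y_res))
```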

2.1.3. Synthetic sampling with data generation

Objective: To break the ties introduced by simple oversampling.

Figure 3: SMOTE operation. Reference: https://ieeexplore.ieee.org/document/5128907

As shown in Figure 3, synthetic minority oversampling technique (SMOTE) creates artificial data based on the feature space similarities between existing minority examples.

In general, for any given minority class example, the technique randomly selects one of its K nearest minority class neighbours and generates a synthetic instance based on the equation below: x_new = x_i + (x_hat - x_i) × δ, where x_i is the minority example under consideration, x_hat is the randomly selected nearest neighbour, and δ is a random number in [0, 1].

Limitation: over-generalization and variance. SMOTE generates the same number of synthetic instances for each original minority class example without considering the neighbouring examples, which increases the occurrence of overlapping between classes.
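
A minimal SMOTE sketch, assuming imbalanced-learn (not the article's linked repository):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# k_neighbors controls how many nearest minority neighbours are candidates
# when interpolating each synthetic instance
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))
```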

2.1.4. Adaptive synthetic sampling

Objective: To overcome the limitations of SMOTE.

2.1.4.1. Borderline-SMOTE

Only those minority class instances that have more majority class neighbors than minority class neighbors are selected to form the set ‘DANGER’. The instances in DANGER represent the borderline minority class examples (i.e. the examples that are most likely to be misclassified).

The DANGER set is then fed to the SMOTE algorithm to generate synthetic minority samples in the neighborhood of the borders. By doing so, only the minority class instances that are ‘closer’ to the border are used to generate synthetic instances.
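
A minimal sketch, assuming imbalanced-learn's BorderlineSMOTE implementation:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# m_neighbors is used to decide whether a minority example is in DANGER,
# i.e. surrounded mostly by majority-class neighbours
X_res, y_res = BorderlineSMOTE(m_neighbors=10, random_state=42).fit_resample(X, y)
print(Counter(y_res))
```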

2.1.4.2. ADASYN

Adaptively generate different amounts of synthetic data according to the distribution of each minority class instance.

In general, the number of samples to be generated for a minority class instance depends on the proportion of majority class instances among the K nearest neighbours of that minority class instance.

More samples are generated for the minority class instances that have a high proportion of majority class instances among their K nearest neighbours. By doing so, ADASYN adaptively changes the weights of different minority class examples to compensate for the skewed distribution.
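
A minimal sketch, assuming imbalanced-learn's ADASYN implementation:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# The number of synthetic samples per minority example is proportional to the
# fraction of majority neighbours among its n_neighbors nearest neighbours
X_res, y_res = ADASYN(n_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))
```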

2.1.5. Sampling with data cleaning techniques

2.1.5.1. Tomek links

Tomek links are pairs of nearest neighbors from opposite classes that are minimally distanced from each other. If there is a Tomek link between two instances, one of two things holds:

  • One of these instances is noise.
  • Both instances are near a border between classes.

Therefore, Tomek links can be used to remove unwanted overlapping between classes after synthetic sampling.

By doing so, the distributions of class clusters are better defined, leading to improved classification performance.
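
A minimal cleaning sketch, assuming imbalanced-learn's TomekLinks implementation:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# sampling_strategy='all' removes both members of every Tomek link,
# cleaning the overlap region between the two classes
X_res, y_res = TomekLinks(sampling_strategy='all').fit_resample(X, y)
print(Counter(y), Counter(y_res))
```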

2.1.5.2. Edited nearest neighbor (ENN)

Remove instances whose class label differs from that of at least two of their three nearest neighbors.
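
A minimal sketch, assuming imbalanced-learn's EditedNearestNeighbours implementation:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# kind_sel='mode' removes an example when the majority of its 3 nearest
# neighbours disagree with its class label
X_res, y_res = EditedNearestNeighbours(n_neighbors=3, kind_sel='mode').fit_resample(X, y)
print(Counter(y), Counter(y_res))
```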

2.1.5.3. Others

Some other cleaning techniques:

  • One-sided selection (OSS)
  • Condensed nearest neighbor rule

Some works that perform oversampling followed by data cleaning are:

  • SMOTE+ENN
  • SMOTE+Tomek Links
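
Both combinations are available in imbalanced-learn; a minimal sketch assuming that library:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Oversample with SMOTE, then clean with ENN or Tomek links respectively
X_enn, y_enn = SMOTEENN(random_state=42).fit_resample(X, y)
X_tl, y_tl = SMOTETomek(random_state=42).fit_resample(X, y)
print(Counter(y_enn), Counter(y_tl))
```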

2.1.6. Cluster-based sampling method

Advantage: Can be tailored to target very specific problems.

2.1.6.1. Cluster-based oversampling (CBO) algorithm

Objective: To deal with the within-class imbalance problem first and then the between-class imbalance problem.

Step 1: Take a random set of K examples from each cluster (for both classes) and define the cluster centers by computing the mean feature vector of these examples.

Step 2: The remaining training examples are presented one at a time and are assigned to the cluster that exhibits the smallest distance vector magnitude.

Step 3: The cluster mean is updated after adding each training example. The steps are repeated until all examples are exhausted (i.e. only one cluster mean is essentially updated for each example).

Step 4: Oversample all majority class clusters except the cluster that has the largest number of examples, so that all majority class clusters are of the same size as that largest cluster.

Step 5: Determine the total number of majority class examples after the oversampling process.

Step 6: Oversample the minority class clusters so that the total number of minority class examples is the same as the total number of majority class examples.
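
The original CBO procedure updates cluster means online as described above; the following is only a simplified sketch of the same idea, assuming scikit-learn's batch k-means, random oversampling with replacement, and binary 0/1 labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.utils import resample

def cluster_based_oversample(X, y, majority_label, n_clusters=3, seed=42):
    """Simplified CBO sketch: balance clusters within each class, then balance classes.
    Assumes binary labels {0, 1} and enough samples per class for KMeans."""
    # Cluster the majority class, then inflate every cluster to the largest cluster's size
    X_maj = X[y == majority_label]
    maj_labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X_maj)
    target = np.bincount(maj_labels).max()
    maj_parts = [resample(X_maj[maj_labels == c], replace=True, n_samples=target,
                          random_state=seed) for c in range(n_clusters)]
    X_maj_bal = np.vstack(maj_parts)

    # Cluster the minority class, then inflate each cluster so that the minority
    # class as a whole (approximately) matches the inflated majority class size
    X_min = X[y != majority_label]
    min_labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X_min)
    per_cluster = len(X_maj_bal) // n_clusters
    min_parts = [resample(X_min[min_labels == c], replace=True, n_samples=per_cluster,
                          random_state=seed) for c in range(n_clusters)]
    X_min_bal = np.vstack(min_parts)

    X_bal = np.vstack([X_maj_bal, X_min_bal])
    y_bal = np.array([majority_label] * len(X_maj_bal) +
                     [1 - majority_label] * len(X_min_bal))
    return X_bal, y_bal

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_bal, y_bal = cluster_based_oversample(X, y, majority_label=0)
```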

2.1.7. Integration of sampling and boosting

2.1.7.1. SMOTEBoost (SMOTE with AdaBoost.M2)

Introduce synthetic sampling at each boosting iteration. By doing so, each successive classifier in the ensemble focuses more on the minority class examples (i.e. those that were poorly predicted in the previous iteration).

2.1.7.2. DataBoost-IM

Synthetic samples are generated according to the ratio of difficult-to-learn samples between classes.

In general, each example in the training set is weighted to represent its relative difficulty of learning. Then, the top k hardest examples are selected for oversampling. The generated samples are basically a collection of the hard-to-learn samples from both classes.

However, since minority class examples are generally more difficult to learn, it is expected that the number of generated minority class samples is greater than the number of generated majority class samples.

2.1.7.3. Over/under-sampling with jittering (JOUS-Boost)

Objective: To break the ties introduced by random oversampling.

Introduce perturbations (‘jittering’) to the duplicated samples generated from oversampling. By doing so, this introduces independently and identically distributed noise at each iteration of boosting.

Advantage: Simple and fast.

2.2. Cost-sensitive methods

These methods mitigate the imbalance issue by considering cost matrices that describe the costs of misclassifying any particular example.

Typically, the cost of misclassifying a minority class example is higher than the cost of misclassifying a majority class example, and there is no cost for correctly classifying examples of either class. The objective is to minimize the overall training cost, which is usually the Bayes conditional risk.

Limitation: Only applicable when a cost matrix and its associated cost items are available.

Three main categories of cost-sensitive methods:

  1. Cost-sensitive dataspace weighting with adaptive boosting

Cost-sensitive bootstrap sampling methods where misclassification costs are applied to obtain the best training distribution for induction. (Translation theorem)

2. Cost-sensitive classifiers

Cost-sensitive classifiers are developed by integrating standard learning algorithms with ensemble methods (Metacost framework).

3. Kernel-based methods.

These will not be covered here.

2.2.1. Cost-sensitive dataspace weighting with adaptive boosting

The cost item of each example is introduced into the weight-update equation of adaptive boosting, which affects the training distribution at the next iteration.

As a consequence, the probability of sampling a costly example will be increased, giving the classifier a higher opportunity to learn from the more costly instances.

The algorithms are:

  • AdaC1
  • AdaC2
  • AdaC3
  • AdaCost: Uses a cost-adjustment function that aggressively increases the weights of costly examples that are misclassified and conservatively decreases the weights of costly examples that are correctly classified.
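
None of these variants ship with scikit-learn. As a rough approximation of the general idea only (costs injected into the boosting weight distribution, not AdaC1/C2/C3 themselves), per-example costs can be passed as the initial sample weights of AdaBoost:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Hypothetical cost items: misclassifying a minority example (class 1)
# is treated as 10x more costly than misclassifying a majority example
costs = np.where(y == 1, 10.0, 1.0)

# Costs are used as the initial sample weights of the boosting distribution;
# the AdaC variants additionally re-apply costs at every boosting iteration
clf = AdaBoostClassifier(n_estimators=50, random_state=42)
clf.fit(X, y, sample_weight=costs)
```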

2.2.2. Cost-sensitive classifiers

2.2.2.1. Cost-sensitive decision trees

Three methods to embed the cost-sensitive concept into decision trees:

  1. Applied to the decision threshold.
  2. Applied to the split criteria at each node.
  3. Applied to pruning schemes.

Applied to the decision threshold.

The ROC curve is used to determine the optimal decision threshold.
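
As a sketch of one common way to do this (not necessarily the exact procedure in the referenced works), the threshold can be chosen from the ROC curve on held-out data by maximizing TPR - FPR:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
scores = tree.predict_proba(X_val)[:, 1]

# Pick the threshold on the ROC curve that maximizes TPR - FPR (Youden's J)
fpr, tpr, thresholds = roc_curve(y_val, scores)
best_threshold = thresholds[np.argmax(tpr - fpr)]
y_pred = (scores >= best_threshold).astype(int)
```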

Applied to the split criteria at each node.

Gini, entropy, and DKM are used as cost-sensitive functions in the split criterion.

In general, the DKM function produces smaller unpruned decision trees with promising performance.

Applied to pruning schemes.

Pruning is a process to remove leaves with class probability estimates below a specified threshold. It helps to improve the overall generalization.

Pruning tends to remove the leaves describing the minority class concept. However, research has also shown that leaving the trees unpruned on an imbalanced dataset does not improve performance either; in other words, pruning makes little difference either way when it comes to an imbalanced dataset.

Hence, researchers are focusing on improving the class probability estimate at each node to develop a more representative decision tree structure. Some of the works are the Laplace smoothing method of the probability estimate and the Laplace pruning technique.

2.2.2.2. Cost-sensitive neural networks

There are basically four approaches to embed the cost-sensitive concept into neural networks:

  1. Applied to the probabilistic estimate.
  2. Applied to the outputs of neural networks.
  3. Applied to the learning rate.
  4. Change the original error minimization function (i.e. the cost function) into expected cost minimization function.

Applied to the probabilistic estimate.

The probability estimate of the neural network is adaptively modified during the testing stage.

Applied to the outputs of neural networks.

The outputs of the neural network are altered during the training stage so that the neural network is forced to focus more on the high-cost class.

Applied to the learning rate.

Assign a smaller learning rate to high-cost examples and a larger learning rate to low-cost examples. By doing so, training becomes more effective and yields improvements over the base classifier.

Change the original error minimization function (i.e. the cost function) into expected cost minimization function.

This approach is the most dominant.
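
The paper does not prescribe a specific implementation; as a minimal sketch, a class-weighted cross-entropy loss (here with PyTorch and a hypothetical 1:10 cost ratio) is one common way to turn error minimization into expected cost minimization:

```python
import torch
import torch.nn as nn

# Hypothetical cost ratio: errors on the minority class (index 1) are 10x more costly
class_costs = torch.tensor([1.0, 10.0])
criterion = nn.CrossEntropyLoss(weight=class_costs)

model = nn.Linear(20, 2)               # toy model with 20 input features
logits = model(torch.randn(8, 20))     # dummy batch of 8 examples
targets = torch.randint(0, 2, (8,))    # dummy labels
loss = criterion(logits, targets)      # cost-weighted loss used for backpropagation
loss.backward()
```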

2.3. Additional Methods

2.3.1. One-class learning

One-class learning is advantageous in handling extremely imbalanced datasets with high feature space dimensionality. Some of the works are:

  • The autoencoder can be used to reconstruct the positive class at the output layer.
  • Novelty detection methods are appropriate for severely imbalanced datasets, whereas regular discrimination-based inductive classifiers are appropriate for moderately imbalanced learning.
  • The Mahalanobis-Taguchi System (MTS), which learns by developing a continuous measurement scale using single-class examples, is effective when dealing with skewed data distributions.
  • A probabilistic thresholding method based on Chebyshev’s theorem is applied to determine the optimal threshold for MTS classification.

2.3.2. Small sample size and imbalanced learning

  • Rank metrics are used as the training criteria to deal with imbalanced data with small sample sizes and high dimensionality, by emphasizing distinguishing the classes themselves instead of the internal structure of the classes. Also, model selection criteria are used instead of the traditional accuracy metric.
  • A multi-task learning methodology utilizes a shared representation of the data to train extra task models related to the main task. By doing so, the effective size of the underrepresented class is amplified by the extra training information added to the data.

3. Assessment metrics for imbalanced learning

The evaluation metrics used for imbalanced learning are:

  1. Precision
  2. Recall
  3. F-score
  4. G-mean
  5. Precision-Recall (PR) curves

3.1. Precision

Precision describes how many of the examples labelled as positive are actually positive: Precision = TP / (TP + FP).

3.2. Recall

Recall describes how many of the actual positive examples are correctly labelled as positive: Recall = TP / (TP + FN).

3.3. F-score

The F-score combines precision and recall such that their contributions can be adjusted by the coefficient beta: F_beta = (1 + beta^2) × (Precision × Recall) / (beta^2 × Precision + Recall).

The default beta is 1, which gives equal contribution to recall and precision. A lower beta (e.g. 0.5) gives more weight to precision, whereas a higher beta (e.g. 2) gives more weight to recall.

3.4. G-mean

G-mean evaluates the degree of inductive bias in terms of positive-class accuracy (sensitivity) and negative-class accuracy (specificity), combined as their geometric mean: G-mean = sqrt(sensitivity × specificity).

3.5. Difference between Receiver operating characteristics (ROC) curve and precision-recall (PR) curves

The PR curve is a better option when evaluating the performance of a model on an imbalanced dataset: the ROC curve plots the true positive rate against the false positive rate and can give an overly optimistic view of performance on highly skewed data, whereas the PR curve plots precision against recall and reveals differences between algorithms more clearly. The differences between the ROC and PR curves are stated in Table 1.

Table 1: Difference between ROC curve and PR Curve
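
A minimal sketch of computing these metrics with scikit-learn and imbalanced-learn (separate from the repository linked below):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, fbeta_score, precision_recall_curve
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]

print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))
print('F2-score :', fbeta_score(y_test, y_pred, beta=2))   # beta > 1 favours recall
print('G-mean   :', geometric_mean_score(y_test, y_pred))
precision, recall, thresholds = precision_recall_curve(y_test, y_score)  # PR curve points
```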

Source Code Example:

https://github.com/JaniceKhor/imbalance.git

References

H. He and E. A. Garcia, “Learning from Imbalanced Data,” in IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, Sept. 2009, doi: 10.1109/TKDE.2008.239.
