Hongbo Shi* , Xin Chen* and Min Guo*Re-SSS: Rebalancing Imbalanced Data Using Safe Sample ScreeningAbstract: Different samples can have different effects on learning support vector machine (SVM) classifiers. To rebalance an imbalanced dataset, it is reasonable to reduce non-informative samples and add informative samples for learning classifiers. Safe sample screening can identify a part of non-informative samples and retain informative samples. This study developed a resampling algorithm for Rebalancing imbalanced data using Safe Sample Screening (Re-SSS), which is composed of selecting Informative Samples (Re-SSS-IS) and rebalancing via a Weighted SMOTE (Re-SSS-WSMOTE). The Re-SSS-IS selects informative samples from the majority class, and determines a suitable regularization parameter for SVM, while the Re-SSS-WSMOTE generates informative minority samples. Both Re-SSS-IS and Re-SSS-WSMOTE are based on safe sampling screening. The experimental results show that Re-SSS can effectively improve the classification performance of imbalanced classification problems. Keywords: Imbalanced Data , Safe Sample Screening , Re-SSS-IS , Re-SSS-WSMOTE 1. IntroductionRare events (such as cancer in cancer detection [1], financial fraud in financial fraud detection [2], and intrusion events in intrusion detection [3]) are usually difficult to detect owing to their relative scarcity; however, detecting rare events is more critical than detecting ordinary events in many practical problems. Detecting rare events is, in essence, the process of identifying samples in a minority class from an imbalanced dataset. Researchers in the field of imbalanced classification problems have long been focusing on improving the recognition rate of minority samples. Currently, two main strategies are used to address classification problems with imbalanced data. The first strategy is to change the distribution of various classes in a dataset, and the second is to design or modify learning algorithms to reduce the negative effect of class imbalance. The first strategy can be further classified into undersampling [4-6], oversampling [7-11], and hybrid sampling methods [12-14], which are used to change the class distribution. Undersampling rebalances an imbalanced dataset by removing part of the samples from the majority class. Oversampling selects some samples from the minority class or generates new samples based on the existing minority samples, and then, adds the selected or generated samples into the minority class, thereby obtaining a balanced dataset. Hybrid sampling transforms an imbalanced dataset into a balanced dataset by adding minority samples via oversampling and reducing the majority samples via undersampling. The second strategy encompasses common methods, such as cost-sensitive techniques [15-17] and ensemble classifiers [18-20]. Costsensitive methods assign higher costs to the minority samples so that the learned classifiers can identify more minority samples. The ensemble classifiers divide the majority samples into several subsets, where each subset has a similar size to the minority class. Once several balanced datasets were generated, each can be used to learn a classifier that will be later combined into an ensemble classifier. This study focuses on the first strategy for handling imbalanced classification problems. The basic objective of this strategy is to change the distribution of datasets and balance the sample size of various classes. This strategy has been featured in recent researches [21-23], which placed more emphasis on the availability of each sample, expecting to incorporate as many informative samples for the classifiers as possible in the balanced dataset. In fact, different classification models have different preferences for samples. For example, classification models based on classification decision boundary are more dependent on the samples near the decision boundary, whereas classification models based on data distribution are more dependent on the overall and local distribution of the samples. Therefore, to obtain informative samples for learning a given classifier, the nature of classification models should be considered before selecting a method for changing the distribution of data. Support vector machine (SVM) is a classification model based on classification decision boundary, and the learned decision hyperplane is only related to support vectors located near the decision hyperplane. Thus, it is reasonable for decision boundary-based classifiers to employ SVM as a preprocessing method to tackle the problem of imbalanced data. Farquad and Bose [24] employed a trained SVM to preprocess imbalanced data. As a result, more minority samples were correctly predicted without compromising the accuracy of the system. Lin [26] set a regularization parameter for SVM by employing a classification performance. They then used an SVM with the selected regularization parameter as a preprocessor for an imbalanced dataset for further modeling. After the original dataset was balanced using the SVM, the classification ability for the minority samples was improved. Wang [12] learned an SVM decision hyperplane, and resampled an imbalanced dataset in the light of the distance between majority samples and the SVM hyperplane to balance the dataset. Based on the initial hyperplane of SVM, Guo et al. [27] selected key samples of the majority class and learned a final SVM classifier using the key samples of the majority class and all the minority samples. The regularization parameter in SVM is widely known to be crucial in learning classification hyperplanes. Different regularization parameters produce different classification hyperplanes for a given dataset. Although several studies [12,24,26] realized the importance of regularization parameters for SVM, these methods learned SVM classifiers by setting regularization parameter without specifying an explicit method for its selection. The SVM regularization parameter in [25] was selected using the enumeration method. Safe sample screening [27] constructs a series of safe sample screening rules using a regularization path algorithm [28]. For each regularization parameter, safe sample screening can identify a part of noninformative samples, and screen them out prior to the training phase without affecting the performance of the classifiers. Safe sample screening has two notable features. First, it can distinguish part of noninformative samples from a given dataset. Second, it can obtain a series of screened datasets corresponding to multiple regularization parameters. These two features inspired us to employ safe sample screening for handling imbalanced data. As safe sample screening does not consider the characteristics of imbalanced data, we need to solve some problems to apply safe sample screening to imbalanced data. The challenges include the selection of a suitable regularization parameter for obtaining an informative screened dataset from a series of screened datasets and the utilization of a series of screened datasets to generate informative minority samples for oversampling. In this study, we developed a resampling algorithm, called Re-SSS, for imbalanced datasets based on safe sample screening. The Re-SSS algorithm is composed of Re-SSS-IS and Re-SSS-WSMOTE. The Re-SSS-IS selects a suitable regularization parameter for an imbalanced dataset and employs the screened dataset, corresponding to the suitable regularization parameter, to obtain informative samples from the majority class. The Re-SSS-WSMOTE sets the weight for each sample in the minority class based on a series of screened datasets, then generates informative minority samples based on the weighted minority samples, and finally adds the synthetic samples into the dataset. This study is based on our previous work [29,30]. In [29], the authors applied safe double screening (including sample screening and feature screening) to the higher dimensional imbalanced data, while both [30] and this study merely adopted safe sample screening. Undersampling methods in [29,30] discarded a part of samples in the majority according to the classification performance of learned SVM classifiers, which is time consuming. To improve efficiency, this study set the number of retained minority samples as the criteria of discarding samples. Moreover, this study developed a new oversampling algorithm Re-SSS-WSMOTE, while both [29] and [30] directly used SMOTE. The main contributions of this study are as follows: 1. A resampling algorithm based on safe sample screening is developed. In this algorithm, the informative samples for the SVM classifier in the majority class are retained and the synthetic minority samples are generated using a series of screened datasets to obtain a balanced dataset. 2. A feasible method of selecting the regularization parameter of the SVM classifier for imbalanced data is developed. This method employs the number of retained samples in the minority class after safe sample screening to select the regularization parameter of the SVM classifier. 3. Experiments are conducted to verify the effectiveness of the developed resampling algorithm and the method of selecting the best regularization parameter. The rest of this paper is organized as follows. Section 2 introduces the related work of method proposed in this paper with an emphasis on SVM and safe sample screening. Section 3 introduce the developed resampling algorithm in detail. Section 4 presents the experimental datasets and gives the experimental results and analysis. Section 5 draws our conclusion and future work outlook. 2. Related Work2.1 SVMConsidering a training dataset [TeX:] $$\boldsymbol{D}=\left\{\left(\boldsymbol{x}_{i}, y_{i}\right) \mid \boldsymbol{x}_{i} \in \boldsymbol{R}^{d}, y_{i} \in\{-1,+1\}, i=1,2, \ldots, n\right\}$$ with d features and n samples, the classification decision function learned by SVM can be expressed as
where [TeX:] $$\boldsymbol{w}=\left(w_{1}, w_{2}, \ldots, w_{d}, w_{d+1}\right)$$ is composed of the weight vectors [TeX:] $$\left(w_{1}, w_{2}, \ldots, w_{d}\right)$$ for the learned decision hyperplane and model bias [TeX:] $$w_{d+1}.$$ To obtain an SVM classifier with fault tolerance capability, a soft margin SVM can be built by solving the following optimization problem.
(2)[TeX:] $$\min _{w} P_{\lambda}(\boldsymbol{w})=\frac{\lambda}{2}\|\boldsymbol{w}\|^{2}+\sum_{i=1}^{n} \ell\left(y_{i} f\left(\boldsymbol{x}_{i}\right)\right).$$where [TeX:] $$\ell(\bullet)$$ is a loss function, and [TeX:] $$\lambda$$ is a regularization parameter for controlling the trade-off between the regularization term and the loss term. If a hinge loss function is adopted and a slack variable [TeX:] $$\xi_{i}$$ is introduced, (2) can be rewritten as
(3)[TeX:] $$\begin{array}{c} \min _{w} P_{\lambda}(\boldsymbol{w})=\frac{\lambda}{2}\|\boldsymbol{w}\|^{2}+\sum_{i=1}^{n} \xi_{i} \\ \text { s.t. } y_{i} f\left(\boldsymbol{x}_{i}\right) \geq 1-\xi_{i}, \quad \xi_{i} \geq 0, i=1,2, \ldots, n. \end{array}$$The dual problem in (3) can be written as
(4)[TeX:] $$\max _{\alpha} Q_{\lambda}(\boldsymbol{\alpha})=\sum_{i=1}^{n} \alpha_{i}-\frac{1}{2 \lambda} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j}\left\langle\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right\rangle \text { s.t. } 0 \leq \alpha_{i} \leq 1, i=1, \ldots, n.$$If an n-dimensional vector of Lagrange multipliers [TeX:] $$\boldsymbol{\alpha}^{*}=\left(\alpha_{1}^{*}, \ldots, \alpha_{i}^{*}, \ldots, \alpha_{n}^{*}\right)$$ is the solution of (4), the classification decision function f(x) in (1) can be rewritten as
(5)[TeX:] $$f(\boldsymbol{x})=\frac{1}{\lambda} \sum_{i=1}^{n} \alpha_{i}^{*} y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{x},$$where [TeX:] $$\alpha_{i}^{*}$$ is the Lagrange multiplier corresponding to [TeX:] $$\left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{D}.$$ Based on the optimality conditions of (3) or (4), the samples in D can be categorized into three types: safe sample [TeX:] $$(S S)$$ , boundary sample [TeX:] $$(B S),$$ and noise sample [TeX:] $$(N S).$$
(6)[TeX:] $$\begin{array}{c} \left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{S S}: y_{i} f\left(\boldsymbol{x}_{i}\right)>1 \Rightarrow \alpha_{i}^{*}=0, \\ \left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{B} \boldsymbol{S}: y_{i} f\left(\boldsymbol{x}_{i}\right)=1 \Rightarrow \alpha_{i}^{*} \in(0,1), \\ \left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{N} \boldsymbol{S}: \quad y_{i} f\left(\boldsymbol{x}_{i}\right)<1 \Rightarrow \alpha_{i}^{*}=1, \end{array}$$The safe sample, located outside the classification margin, is far from the classification hyperplane. For any sample [TeX:] $$\left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{S S}, \alpha_{i}^{*} y_{i} \boldsymbol{x}_{i}^{T}=\mathbf{0} \text { holds as } \alpha_{i}^{*}=0.$$ Therefore, the safe sample has no influence on the determination of the decision function f(x) in (5). Even if these samples are removed from the training data set, the classification hyperplane would not be changed. The boundary sample lies on the boundary of the margin, and is near to the classification hyperplane. As [TeX:] $$\alpha_{i}^{*}$$ corresponding to [TeX:] $$\left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{B} \boldsymbol{S}$$ is a nonzero value, [TeX:] $$\alpha_{i} y_{i} \boldsymbol{x}_{i}^{T} \neq \mathbf{0}$$ always holds, except for [TeX:] $$\boldsymbol{x}_{i}=\mathbf{0}.$$ This means that the boundary sample is involved in the calculation of the decision function [TeX:] $$f(\boldsymbol{x}),$$ thereby affecting the choice of SVM classification hyperplane. For the noise sample [TeX:] $$\left(\boldsymbol{x}_{i}, y_{i}\right), \alpha_{i} y_{i} \boldsymbol{x}_{i}^{T} \neq \mathbf{0}$$ holds in most cases, except for [TeX:] $$x_{i}=0.$$ Thus, the noise sample will also affect the decision function [TeX:] $$f(\boldsymbol{x})$$ in (5). The location of noise sample is related to slack variable [TeX:] $$\xi_{i} . \text { If } \xi_{i}>1,$$ the noise sample [TeX:] $$\left(\boldsymbol{x}_{i}, y_{i}\right)$$ is located between the classification hyperplane and the margin boundary close to the true class. To simplify the problem, all the samples with [TeX:] $$\alpha_{i}=1$$ are known as noise samples. By analyzing the three types of samples, it was found that the boundary sample should be given more attention for learning the SVM classifier. For imbalanced datasets, the imbalanced ratio (the most common class imbalance metric) is the proportion of the samples between various classes. However, the classification performance of SVM is more dependent on the boundary sample, rather than on all the samples. Therefore, to learn an SVM classifier on an imbalanced dataset, we should focus on the proportion of boundary samples between different classes, rather than the proportion of all the samples between different classes. In addition, [TeX:] $$\lambda$$ in (2) is generally understood as a trade-off parameter for balancing the generalization and fitting performances of SVM. For imbalanced data, the value of [TeX:] $$\lambda$$ can influence the position of the classification hyperplane. In general, when the [TeX:] $$\lambda$$ value is smaller, the hyperplane moves more toward the majority class; thus, more minority samples can be correctly classified [24]. For example, as shown in Fig. 1, compared with the decision hyperplane (solid line) learned by SVM with [TeX:] $$\lambda=1,$$ the decision hyperplane (dashed lines) learned by SVM with [TeX:] $$\lambda=0.0001$$ moves toward the majority samples (red dots), and three more samples (green stars) of the minority class are correctly predicted. Therefore, to improve the recognition rate of the minority samples, it is necessary to set an appropriate value of [TeX:] $$\lambda.$$ However, if the value of [TeX:] $$\lambda$$ is selected through enumeration method for learning the best SVM model, it would be time consuming. 2.2 Safe Sample ScreeningSafe sample screening [27,31] is based on the SVM and regularization path algorithms. Given a dataset, safe sample screening can rapidly screen out parts of the safe samples via safe sample screening rules, and generate a series of screened subsets using a regularization path algorithm. 2.2.1 Safe sample screening rulesSafe sample screening rules can be used to identify non-informative samples. To construct safe sample screening rules, we adopted the objective function of SVM from (2) for safe sample screening. However, owing to the non-differentiability of the hinge loss function at the inflection point, the smooth hinge loss function from (7) was adopted as the loss function of safe sample screening, to ensure that it is differentiable everywhere within the range of values.
(7)[TeX:] $$\ell\left(y_{i} f\left(\boldsymbol{x}_{i}\right)\right)=\left\{\begin{array}{ll} 0 & y_{i} f\left(\boldsymbol{x}_{i}\right)>1 \\ \frac{1}{2 \gamma}\left[1-y_{i} f\left(\boldsymbol{x}_{i}\right)\right]^{2}, & 1-\gamma \leq y_{i} f\left(\boldsymbol{x}_{i}\right) \leq 1, \\ 1-y_{i} f\left(\boldsymbol{x}_{i}\right)-\frac{\gamma}{2}, & y_{i} f\left(\boldsymbol{x}_{i}\right)<1-\gamma \end{array}\right.$$where [TeX:] $$\gamma>0$$ is a tuning parameter. The dual problem of safe sample screening can be written as
(8)[TeX:] $$\max _{\alpha} D_{\lambda}(\boldsymbol{\alpha})=-\frac{\lambda}{2} \sum_{j=1}^{d}\left(\sum_{i=1}^{n} \frac{1}{\lambda n} x_{i j} \alpha_{i} y_{i}\right)^{2}-\frac{1}{n} \sum_{i=1}^{n}\left(\frac{\gamma}{2} \alpha_{i}^{2}-y_{i} \alpha_{i}\right).$$Let us assume that [TeX:] $$\boldsymbol{w}^{*}=\left(w_{1}^{*}, \ldots, w_{d}^{*}, w_{d+1}^{*}\right) \text { and } \boldsymbol{\alpha}^{*}=\left(\alpha_{1}^{*}, \ldots, \alpha_{i}^{*}, \ldots, \alpha_{n}^{*}\right)$$ represent the optimal solution of the primal and dual problems for safe sample screening, respectively. In the case of smooth hinge loss, according to the Karush-Kuhn-Tucker (KKT) optimality conditions, we can obtain
(9)[TeX:] $$y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{w}^{*}=\left\{\begin{array}{ll} {[1, \infty)} & \alpha_{i}^{*}=0 \\ (1-\gamma, 1), & \alpha_{i}^{*} \in(0,1). \\ (-\infty, 1-\gamma], & \alpha_{i}^{*}=1 \end{array}\right.$$Similar to SVM, the samples with [TeX:] $$\alpha_{i}^{*}=0, \alpha_{i}^{*} \in(0,1), \text { and } \alpha_{i}^{*}=1$$ are called safe samples, boundary samples, and noise samples, respectively. Safe sample screening is aimed at removing parts of the safe and noise samples and retaining all of the boundary samples. To identify safe and noise samples, a solution space [TeX:] $$\Theta_{w^{*}}$$ containing the optimal solution [TeX:] $$\boldsymbol{W}^{*}$$ is first constructed by employing the feasible solutions of the primal and dual problems. Specifically, for any given feasible solutions [TeX:] $$\widehat{\boldsymbol{w}} \in \operatorname{dom} P_{\lambda}$$ and [TeX:] $$\widehat{\boldsymbol{\alpha}} \in \operatorname{dom} D_{\lambda},$$
(10)[TeX:] $$\boldsymbol{w}^{*} \in \Theta_{w^{*}}=\left\{\boldsymbol{w} \mid\|\widehat{\boldsymbol{w}}-\boldsymbol{w}\| \leq \sqrt{2\left[P_{\lambda}(\widehat{\boldsymbol{w}})-D_{\lambda}(\widehat{\boldsymbol{\alpha}})\right] / \lambda}\right\}.$$A pair of lower and upper bounds of [TeX:] $$y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{w}^{*}$$ is given as
(11)[TeX:] $$\operatorname{LB}\left(y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{w}^{*}\right)=y_{i} \boldsymbol{x}_{i}^{T} \widehat{\boldsymbol{w}}-\left\|y_{i} \boldsymbol{x}_{i}\right\| \sqrt{2\left[P_{\lambda}(\widehat{\boldsymbol{w}})-D_{\lambda}(\widehat{\boldsymbol{\alpha}})\right] / \lambda},$$
(12)[TeX:] $$\operatorname{UB}\left(y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{w}^{*}\right)=y_{i} \boldsymbol{x}_{i}^{T} \widehat{\boldsymbol{w}}+\left\|y_{i} \boldsymbol{x}_{i}\right\| \sqrt{2\left[P_{\lambda}(\widehat{\boldsymbol{w}})-D_{\lambda}(\widehat{\boldsymbol{\alpha}})\right] / \lambda}.$$According to (9), (11), and (12), safe sample screening rules can be represented as follows. Screening rule 1: If [TeX:] $$\operatorname{LB}\left(y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{w}^{*}\right) \geq 1, \text { then }\left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{S} \boldsymbol{S} \text { and }\left(\boldsymbol{x}_{i}, y_{i}\right)$$ can be discarded. Screening rule 2: If [TeX:] $$\text { UB }\left(y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{w}^{*}\right) \leq 1-\gamma, \text { then }\left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{N} \boldsymbol{S} \text { and }\left(\boldsymbol{x}_{i}, y_{i}\right)$$ can be discarded. Using the above mentioned two rules, the safe and noise samples, which were identified, will be discarded, and the remaining samples will be retained. The advantage of this method is that it can reduce the sample size by employing the relationship between the optimal and feasible solutions, without directly solving the optimization problems. 2.2.2 Regularization path solving strategyTo set a [TeX:] $$\lambda$$ value in (2), the authors [28] proposed an SVM regularization path algorithm that can quickly solve all feasible [TeX:] $$\lambda$$ values and the corresponding SVM on a given sample set D. The initial [TeX:] $$\lambda$$ value can be obtained from the original dataset. When the boundary samples of the interval change, each [TeX:] $$\lambda_{m}$$ is solved from its previous [TeX:] $$\lambda_{m-1},$$ and the iteration is continued until there are no samples in the interval or [TeX:] $$\lambda$$ is reduced to 0. Owing to the piecewise linearity of the SVM regularization path on regularization parameters, a complete regularization path can be obtained by solving the inflection points of the regularization parameters. In a given dataset, it is not necessary to solve all the values of [TeX:] $$\lambda.$$ In [27], the regularization parameter values in a given range were solved. Safe sample screening only constructs safe sample screening rules corresponding to the regularization parameter values in this range. As the convergence of SVM tends to be faster for larger regularization parameters, the regularization path is computed from larger [TeX:] $$\lambda$$ to smaller [TeX:] $$\lambda$$ using the warm-start method [27]. In the solving process, the previous optimal solution at [TeX:] $$\lambda_{m-1}$$ is used as the initial starting point of the next optimization problem for [TeX:] $$\lambda_{m}.$$ The upper bound [TeX:] $$\lambda_{\max }$$ and lower bound [TeX:] $$\lambda_{\min }$$ of the range are as follows:
(13)[TeX:] $$\lambda_{\max }=\max _{1 \leq j \leq d}\left|\frac{1}{n} \sum_{i=1}^{n} x_{i j} y_{i}\right| ; \quad \lambda_{\min }=10^{-4} \lambda_{\max }.$$Given that [TeX:] $$\lambda_{m} \in\left[\lambda_{\min }, \lambda_{\max }\right],$$ the values of the upper and lower bounds for [TeX:] $$y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{w}^{*}$$ of each sample can be determined by (11) and (12), respectively. The samples that meet the screening conditions of screening rules 1 and 2 will be removed, and the retained samples are the result of the safe sample screening described in [27]. 3. Proposed AlgorithmWe developed the Re-SSS algorithm, comprising Re-SSS-IS and Re-SSS-WSMOTE, to change the distribution of imbalanced data based on safe sample screening. The former is used to select a suitable regularization parameter for imbalanced data and obtain informative samples of the majority class, and the latter is used to generate informative minority samples. Notably, both Re-SSS-IS and Re-SSSWSMOTE can be performed as a part of the Re-SSS, or separately. 3.1 Re-SSS-ISSafe sample screening can generate a series of screened datasets, each of which contains fewer samples than the original dataset. To find the informative majority samples from an imbalanced dataset using the safe sample screening, two problems should be solved: the first is setting up the range of the regularization parameter values, and the second is selecting the suitable regularization parameter and its corresponding screened dataset. To solve the first problem, we first analyzed the range of the [TeX:] $$\lambda$$ values (see (13)) in [27] and adjusted it for imbalanced data. To simplify the discussion, we assumed that [TeX:] $$\left|\frac{1}{n} \sum_{i=1}^{n} x_{i j} y_{i}\right|$$ in (13) reaches the maximum at the j-th attribute and that [TeX:] $$x_{i j}=1(i=1, \ldots, n) ; \text { thus, } \lambda_{\max }$$ in (13) can be rewritten as
where [TeX:] $$n_{+} \text {and } n_{-}$$ represent the number of minority and majority samples in the original dataset, respectively. For imbalanced datasets, [TeX:] $$\left(n_{-}-n_{+}\right)$$ is usually very large. However, [29] found that, if [TeX:] $$\lambda$$ is very large, there will be little or no boundary samples. Thus, the safe samples in the minority class are retained, which is not beneficial for identifying samples in the minority class. To avoid this case, we added a hyperparameter [TeX:] $$c<1$$ into (14) (the value of c is given in Section 4.3, which was determined in our experiments), and obtained the maximum value of the regularization parameter [TeX:] $$\lambda$$ as
A smaller [TeX:] $$\lambda_{\max }$$ retains more boundary and safe samples in the minority class. In addition, [TeX:] $$\lambda_{\min }$$ was assigned in the same way as (13); thus, the range of regularization parameter values was [TeX:] $$\left[\lambda_{\min }, \lambda_{\max }\right].$$ For the second problem, our solution was to find a classification hyperplane with maximum margin, which could correctly predict as many informative minority samples as possible. For a given regularization parameter [TeX:] $$\lambda$$, the sets of safe, boundary, and noise samples in the minority class were denoted as [TeX:] $$\boldsymbol{S S}_{\lambda}^{+}, \boldsymbol{B} \boldsymbol{S}_{\lambda}^{+}, \text {and } \boldsymbol{N} \boldsymbol{S}_{\lambda}^{+},$$ respectively. As the noise samples may have a negative effect on the classifier, we expected that only the safe and boundary samples were correctly identified. Hence, we wanted to find a suitable regularization parameter [TeX:] $$\lambda^{*},$$ which has the maximum number of safe and boundary samples. This solution can be expressed as
(16)[TeX:] $$\lambda^{*}=\underset{\lambda}{\operatorname{argmax}}\left|\boldsymbol{B} \boldsymbol{S}_{\lambda}^{+} \cup \boldsymbol{S} \boldsymbol{S}_{\lambda}^{+}\right|,,$$First, utilizing the regularization path algorithm, the Re-SSS-IS algorithm quickly obtained a series of feasible [TeX:] $$\lambda$$ values, with their corresponding screened datasets. Then, the screened dataset with the largest [TeX:] $$\left|B S_{\lambda}^{+} \cup S S_{\lambda}^{+}\right|$$ was selected, and the corresponding [TeX:] $$\lambda$$ was set as the suitable regularization parameter [TeX:] $$\lambda^{*}.$$ Lastly, the majority samples in [TeX:] $$\boldsymbol{B} \boldsymbol{S}_{\lambda^{*}}$$ corresponding to [TeX:] $$\lambda^{*}$$ were taken as the set of informative majority samples [TeX:] $$\boldsymbol{B} \boldsymbol{S}_{\lambda^{*}}^{-}.$$ 3.2 Re-SSS-WSMOTETo generate informative minority samples, this study developed a modified SMOTE algorithm, Re- SSS-WSMOTE, for imbalanced data. SMOTE is a popular oversampling method that generates synthetic samples from existing minority samples. However, not all the minority samples are useful for learning an SVM, and the samples far from the decision hyperplane are more likely to have no effect on learning the classifier. In general, if both the sample and its selected similar sample are boundary samples, the sample generated by combining these two samples is more likely to be a boundary sample; otherwise, the generated sample will be more likely to be a safe sample. Thus, the availability of a synthetic sample for SVM is related to the availability of the two selected original samples. Next, we need to consider how to determine the availability of each sample. In Section 2.1, we compared two decision hyperplanes learned by the SVM with [TeX:] $$\lambda=0.0001 \text { and } \lambda=1,$$ and found that the decision hyperplanes learned by the SVM with different [TeX:] $$\lambda$$ values may be different. In fact, support vectors of SVM with different [TeX:] $$\lambda$$ values may not be exactly similar, as shown in Fig. 2. We can see from Fig. 2 that points 1, 2, 3, and 4 are the support vectors of the SVM with [TeX:] $$\lambda=1,$$ and points 2, 3, and 4 are the support vectors of the SVM with [TeX:] $$\lambda=0.0001.$$ Points 2, 3, and 4 are the common support vectors of the SVM for the two different [TeX:] $$\lambda$$ values, which means that these points were more likely located closest to the classification hyperplane. Based on the above analysis, we used the weight value to represent the availability of each sample. The weight value of each sample was calculated based on the screened datasets corresponding to the different [TeX:] $$\lambda$$ values. [TeX:] $$\boldsymbol{B S}_{\lambda_{1}}, \boldsymbol{B} \boldsymbol{S}_{\lambda_{2}}, \ldots, \boldsymbol{B} \boldsymbol{S}_{\lambda_{T}}$$ are the boundary sample sets with different [TeX:] $$\lambda \text { values; } \boldsymbol{B S}=\boldsymbol{B} \boldsymbol{S}_{\lambda_{1}} \cup \boldsymbol{B} \boldsymbol{S}_{\lambda_{2}} \cup \ldots \cup \boldsymbol{B} \boldsymbol{S}_{\lambda_{T}}$$ denotes the set of boundary samples for T regularization parameters; and [TeX:] $$\left\{\left(\boldsymbol{x}_{i}, y_{i}\right) \mid\left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{B} \boldsymbol{S}, y_{i}={ }^{\prime}+1^{\prime}\right\}$$ is the set of the minority boundary samples. As some samples in the original minority class might not exist in [TeX:] $$\boldsymbol{B S}^{+},$$ we adopted a Laplace correction to adjust the weight values to prevent these samples from being selected. The weight value of each minority sample was set as
(17)[TeX:] $$W e\left(x_{i}, y_{i}\right)=\left\{\begin{array}{l} \frac{k_{i}+1}{\sum_{i=1}^{\left|B S^{+}\right|} k_{i}+\left|B S^{+}\right|}, \text {if }\left(x_{i}, y_{i}\right) \in B S^{+} \\ \frac{1}{\sum_{i=1}^{\left|B S^{+}\right|} k_{i}+\left|B S^{+}\right|}, \text {otherwise }, \end{array}\right.$$where [TeX:] $$k_{i}$$ denotes the number of boundary sample sets containing [TeX:] $$\left(\boldsymbol{x}_{i}, y_{i}\right),$$ namely
(19)[TeX:] $$I_{B S_{\lambda_{j}}}\left(\boldsymbol{x}_{i}, y_{i}\right)=\left\{\begin{array}{l} 1, \mathrm{if}\left(\boldsymbol{x}_{i}, y_{i}\right) \in \boldsymbol{B} \boldsymbol{S}_{\lambda_{j}}, \\ 0, \mathrm{if}\left(\boldsymbol{x}_{i}, y_{i}\right) \notin \boldsymbol{B} \boldsymbol{S}_{\lambda_{j}}, \end{array}\right.$$In summary, the Re-SSS-WSMOTE first obtained a series of screened datasets via safe sample screening, and employed the boundary sample sets [TeX:] $$\boldsymbol{B} \boldsymbol{S}_{\lambda_{1}}, \boldsymbol{B} \boldsymbol{S}_{\lambda_{2}}, \ldots, \boldsymbol{B} \boldsymbol{S}_{\lambda_{\mathrm{T}}}$$ to calculate the weight of each sample according to (17). Then, a minority sample was randomly selected according to the weight of the sample, and its similar sample was selected from its k-nearest neighbors according to the weight of the sample. Finally, the linear interpolation method was applied to the two selected samples to generate a synthetic sample. The Re-SSS-WSMOTE algorithm is shown as follows. Note that, if only Re-SSS-WSMOTE is used, we will use the original majority sample set [TeX:] $$\mathrm{D}^{-},$$ instead of the informative majority sample set [TeX:] $$\boldsymbol{B} \boldsymbol{S}_{\lambda^{*}}^{-},$$ as the majority sample set in Re-SSS-WSMOTE. 4. Experiments and Analysis4.1 DatasetsTo investigate the effectiveness of the Re-SSS algorithm, we chose 35 datasets from the UCI Repository (http://archive.ics.uci.edu/ml/index.php), LIBSVM datasets (https://www.csie.ntu.edu.tw/ ~cjlin/libsvmtools/datasets/), and KEEL dataset repository (http://www.keel.es/). Except for the original two-class datasets available, we also chose a part of the multiclass datasets and transformed them into two-class imbalanced datasets. For example, for movement-1 with multiple classes, the samples with “1”, “2”, and “3” class labels in the original dataset were combined as the minority class, and the samples with the other class labels were combined as the majority class. In addition, we applied min-max normalization to each dataset for learning the SVM classifiers. Table 1 shows a detailed description of each dataset. The third and fourth columns show the class constituents of the minority and majority classes in each dataset, respectively. The number of minority samples, number of majority samples, number of features, and imbalance ratio of each dataset are listed from the fifth to eighth columns, respectively. The last column presents the source of each dataset. Table 1.
4.2 Performance Evaluation MetricsFor imbalanced classification problem, the suitable metrics should not be dominated by the majority samples. In [32], the impact of class imbalance on classification performance metrics has been systematically studied. The results have shown that the metrics with no bias due to imbalance, recall, specificity, geometric mean (G-Mean), and area under curve (AUC), are the best performance metrics. As the specificity takes into account only the results on the majority class, we did not use it as a performance metric. The F-score is also commonly used for imbalanced data. Thus, we compared the different methods by using the four metrics: recall, F-score, G-Mean, and AUC. The recall measures the ratio of minority samples correctly classified as the minority class to all the minority samples. The range of the recall values is [0,1]. The higher the recall, the higher the recognition rate of the minority samples is. The F-score is the harmonic mean of the precision and recall, namely F-score=(2*recall*precision) /(recall+precision), where the precision measures the ratio of minority samples correctly classified to all the samples classified as the minority class. The F-score works well for the recognition rate of the minority samples. The G-Mean is the geometric mean of the recall and specificity with a range of [0,1]. The specificity is the actual proportion of majority samples that are correctly identified. The closer the G-Mean value is to 1, the better the classification effect is
The AUC is the area under the Receiver Operating Characteristic (ROC) curve. The range of the AUC values is [0,1], and the AUC value less than 0.5 indicates that the result is not as good as random prediction. The AUC value can well reflect the classification performance of the model. 4.3 Experimental Results and AnalysisIn this section, two experiments were performed. The first experiment involved the regularization parameters of the SVM classifiers on the original imbalanced datasets, and the second presented the results of SVM classifiers on the datasets balanced using different methods for changing the distribution of data. The parameters used in the two experiments are as follows: [TeX:] $$T=100, c=10^{-1.6}, \Delta \lambda=10^{0.04-0.04 m}, \text { and } k=5.$$ 4.3.1 Experiments on different regularization parameters of the SVM classifiers on the original imbalanced datasetsThe main purpose of this experiment was to examine the superiority of the regularization parameter [TeX:] $$\lambda^{*}$$ obtained using the Re-SSS-IS algorithm. First, we applied the Re-SSS-IS algorithm on each original imbalanced dataset, and obtained a suitable regularization parameter [TeX:] $$\lambda^{*}$$ for each dataset. Then, the SVM classifier with [TeX:] $$\lambda^{*}$$ was built directly on the original imbalanced dataset. For comparison, the SVM classifiers with the other 11 regularization parameters were also built. The experiments were performed using 5-fold cross-validation. In each fold, the original dataset was split into training and test data. The SVM classifier with the corresponding regularization parameter was built on the training data; the AUC, F-score, G-Mean, and recall of the SVM classifier on the test data was then recorded and averaged across all the splits. The experimental results are presented in Table 2. The first row lists all the regularization parameters used for the SVM classifiers in the experiments. The second to twelfth columns present the experimental results of the SVM classifiers with different regularization parameters. Each row shows the average performance metrics of the SVM classifiers with the corresponding regularization parameter on the 35 datasets. For example, the average AUC of the SVM classifier with [TeX:] $$\lambda^{*}$$ on the 35 datasets was 0.846. As the value of [TeX:] $$\lambda^{*}$$ obtained using the Re-SSS algorithm was different for each dataset, we did not list the value of [TeX:] $$\lambda^{*}$$ in Table 2. From Table 2, it can be seen that the performance metrics of the SVM classifiers were related to the values of [TeX:] $$\lambda \text { . When } \lambda$$ was larger, the SVM classifiers had poorer average performance; when [TeX:] $$\lambda$$ was smaller, the SVM classifiers obtained better average performance. This result is consistent with the discussion in Section 2.1. However, the values of[TeX:] $$\lambda$$ corresponding to the maximum AUC, Fscore, G-Mean, and recall were not the smallest, in other words, it is not true that the smaller the value of [TeX:] $$\lambda,$$ the better the performance of the SVM classifier. Moreover, each dataset obtained its maximum metrics under different [TeX:] $$\lambda$$ values; hence, an appropriate [TeX:] $$\lambda$$ value for a given dataset needs to be selected. Furthermore, the SVM classifier with [TeX:] $$\lambda^{*}$$ obtained using the Re-SSS-IS was close to the maximum average metrics. This result shows that the Re-SSS-IS algorithm can select the appropriate value for [TeX:] $$\lambda$$. Table 2.
4.3.2 Experiments on the SVM classifiers with datasets balanced using different data balancing methodsTo verify the effectiveness of the developed Re-SSS algorithm, we compared the Re-SSS algorithm with the other methods for changing the distribution of data. The methods proposed in [12,26] did not explicitly mention the adjustment of regularization parameters in SVMs. In [24,25], the SVMs were used for the preprocessing of samples by adjusting the regularization parameters in a similar manner. Aside from preprocessing of samples, feature selection was also used in [25]. However, there was no feature selection involved in our study. Thus, only the Pre-SVM algorithm in [24] was selected as one of the baseline methods in our study. The five baseline methods are described as follows: Undersampling: The original dataset is randomly under-sampled; that is, the samples are randomly extracted from the majority class, and the number of samples extracted is equal to the number of minority samples. Oversampling: The original dataset is randomly oversampled, that is, randomly extracting samples from the minority class, and adding the samples to the original dataset to balance the dataset. SMOTE: Based on the existing minority samples, an interpolation method is used to generate new minority samples, which are added to the original dataset to balance the dataset. BorderLine-SMOTE: The improved version of SMOTE uses an interpolation method to generate new small minority samples according to the existing minority class boundary samples, which are added to the original dataset to balance the dataset. Pre-SVM [24]: First, this algorithm builds the SVM models on imbalanced data, and the SVM model with the best prediction accuracy is selected and used for prediction purposes. Then, the actual target values of the training samples are replaced by the prediction of the trained SVM. Table 3.
First, four baseline methods were used to rebalance the datasets, and then, the SVM classifiers with the frequently-used regularization parameters (namely 0.1, 1) were performed on the balanced datasets using the four baseline methods. The Pre-SVM and Re-SSS adaptively chose the regularization parameters. The experimental performance evaluation metrics used were similar to those of the first experiment. The experimental AUC results are shown in Table 3, where the first row presents the method used. It can be seen that the Re-SSS method performed optimally on 18 datasets, followed by Borderline-SMOTE [TeX:] $$(\lambda=0.1 \text { ) }$$and Pre-SVM, which performed optimally on 5 datasets. From Table 3, it is clear that the Re-SSS surpassed the other methods in AUC. In addition, the average value for each oversampling method (Oversampling, SMOTE, and Borderline-SMOTE) was better than that for the undersampling. As the experimental results of the other three metrics were similar to those of the AUC, we have not included them here. The comprehensive experimental results are presented in Table 4. Table 4.
In Table 4, the first row presents the method used, and the row named “number” presents the number of datasets for which the SVM classifier with its corresponding regularization parameter obtained optimal performance for a certain evaluation metric. Note that the sum of ten numbers in each row may be greater than 35 as the SVM classifiers with different [TeX:] $$\lambda$$ values may have the same results. The row named “average” presents the average obtained by the SVM classifiers with the corresponding regularization parameters on 35 datasets for a certain evaluation metric. It can be seen from Table 4 that, in most cases, the result of [TeX:] $$\lambda=0.1$$ is better than that of [TeX:] $$\lambda=1$$ when using the same method. With the decrease in the value of [TeX:] $$\lambda,$$ the experimental performance of the dataset was more favorable for minority classes. For the four different metrics, the Re-SSS was superior to the other methods in terms of the number of datasets, in which it showed optimal performance and had the highest average value. This verifies the feasibility of the Re-SSS algorithm developed in this study for handling the imbalanced classification problem. In addition, it can be seen from Table 4 that the oversampling method performed slightly better than the undersampling method. 5. Conclusion and Future WorkWe developed a resampling Re-SSS algorithm, made up of Re-SSS-IS and Re-SSS-WSMOTE based on safe sample screening, to exploit the informative samples learned by the SVM classifier on an imbalanced data set. The Re-SSS-IS algorithm can select suitable regularization parameters and obtain informative majority samples; the Re-SSS-WSMOTE algorithm is used to generate informative minority samples for the SVM classifier. Then, two experiments were conducted to verify the effectiveness of the algorithm. Compared with the other methods, the proposed resampling method showed better performance. The proposed Re-SSS algorithm can not only discard parts of non-informative samples, but also add useful informative ones. Our future work will focus on developing an effective method for selecting hyperparameter c in the Re-SSS algorithm and exploring how to extend the Re-SSS algorithm to address multiclass imbalanced problems. BiographyHongbo Shihttps://orcid.org/0000-0002-9792-364XShe is currently a professor in the School of information, Shanxi University of Finance and Economics. She received her Ph.D. degree in School of Computer and Information Technology from Beijing Jiaotong University in 2004. Her main research interests include machine learning, data mining, sparse learning and imbalanced classification. BiographyXin Chenhttps://orcid.org/0000-0002-1949-101XHe received B.S. degree in Shanxi University of Finance and Economics in 2018. Since September 2018, he has been studying for a master degree in the School of Information from Shanxi University of Finance and Economics. His current research interests include machine learning, data mining and business intelligence. References
|