1.INTRODUCTION
Machine learning algorithms are widely used to solve various kinds of problems in the field of industrial management. We focus on the problem of predicting discrete categorical labels, which can be solved by constructing a prediction model from a set of training data consisting of vectors of attribute variables and their categorical labels. Deep neural network algorithms are widely used to solve these problems (Cortes et al., 2017). However, these algorithms have high computational costs and require a lot of training data. Therefore, ensemble algorithms that combine simple machine learning models have been applied to various problems (Opitz and Maclin, 1999) because of their low computational costs and high performance without large training datasets. In recent years, increasing attention has been focused on the stacked generalization algorithm (Wolpert, 1992), which combines ensemble methods, because of its high performance. The most popular ensemble method is Random Forests (RF) (Breiman, 2001), owing to the ease of constructing the model and its high performance. The RF method constructs decision trees (Breiman et al., 1984) randomly and independently. In order to improve RF, methods introducing a minimized loss function over the set of decision trees have been proposed and shown to be effective. Two of these methods are Alternating Decision Forests (ADFs) (Schulter et al., 2013) and Global Refinement of Random Forests (Ren et al., 2015). We focus on ADFs because they are superior to other methods in terms of computational cost and prediction accuracy.
ADFs introduce weights that represent degrees of classification accuracy for the training data. These weights are utilized to expand decision trees during the learning phase in order to improve predictability for the training data. When weights are utilized during the learning phase, the prediction accuracy of ADFs rises significantly because the weights enable the construction of globally optimal decision trees. However, ADFs have two major problems. First, the learning phase of ADFs treats all training data equally, meaning that it can be strongly affected by outliers. Second, ADFs randomly arrange attribute variables for each decision tree, and all attribute variables are treated equally. Thus, ADFs do not consider the importance of each attribute variable. The prediction accuracy of ADFs may be improved by solving these problems.
Based on the above discussion, we propose a new ADF prediction method enhanced by changing the method of assigning weights and reassigning the sets of attribute variables for certain decision trees during the learning phase to improve prediction accuracy. In order to accomplish these enhancements, our method divides the training dataset into two subsets for each decision tree. One subset is used for expanding the decision tree, while the other subset is used for introducing new indicators. One indicator represents an evaluation of whether or not each data point is an outlier. Another indicator represents the importance of a set of attribute variables for predicting new data. Our method utilizes these indicators and places an emphasis on non-outlier data and important attribute variables to improve prediction accuracy.
In order to verify the effectiveness and robustness of the proposed method, we conducted a simulation experiment with data obtained from the UCI machine learning repository (Blake and Merz, 1998).
In Section 2, the preliminaries, including basic information on decision trees and representative ensemble methods, are explained. In Section 3, our proposed method and its algorithm are discussed. To verify the effectiveness of our proposed method, we conducted a classification experiment and compared the error rates with those of conventional methods in Section 4. Finally, the discussion and conclusion are stated in Section 5.
2.PRELIMINARIES
2.1.Problem Formulation
The training data is denoted as ${\left\{\left({\mathbf{x}}_{i}, {y}_{i}\right)\right\}}_{i=1}^{N}$, where N is the number of training data samples and x_{i} is a q-dimensional input vector consisting of attribute variables. Let the set of category labels be $C=\left\{1, 2, \cdots, k, \cdots, K\right\}$. The category label corresponding to x_{i} is y_{i} ∈ C. The prediction problem is to predict the correct category label y_{N+1} when a new data sample x_{N+1} is input into a learned classification rule.
2.2.Decision Tree
A decision tree is a prediction model with a tree-like structure consisting of nodes and branches. The set of nodes in a decision tree consists of leaf nodes and decision nodes. A decision node has two or more branches and divides its allocated dataset among its child nodes based on a split function for a specific attribute variable. Leaf nodes are at the bottom of the decision tree. Each data sample is delivered from the root node to a child node through the split function. As this process is repeated, each data sample eventually arrives at a leaf node, which has a classification rule. For example, a classification rule could be the majority vote by the training data samples allocated to the leaf node. A category label for a new input data sample ${\mathbf{x}}_{N+1}$ is predicted by using the classification rule attached to the leaf node to which the data is delivered from the root node through a series of split functions.
The split function at each decision node is set such that the distribution of categories of the training data sent to its child nodes is as biased as possible. Locally optimal split functions are attached to decision nodes so that the training data samples allocated to each leaf node are concentrated in a specific category.
An example decision tree is presented in Figure 1. In this decision tree, nodes No. 1 and 3 are decision nodes, while nodes No. 2, 4, and 5 are leaf nodes.
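This routing can be illustrated with a minimal Python sketch (our own, not the paper's code). The node layout mirrors Figure 1 as described above, but the thresholds and category labels are hypothetical:

```python
# Sketch of routing a sample through a small hand-built decision tree.
# Nodes 1 and 3 are decision nodes; nodes 2, 4, and 5 are leaf nodes,
# as in Figure 1. Thresholds and labels are hypothetical.

def make_node(feature=None, threshold=None, left=None, right=None, label=None):
    """A node is a dict; leaf nodes carry a label, decision nodes a split."""
    return {"feature": feature, "threshold": threshold,
            "left": left, "right": right, "label": label}

def predict(node, x):
    """Follow split functions from the root node until a leaf node is reached."""
    while node["label"] is None:                    # decision node
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]                            # leaf: majority-vote label

leaf2 = make_node(label="A")
leaf4 = make_node(label="B")
leaf5 = make_node(label="A")
node3 = make_node(feature=1, threshold=0.5, left=leaf4, right=leaf5)
root  = make_node(feature=0, threshold=2.0, left=leaf2, right=node3)

print(predict(root, [1.0, 0.0]))   # x[0] <= 2.0 -> leaf 2 -> "A"
print(predict(root, [3.0, 0.1]))   # x[0] > 2.0, x[1] <= 0.5 -> leaf 4 -> "B"
```

Each call traces one root-to-leaf path, evaluating one split function per decision node.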
2.3.Random Forests
There is room for improvement in the generalization ability of a single decision tree. Ensemble methods for decision trees have become well known and are often effective in improving prediction accuracy in real-world problems. Many different ensemble methods have been applied to the decision tree model owing to their high compatibility. An example of predicting the category of a new input data sample using an ensemble method for decision trees is presented in Figure 2. As seen in the figure, various decision trees are created and combined. The RF (Breiman, 2001) is the most popular among ensemble methods and has relatively high performance.
The RF makes decision trees grow independently, and all decision trees are trained by different features. In order to achieve this, each decision tree is randomly assigned a portion of the attribute variables when creating the tree structure. The split functions for each decision node are then selected from the restricted subset of attribute variables. Furthermore, each decision tree learns only a subset of the training data. This method allows every decision tree to have different features, because each tree focuses on different training data and attribute variables. An ensemble of these trees can be expected to improve prediction accuracy compared to the original decision tree model.
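The two sources of randomness described above can be sketched as follows; the function name and parameter names are our own, not from the RF literature:

```python
# Sketch of one tree's randomized training view in RF: a bootstrap sample of
# the training data (with replacement) and a random subset of the attribute
# variables. Data sizes are hypothetical.
import random

def draw_tree_inputs(n_samples, n_attributes, q_sub, rng):
    """Return bootstrap indices, out-of-bag indices, and an attribute subset."""
    boot = [rng.randrange(n_samples) for _ in range(n_samples)]  # with replacement
    oob = sorted(set(range(n_samples)) - set(boot))              # data this tree never sees
    attrs = rng.sample(range(n_attributes), q_sub)               # restricted attribute subset
    return boot, oob, attrs

rng = random.Random(0)
boot, oob, attrs = draw_tree_inputs(100, 10, 3, rng)
print(len(set(boot)), len(oob), attrs)
```

Because each tree draws its own `boot` and `attrs`, no two trees see exactly the same data or the same candidate split variables, which is what gives the ensemble its diversity.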
2.4.Alternating Decision Forests
2.4.1.The Description of ADFs
ADFs aim to construct an ensemble prediction model of decision trees, such that the trees grow toward the accurate prediction of all training data and possess different features. In order to achieve this, ADFs assign a weight to each training data sample. By using these weights, ADFs can construct trees that predict all training data accurately, and each decision tree learns to focus on different training data. Similar to RF, this method assigns a portion of the attribute variables to each decision tree. ADFs are superior to other ensemble methods in terms of prediction accuracy because they achieve global optimization while maintaining randomness. Consequently, we focus on ADF models in this study.
In the following section, we explain how to create decision trees in parallel and how to predict the category of a new data sample.
2.4.2.Model Construction
In this section, we explain the conventional method for constructing an ADF consisting of T decision trees. The depth of a decision tree is denoted as d =1, 2, 3, …, D. The ensemble model for decision trees with depth d is denoted as M_{d}, and each decision tree is denoted as m_{dt} (1 ≤ t ≤ T).
Here, the “purity” of a training data sample (x_{i}, y_{i}) is defined as the ratio of training data samples with the same category y_{i} among those allocated to the same leaf node. In this method, purity is calculated for every training data sample (x_{i}, y_{i}). Let the leaf node at which the training data sample (x_{i}, y_{i}) arrived be ${s}_{{m}_{dt}}\left({\mathbf{x}}_{i}\right)$ and the purity of the data be ${g}_{{m}_{dt}}\left({\mathbf{x}}_{i}, {y}_{i}\right)$. This purity can be calculated as

$$g_{m_{dt}}\left(\mathbf{x}_i, y_i\right) = \frac{\sum_{j=1}^{N} \delta\left(y_i, y_j\right)\,\eta_{m_{dt}}\left(\mathbf{x}_i, \mathbf{x}_j\right)}{\sum_{j=1}^{N} \eta_{m_{dt}}\left(\mathbf{x}_i, \mathbf{x}_j\right)} \tag{1}$$
where δ(y_{i}, y_{j}) and ${\eta}_{{m}_{dt}}\left({\mathbf{x}}_{i}, {\mathbf{x}}_{j}\right)$ are indicator functions. δ(y_{i}, y_{j}) is 1 if the category of x_{i} is the same as that of x_{j}, and 0 otherwise. ${\eta}_{{m}_{dt}}\left({\mathbf{x}}_{i}, {\mathbf{x}}_{j}\right)$ is 1 if the leaf node at which x_{j} arrives in decision tree m_{dt} is the same as that at which x_{i} arrives, and 0 otherwise.
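As an illustration, the purity of a sample can be computed with a short Python sketch (ours, not the paper's code; the leaf assignments and labels below are hypothetical):

```python
# Sketch of the purity of a training sample in one tree: among all training
# samples that fall in the same leaf, the fraction sharing its category.
# `leaf_of` maps each sample index to the leaf node it reaches.

def purity(i, leaf_of, y):
    same_leaf = [j for j in range(len(y)) if leaf_of[j] == leaf_of[i]]
    same_cat = [j for j in same_leaf if y[j] == y[i]]
    return len(same_cat) / len(same_leaf)

leaf_of = [1, 1, 1, 2, 2]          # leaf reached by each sample in one tree
y = ["A", "A", "B", "B", "B"]      # category labels

print(purity(0, leaf_of, y))       # 2 of the 3 samples in leaf 1 share "A"
print(purity(3, leaf_of, y))       # both samples in leaf 2 share "B" -> 1.0
```

A purity of 1.0 means the leaf is completely dominated by the sample's own category.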
This method assumes that the prediction model will work well when the purity of all training data is relatively high. Therefore, ADFs let decision trees grow such that the purity of all training data will be high. In order to achieve this, ADFs introduce weights for the training data, which are calculated according to the purity of the training data. This method starts with T root nodes and allows them to grow in parallel using the weights. ADFs choose split functions such that the purity of all training data increases. After assigning split functions to all nodes at the same depth, ADFs reassign a weight value to each training data sample. By repeating the assignment of split functions and weights, this method constructs decision trees that achieve high purity for all training data. The attribute variable used for a split function is chosen from a randomly restricted subset of attribute variables so that all decision trees have different features.
The weight of training data sample (x_{i}, y_{i}) is denoted as w_{d}(x_{i}, y_{i}), and is determined using the exponential loss function (Friedman et al., 2000), the logit loss function (Freund and Schapire, 1996), or other appropriate functions. For example, if the exponential loss function is used, the weight is given as

$$w_d\left(\mathbf{x}_i, y_i\right) = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} g_{m_{dt}}\left(\mathbf{x}_i, y_i\right)\right) \tag{2}$$
This weight becomes high when the purity of the training data sample is relatively low. ADFs choose a split function using a criterion such as the Gini index or entropy. For example, when using entropy as a criterion, the split function is chosen based on information gain. The information gain ΔInfo is calculated as

$$\Delta Info = Info\left(S\right) - P\left(S_L\right)\,Info\left(S_L\right) - P\left(S_R\right)\,Info\left(S_R\right) \tag{3}$$

where

$$Info\left(S\right) = -\sum_{k=1}^{K} P\left(k \mid S\right) \log_{2} P\left(k \mid S\right) \tag{4}$$
where each term of equation (3) is defined as follows:

Info(S) : Entropy of node S.

S_{R}, S_{L}: Child nodes of node S produced by the split.

P(S_{R}), P(S_{L}): Probabilities of nodes S_{R} and S_{L} given node S (the proportions of the data at node S sent to each child).
As mentioned previously, a subset of attribute variables is randomly assigned to each decision tree. Thus, ADFs calculate the information gain of all candidate split functions formed from the chosen attribute variables. The candidate with the largest information gain is used as the split function. P(k | S) is calculated by using the weights. P(k | S) of decision tree m_{dt} is calculated as

$$P\left(k \mid S\right) = \frac{\sum_{i=1}^{N} w_d\left(\mathbf{x}_i, y_i\right)\,\delta\left(y_i, k\right)\,\gamma_{m_{dt}}\left(\mathbf{x}_i, S\right)}{\sum_{i=1}^{N} w_d\left(\mathbf{x}_i, y_i\right)\,\gamma_{m_{dt}}\left(\mathbf{x}_i, S\right)} \tag{5}$$

where δ(y_{i}, k) is 1 if y_{i} = k, and 0 otherwise,
where ${\gamma}_{{m}_{dt}}\left({\mathbf{x}}_{i},\text{\hspace{0.17em}}S\right)$ is an indicator function with a value of 1 if x_{i} arrives at leaf node S, and 0 otherwise.
This score tends to be influenced by the training data samples with large weights. The higher the score of one category, the lower the value of equation (4). Therefore, the split function that improves the purity of data with large weights tends to be selected. In other words, training a model using these weights enables accurate classification of data samples that would otherwise be difficult to predict.
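The weighted criterion can be sketched as follows. This is our own illustration under the assumption that the node probabilities P(S_L) and P(S_R) are weight fractions; the sample data is hypothetical:

```python
# Sketch of a weighted class distribution P(k|S), entropy, and information
# gain: heavily weighted samples contribute more to the class scores, so
# splits that purify them gain more information.
import math

def class_probs(weights, labels, categories):
    """Weighted fraction of each category among the samples at a node."""
    total = sum(weights)
    return {k: sum(w for w, y in zip(weights, labels) if y == k) / total
            for k in categories}

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def info_gain(parent, left, right, categories):
    """parent/left/right are (weights, labels) pairs for nodes S, S_L, S_R."""
    wp, wl, wr = sum(parent[0]), sum(left[0]), sum(right[0])
    gain = entropy(class_probs(*parent, categories))
    gain -= (wl / wp) * entropy(class_probs(*left, categories))
    gain -= (wr / wp) * entropy(class_probs(*right, categories))
    return gain

parent = ([1, 1, 1, 1], ["A", "A", "B", "B"])
left, right = ([1, 1], ["A", "A"]), ([1, 1], ["B", "B"])
print(info_gain(parent, left, right, ["A", "B"]))  # perfect split -> gain 1.0
```

Raising the weight of a hard-to-classify sample shifts `class_probs`, and with it which candidate split maximizes the gain.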
2.4.3.Prediction Phase
ADFs predict a new data sample category using the ensemble of outputs from the decision trees. Each decision tree provides a degree of certainty for the K categories. In the prediction phase, the path of a new data sample is traced from the root node, and tracing continues until a leaf node is reached. The rate of occurrence of every category in the training data is calculated in the corresponding leaf node, and these rates become the output of decision tree m_{Dt}. The kth output is denoted as ${f}_{{m}_{Dt}}\left({\mathbf{x}}_{N+1}, k\right)$ and can be calculated as

$$f_{m_{Dt}}\left(\mathbf{x}_{N+1}, k\right) = \frac{\sum_{i=1}^{N} \delta\left(y_i, k\right)\,\eta_{m_{Dt}}\left(\mathbf{x}_i, \mathbf{x}_{N+1}\right)}{\sum_{i=1}^{N} \eta_{m_{Dt}}\left(\mathbf{x}_i, \mathbf{x}_{N+1}\right)} \tag{6}$$

where δ(y_{i}, k) is 1 if y_{i} = k, and 0 otherwise.
When predicting the category of a new input data sample using the ensemble of decision trees, ADFs calculate the averages of the outputs from all decision trees for each category. The averaged output, denoted as ${F}_{D}\left(\mathbf{x}_{N+1}, k\right)$, can be derived by

$$F_D\left(\mathbf{x}_{N+1}, k\right) = \frac{1}{T}\sum_{t=1}^{T} f_{m_{Dt}}\left(\mathbf{x}_{N+1}, k\right) \tag{7}$$
The category of an input data sample x_{N+1} is predicted to maximize equation (7) in terms of k. Thus, the predicted category $\widehat{k}$ is given as

$$\widehat{k} = \underset{k \in C}{\operatorname{arg\,max}}\; F_D\left(\mathbf{x}_{N+1}, k\right) \tag{8}$$
An example of the prediction phase of ADFs is presented in Figure 3. The outputs of the decision trees are calculated by the leaf nodes corresponding to the input data samples. After calculating all the outputs, they are averaged for each category. Finally, the category with the maximum average is generated as the predicted category.
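The averaging and arg-max steps above can be sketched as follows; the per-tree leaf distributions are made up for illustration:

```python
# Sketch of the ADF prediction phase: each tree outputs per-category
# occurrence rates at the leaf the new sample reaches; the forest averages
# them over all T trees and predicts the category with the maximum average.

def predict_forest(leaf_dists, categories):
    """leaf_dists: one dict of category rates per tree (the per-tree outputs)."""
    T = len(leaf_dists)
    avg = {k: sum(d.get(k, 0.0) for d in leaf_dists) / T for k in categories}
    return max(categories, key=lambda k: avg[k]), avg

tree_outputs = [{"A": 0.7, "B": 0.3},    # hypothetical leaf distributions
                {"A": 0.4, "B": 0.6},
                {"A": 0.9, "B": 0.1}]
label, avg = predict_forest(tree_outputs, ["A", "B"])
print(label, round(avg["A"], 3))         # "A" wins with average 0.667
```

Note that the second tree alone would have predicted "B"; the ensemble average overrules it, which is exactly the effect shown in Figure 3.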
2.4.4.The Algorithm of ADFs
ADFs have multiple phases for training and prediction. In this section, we summarize the algorithm for both phases of ADFs.
[Training phase]

Step 1) Assign equal weights to all training data and set all decision trees to a depth of 1. Let d = 1.

Step 2) Assign split functions to all nodes at depth d using specific attribute variables and considering their weights.

Step 3) Calculate the weights of all training data using equation (2).

Step 4) If d + 1 becomes D, the predefined maximum depth, then the algorithm ends. If not, increment d and return to Step 2.
[Prediction phase]

Step 1) The path for a new input data sample is traced from the root node to a leaf node using the split functions. All outputs from each decision tree are calculated using equation (6).

Step 2) The averages of all outputs from the decision trees are calculated for each category.

Step 3) The category of a new input data sample is predicted using equation (8).
3.PROPOSED METHOD
In this section, we explain our proposed method for introducing new indicators for each training data sample and decision tree. Our method improves prediction accuracy by utilizing these indicators. Although the conventional method only considers prediction of the categories of the training data, our method accounts for the accurate prediction of new input data samples.
3.1.Description of the Proposed Method
First, our method divides the training dataset into two subsets for each decision tree before constructing the decision trees. One subset is used for constructing the decision tree. Thus, each decision tree has access to out-of-bag (OOB) data, which was not used for building the decision tree. In this method, each decision tree predicts the category of each OOB data sample after assigning split functions at the same depth. These prediction results are used to calculate two indicators: the data indicator and tree indicator. The data indicator is used to indicate whether or not the data sample can contribute to the prediction of other training data samples. The tree indicator is used to indicate how accurately the decision tree predicts the category of a new input data sample. Using these indicators helps improve the generalization ability of our method.
3.2.Utilizing Bootstrapped Samples for Constructing Decision Trees
Our method utilizes a bootstrap technique prior to constructing the decision trees in order to divide the training dataset into two subsets for each decision tree. Bootstrapping is a random sampling technique that is mainly used in the RF method. Each decision tree trains on a subset of the training dataset determined by the bootstrap technique. Furthermore, each decision tree is assigned training data not used for tree construction.
After introducing the bootstrap technique, the weights must be changed. ADFs construct decision trees that increase the purity of the training data, and the purity indicates the level of success in learning a training data sample. Under bootstrapping, however, each training data sample is used by only some of the decision trees. Therefore, the weights should be calculated from the average purity over the decision trees that use the training data sample. The average purity of (x_{i}, y_{i}) is denoted as ${g}_{d}^{b}\left(\mathbf{x}_i, y_i\right)$ and can be derived using

$$g_d^b\left(\mathbf{x}_i, y_i\right) = \frac{1}{\left|U_i\right|}\sum_{m_{dt} \in U_i} g_{m_{dt}}\left(\mathbf{x}_i, y_i\right) \tag{9}$$

where $U_i$ is the set of decision trees that use the training data sample $(\mathbf{x}_i, y_i)$ for construction, i.e., $U_i = \left\{m_{dt} : \left(\mathbf{x}_i, y_i\right) \in A_t\right\}$; $A_t$ is the set of training data used for constructing decision tree $m_{dt}$; and $\left|\cdot\right|$ denotes the number of elements in a set.

By using the average purity, the new weight of (x_{i}, y_{i}), denoted as ${w}_{d}^{b}\left(\mathbf{x}_i, y_i\right)$, is calculated as

$$w_d^b\left(\mathbf{x}_i, y_i\right) = \exp\left(-g_d^b\left(\mathbf{x}_i, y_i\right)\right) \tag{10}$$
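A sketch of this bootstrap-averaged weighting; the tree membership and per-tree purities below are hypothetical:

```python
# Sketch of the bootstrap-averaged purity and weight: for one sample, purity
# is averaged only over the trees whose bootstrap sample contains it, then
# passed through the exponential loss.
import math

def bootstrap_weight(in_tree, purities):
    """in_tree[t]: True if the sample is in tree t's bootstrap sample A_t;
    purities[t]: the sample's purity in tree t (ignored when not used)."""
    used = [t for t, flag in enumerate(in_tree) if flag]      # the set U_i
    avg_purity = sum(purities[t] for t in used) / len(used)   # average purity
    return math.exp(-avg_purity)                              # exponential loss

in_tree = [True, False, True]      # sample used by trees 0 and 2 only
purities = [0.8, 0.0, 0.6]         # per-tree purity of this sample
print(round(bootstrap_weight(in_tree, purities), 4))
```

Tree 1 holds the sample out, so its (meaningless) purity entry never enters the average; only trees 0 and 2 contribute, giving exp(-0.7).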
3.3.Changing the Weighting Method for the Training Data
As mentioned previously, the conventional ADF method assigns weights to each training data sample and attempts to accurately predict the categories of all training data samples. In other words, this method treats all training data equally. However, if there are outliers in the training dataset and certain decision trees focus on these outliers, the decision trees will tend to overfit the outliers and the generalization ability of the entire model will be negatively impacted. Our method changes the way in which weights are assigned to the training data to prevent overfitting. In order to achieve this, the data indicator is introduced. This indicator represents the degree to which a data sample can contribute to the prediction of the category of another sample. The data indicator is utilized for every data sample. It is derived from the average error probability of the data sample in the decision trees that treat it as OOB data, and it is used when assigning weights to the training data. Here, the average error probability of training data sample (x_{i}, y_{i}) is denoted as g^{o}(x_{i}, y_{i}). It is calculated using

$$g^o\left(\mathbf{x}_i, y_i\right) = \frac{1}{\left|O_i\right|}\sum_{m_{dt} \in O_i}\left(1 - f_{m_{dt}}\left(\mathbf{x}_i, y_i\right)\right) \tag{11}$$

where O_{i} is the set of decision trees that do not use the training data sample (x_{i}, y_{i}) for construction, and ${f}_{{m}_{dt}}\left(\mathbf{x}_i, y_i\right)$ is the certainty that decision tree m_{dt} assigns to the true category y_{i}, analogous to the output in equation (6) evaluated at depth d.
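A sketch of this average error probability; the per-tree certainties for the true category are hypothetical:

```python
# Sketch of the average OOB error probability of one sample: over the trees
# that hold the sample out, average one minus the certainty each such tree
# assigns to the sample's true category.

def oob_error_probability(in_tree, certainty_true_label):
    """in_tree[t]: True if the sample is in tree t's bootstrap sample;
    certainty_true_label[t]: tree t's certainty for the true category."""
    oob = [t for t, used in enumerate(in_tree) if not used]   # the set O_i
    return sum(1.0 - certainty_true_label[t] for t in oob) / len(oob)

in_tree = [True, False, False]     # trees 1 and 2 treat the sample as OOB
certainty = [0.9, 0.6, 0.8]        # per-tree certainty for the true category
print(oob_error_probability(in_tree, certainty))  # ((1-0.6)+(1-0.8))/2 = 0.3
```

A value near 1 means trees that never saw the sample consistently misjudge it, which is the signature of an outlier.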
The average error probability represents the difficulty of predicting the category of a data sample when using a model constructed from other training data. If this value is large, there is a possibility that the training data sample is an outlier. Here, the data indicator for (x_{i}, y_{i}) is denoted as N_{d}(x_{i}, y_{i}). It is calculated as

$$N_d\left(\mathbf{x}_i, y_i\right) = \exp\left(g^o\left(\mathbf{x}_i, y_i\right)\right) \tag{12}$$
This indicator has a large value when g^{o}(x_{i}, y_{i}) is large. ${w}_{d}^{b}\left(\mathbf{x}_i, y_i\right)$ has a large value when the average purity of (x_{i}, y_{i}) in the decision trees using the data sample is small. Therefore, this method emphasizes data samples whose ${w}_{d}^{b}\left(\mathbf{x}_i, y_i\right)$ and data indicator are large. This method assigns a weight to each training data sample by using ${w}_{d}^{b}\left(\mathbf{x}_i, y_i\right)$ and the data indicator. Furthermore, it introduces a tuning parameter that changes the degree to which the two quantities are mixed for each decision tree. If the exponential loss is used, the weight in decision tree m_{dt} can be expressed as
where C_{dt} is a tuning parameter for decision tree m_{dt}.
Then, this method sets C_{dt} by using PUR_{dt} and OOBP_{dt} as shown in
where B_{t} is the set of training data samples which are not used for constructing decision tree m_{dt}.
Here, PUR_{dt} is the average purity of the data samples used when constructing decision tree m_{dt}. OOBP_{dt} is the average purity of the data samples not used when constructing decision tree m_{dt}. By using C_{dt}, decision trees with low generalization ability mainly focus on non-outlier data samples, and decision trees with high generalization ability mainly focus on the data samples with low purity. Therefore, this method effectively improves purity and generalization ability while allowing each decision tree to have different features.
3.4.Reassigning Attribute Variables Assigned to Decision Trees with Large OOB Errors
Similar to Section 3.3, this method utilizes the results of category prediction on the OOB data. Specifically, this method calculates the error rate of category prediction on the OOB data for each decision tree. These error rates are then introduced as a new indicator, the tree indicator. The error rate of decision tree m_{dt}, denoted by OOBE_{dt}, can be expressed as

$$OOBE_{dt} = \frac{1}{\left|B_t\right|}\sum_{\left(\mathbf{x}_i, y_i\right) \in B_t}\left(1 - \delta\left(\widehat{y}_{m_{dt}}\left(\mathbf{x}_i\right), y_i\right)\right) \tag{17}$$

where $\widehat{y}_{m_{dt}}\left(\mathbf{x}_i\right)$ is the category predicted for $\mathbf{x}_i$ by decision tree $m_{dt}$ alone, and $B_t$ is the set of training data samples not used for constructing $m_{dt}$.
This tree indicator represents the decision tree’s error rate for new data samples. If OOBE_{dt} is high, the generalization ability of the decision tree is low. This method effectively addresses the problem of low generalization ability. In order to accomplish this, it exploits the fact that the selection of attribute variables affects prediction accuracy (Genuer et al., 2010). As mentioned before, each decision tree is allocated only a subset of attribute variables so that all decision trees have different features. Therefore, we can utilize the tree indicator as a measure of the importance of attribute variables from the perspective of improving generalization ability in the learning phase. Specifically, this method changes the attribute variables allocated to decision trees with a higher OOBE_{dt} when assigning split functions. For example, the attribute variables of decision trees whose OOBE_{dt} scores fall within the top β% may be changed. When reallocating attribute variables to a decision tree, this method selects attribute variables that have not yet been allocated to that tree. Thus, each decision tree is given different features and the ensemble model can consider many different interactions between attribute variables. Consequently, this method achieves excellent diversity between trees while considering the importance of various attribute variables. All of these advantages lead to improved prediction accuracy.
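A sketch of the tree indicator and the reassignment rule; selecting the trees whose OOB error falls in the top β percent is one plausible reading of the text, and the error values are hypothetical:

```python
# Sketch of the tree indicator (OOB error rate) and of picking the trees
# whose attribute subsets should be reassigned.

def oob_error_rate(predictions, truths):
    """Fraction of a tree's OOB samples it misclassifies."""
    return sum(p != t for p, t in zip(predictions, truths)) / len(truths)

def trees_to_reassign(oob_errors, beta):
    """Indices of the (roughly) top beta-percent trees by OOB error rate."""
    n = max(1, round(len(oob_errors) * beta / 100))
    return sorted(range(len(oob_errors)),
                  key=lambda t: oob_errors[t], reverse=True)[:n]

errors = [0.10, 0.35, 0.20, 0.05]          # OOBE for T = 4 trees
print(trees_to_reassign(errors, 25))       # top 25% -> tree 1 only
```

The selected trees would then draw a fresh attribute subset from the variables not yet allocated to them before the next depth is grown.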
3.5.Algorithm for the Proposed Method
The prediction phase is the same as in conventional ADFs. Thus, we summarize only the steps of the training phase of the proposed method in this section.
[Training phase]

Step 1) Assign a subset of the training dataset using the bootstrap technique and allocate a portion of the attribute variables to each decision tree.

Step 2) Assign equal weights to all training data samples and make trees of depth 1. Let d = 1.

Step 3) Assign split functions to all nodes at depth d by using the specific attribute variables based on their weights.

Step 4) Calculate new weights for all training data samples using equation (13).

Step 5) Reassign the set of attribute variables to decision trees whose OOBE_{dt} are relatively high.

Step 6) If d +1 becomes D, the predefined maximum depth, the algorithm ends. If not, increment d and return to Step 3.
4.EXPERIMENT
4.1.Experimental Conditions
In order to verify the effectiveness of our proposed method, we conducted simulation experiments using the UCI machine learning repository (Blake and Merz, 1998). In this experiment, we compared the prediction accuracy of conventional ADFs to that of the proposed method. We measured prediction accuracy by the error rate, given by

$$Error\ rate = \frac{err}{test} \tag{18}$$
where err is the number of test data samples predicted incorrectly and test is the total number of test data samples.
The simulation was executed 50 times. The test data and training data were randomized for each simulation and the results were averaged. We used four datasets from the UCI machine learning repository for the simulations.
Detailed simulation information is presented in Table 1.
Prior to comparison, we performed preliminary experiments to determine the appropriate simulation parameters. The parameter β was determined by performing cross-validation on the training data in each dataset. The remaining parameters were set to be larger by specific margins. This is because increasing the parameter values monotonically improved prediction accuracy up to a specific threshold without causing overfitting. The parameters used are listed in Table 2.
4.2.Experiments Results for Prediction Accuracy
The results of the error rate comparison are presented in Table 3.
As seen in Table 3, the proposed method is superior to conventional ADFs and RF except on the musk dataset. In particular, the effectiveness of the proposed method is remarkable for the “CNAE9” and “LSVT” datasets. This is because the total number of attribute variables in these datasets is relatively large, while many of the variables are unimportant for category prediction. The proposed method can utilize important attribute variables many times and eliminate unimportant ones. In recent years, the number of attribute variables and the size of datasets have increased owing to the plummeting costs of information storage. Thus, it can be said that the proposed method will only become more effective for real-world problems in the future.
Moreover, the proposed method outperforms ADFs and RF when the number of attribute variables is small, e.g., for the hill, diabetic, ionosphere, index, and vehicle datasets. This shows that the proposed method performs well even if the number of attribute variables is small.
In contrast, the proposed method performed poorly on the musk dataset, possibly because this dataset contains few outliers, as suggested by the extremely low error rates of all methods. In this case, the data indicator, which is designed to prevent overfitting to outliers, appears to have decreased prediction accuracy instead.
However, datasets with so few outliers may be rare. Therefore, this method is effective when applied to real-world problems because it can reduce the influence of outliers.
4.3.Experiment to Determine Robustness of the Proposed Method
In order to verify the robustness of the proposed method, its generalization ability was compared to that of the conventional method through an experiment with a dataset containing artificially added outliers. The dataset “ionosphere” was used as the basis for this experiment. In order to add outliers to the training data, the categories of some data samples were changed randomly. In this experiment, the error rates of the proposed and conventional methods were compared. The categories of training data samples were randomly changed 50 times to create different rates of outliers. All other conditions were the same as in the previous experiments. The results of the experiment are shown in Figure 4.
As seen in Figure 4, when the number of outliers is small, the error rates of the two methods are similar. As the number of outliers increases, the error rate of conventional ADFs becomes greater than that of the proposed method. Therefore, it can be said that the proposed method is robust because reallocating attribute variables improved prediction accuracy, and the new weighting method prevented overfitting. From these results, we can say that the proposed method is superior to the conventional method when using a dataset with many outliers.
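The outlier-injection step used in this experiment can be sketched as follows; the category set and flip rate are illustrative, not the experiment's actual values:

```python
# Sketch of injecting label-noise outliers: a given fraction of training
# labels is flipped to a different, randomly chosen category.
import random

def inject_outliers(labels, categories, rate, rng):
    """Return a copy of `labels` with round(len * rate) entries flipped."""
    labels = list(labels)
    flip = rng.sample(range(len(labels)), int(len(labels) * rate))
    for i in flip:
        labels[i] = rng.choice([c for c in categories if c != labels[i]])
    return labels

rng = random.Random(0)
noisy = inject_outliers(["A"] * 100, ["A", "B"], 0.1, rng)
print(sum(l == "B" for l in noisy))  # exactly 10 labels flipped to "B"
```

Varying `rate` reproduces the different outlier levels on the horizontal axis of a plot like Figure 4.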
5.CONCLUSIONS AND FUTURE WORK
We proposed a new decision tree method that introduces new indicators based on ADFs. The proposed method changes the technique for calculating weights and uses the weights to reallocate sets of attribute variables. The results of simulation experiments showed that our proposed method is superior to conventional ADFs.
In the future, we plan to research additional ways to utilize the two indicators and improve prediction accuracy. Additionally, a method to automatically determine the value of parameter β is needed.