ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.16 No.3 pp.384-391
DOI : https://doi.org/10.7232/iems.2017.16.3.384

Adaptive Prediction Method Based on Alternating Decision Forests with Considerations for Generalization Ability

Shotaro Misawa*, Kenta Mikawa, Masayuki Goto
Graduate School of Creative Science and Engineering, Waseda University, Japan
School of Engineering, Shonan Institute of Technology, Japan
School of Creative Science and Engineering, Waseda University, Japan
Corresponding Author, misawa@fuji.waseda.jp
Received: September 1, 2016; Revised: March 6, 2017; Accepted: August 3, 2017

ABSTRACT

Many machine learning algorithms have been proposed and applied to a wide range of prediction problems in the field of industrial management. Lately, the amount of available data has been increasing, so machine learning algorithms with low computational costs, such as efficient ensemble methods, are needed. The Alternating Decision Forest (ADF) is an efficient ensemble method known for its high performance and low computational costs. ADFs introduce weights representing the degree of prediction accuracy for each training data sample and randomly select attribute variables for each node. This method can effectively construct an ensemble model that predicts the training data accurately while allowing each decision tree to retain different features. However, outliers can cause overfitting, and because the candidate branch conditions vary from node to node in ADFs, prediction accuracy may deteriorate when the fit to the training data is constrained too strongly. In order to improve prediction accuracy, we focus on the prediction results for new data. That is to say, we introduce bootstrap sampling so that the algorithm can generate an out-of-bag (OOB) dataset for each tree in the training phase. Additionally, we construct an effective ensemble of decision trees with improved generalization ability by considering the prediction accuracy for the OOB data. To verify the effectiveness of the proposed method, we conduct simulation experiments using datasets from the UCI machine learning repository. The proposed method provides robust and accurate predictions for datasets with many attribute variables.




    1.INTRODUCTION

    Machine learning algorithms are widely used to solve various kinds of problems in the field of industrial management. We focus on the problem of predicting discrete categorical labels, which can be solved by constructing a prediction model from a set of training data consisting of vectors of attribute variables and their categorical labels. In order to solve such problems, deep neural network algorithms are widely used (Cortes et al., 2017). However, these algorithms have high computational costs and require a lot of training data. Therefore, ensemble algorithms that combine simple machine learning models have been applied to various problems (Opitz and Maclin, 1999) because of their low computational costs and high performance even without large training datasets. In recent years, increasing attention has been focused on the stacked generalization algorithm (Wolpert, 1992), which combines ensemble methods, because of its high performance. A major ensemble method is Random Forests (RF) (Breiman, 2001), owing to the ease of constructing the model and its high performance. The RF method constructs decision trees (Breiman et al., 1984) randomly and independently. In order to improve RF, methods that introduce a minimized loss function over the set of decision trees have been proposed and shown to be effective. Two of these methods are Alternating Decision Forests (ADFs) (Schulter et al., 2013) and Global Refinement of Random Forests (Ren et al., 2015). We focus on ADFs because they are superior to other methods in terms of computational cost and prediction accuracy.

    ADFs introduce weights that represent degrees of classification accuracy for the training data. These weights are utilized to expand decision trees during the learning phase in order to improve predictability for the training data. When the weights are utilized during the learning phase, the prediction accuracy of ADFs rises significantly because the weights enable the construction of globally optimized decision trees. However, ADFs have two major problems. First, the learning phase of ADFs treats all training data equally, meaning that it can be strongly affected by outliers. Second, ADFs randomly assign attribute variables to each decision tree and treat all attribute variables equally. Thus, ADFs do not consider the importance of each attribute variable. The prediction accuracy of ADFs may be improved by solving these problems.

    Based on the above discussion, we propose a new ADF-based prediction method that changes how weights are assigned and reassigns the sets of attribute variables for certain decision trees during the learning phase in order to improve prediction accuracy. To accomplish these enhancements, our method divides the training dataset into two subsets for each decision tree. One subset is used for expanding the decision tree, while the other is used for introducing new indicators. One indicator represents an evaluation of whether or not each data sample is an outlier. The other indicator represents the importance of a set of attribute variables for predicting new data. Our method utilizes these indicators and places an emphasis on non-outlier data and important attribute variables to improve prediction accuracy.

    In order to verify the effectiveness and robustness of the proposed method, we conducted a simulation experiment with data obtained from the UCI machine learning repository (Blake and Merz, 1998).

    In Section 2, the preliminaries, including basic information on decision trees and representative ensemble methods, are explained. In Section 3, our proposed method and its algorithm are discussed. To verify the effectiveness of our proposed method, we conducted a classification experiment and compared the error rates with those of conventional methods in Section 4. Finally, the discussion and conclusion are stated in Section 5.

    2.PRELIMINARIES

    2.1.Problem Formulation

    The training data is denoted as $\{(x_i, y_i)\}_{i=1}^{N}$, where N is the number of training data samples and $x_i$ is a q-dimensional input vector consisting of attribute variables. Let the set of category labels be $C = \{1, 2, \dots, k, \dots, K\}$. The category label corresponding to $x_i$ is $y_i \in C$. The prediction problem is to predict the correct category label $y_{N+1}$ when a new data sample $x_{N+1}$ is input into a learned classification rule.
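    To make this notation concrete, the following minimal sketch shows one possible in-memory representation of such a dataset in Python with NumPy; the array shapes and the randomly generated values are purely illustrative and are not taken from the paper.

```python
import numpy as np

# Illustrative problem setup: N training samples, each a q-dimensional
# attribute vector x_i with a categorical label y_i drawn from C = {1, ..., K}.
N, q, K = 100, 5, 3
rng = np.random.default_rng(0)

X_train = rng.normal(size=(N, q))          # attribute vectors x_1, ..., x_N
y_train = rng.integers(1, K + 1, size=N)   # category labels y_i in {1, ..., K}

x_new = rng.normal(size=q)                 # a new sample x_{N+1} whose label
                                           # y_{N+1} is to be predicted
```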

    2.2.Decision Tree

    A decision tree is a prediction model with a tree-like structure consisting of nodes and branches. The set of nodes in a decision tree consists of leaf nodes and decision nodes. A decision node has two or more branches and divides its allocated dataset among its child nodes based on a split function for a specific attribute variable. Leaf nodes are at the bottom of the decision tree. Each data sample is delivered from the root node to a child node through the split function. As this process is repeated, each data sample eventually arrives at a leaf node, which has a classification rule. For example, a classification rule could be a majority vote over the training data samples allocated to the leaf node. A category label for a new input data sample $x_{N+1}$ is predicted by using the classification rule attached to the leaf node to which the data sample is delivered from the root node through a series of split functions.

    The split function at each decision node is set such that the distribution of categories of the training data sent to its child nodes is as biased as possible. Locally optimal split functions are attached to decision nodes in order to increase the proportion of training data with a specific category that is allocated to each leaf node.
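    The routing of a sample from the root to a leaf can be sketched as follows. The node structure, thresholds, and labels below are hypothetical and only mirror the layout of Figure 1 (nodes 1 and 3 as decision nodes; nodes 2, 4, and 5 as leaves); they are not part of the original paper.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    """A node of a binary decision tree (illustrative structure only)."""
    feature: Optional[int] = None      # attribute index tested by the split function
    threshold: Optional[float] = None  # split threshold for that attribute
    left: Optional["Node"] = None      # child for x[feature] <= threshold
    right: Optional["Node"] = None     # child for x[feature] > threshold
    label: Optional[int] = None        # majority-vote label if this is a leaf

def predict_one(node: Node, x: np.ndarray) -> int:
    """Deliver a sample from the root to a leaf through the split functions."""
    while node.label is None:                      # decision node
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label                              # classification rule at the leaf

# Tiny hand-built tree matching the shape of Figure 1 (hypothetical splits):
tree = Node(feature=0, threshold=0.5,
            left=Node(label=1),                                  # node 2 (leaf)
            right=Node(feature=1, threshold=-1.0,                # node 3
                       left=Node(label=2),                       # node 4 (leaf)
                       right=Node(label=3)))                     # node 5 (leaf)

print(predict_one(tree, np.array([0.2, 0.0])))   # -> 1
```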

    An example decision tree is presented in Figure 1. In this decision tree, nodes No. 1 and 3 are decision nodes, while nodes No. 2, 4, and 5 are leaf nodes.

    2.3.Random Forests

    There is room for improvement in the generalization ability of a single decision tree. Ensemble methods for decision trees have become well known and are often effective in improving prediction accuracy in real-world problems. Many different ensemble methods have been applied to the decision tree model owing to their high compatibility. An example of predicting the category of a new input data sample using an ensemble of decision trees is presented in Figure 2. As seen in the figure, various decision trees are created and combined. RF (Breiman, 2001) is the most popular ensemble method and has relatively high performance.

    The RF method grows decision trees independently, and all decision trees are trained on different features. In order to achieve this, each decision tree is randomly assigned a portion of the attribute variables when creating the tree structure. The split function for each decision node is then selected from the restricted subset of attribute variables. Furthermore, each decision tree learns only a subset of the training data. This approach allows every decision tree to have different features, because each tree focuses on different training data and attribute variables. An ensemble of these trees can be expected to improve prediction accuracy compared to a single decision tree.
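    A minimal RF-style sketch of this procedure is shown below, assuming scikit-learn is available for the individual trees. To match the per-tree attribute restriction described here, each tree is trained on one fixed random column subset, whereas library implementations of RF usually re-sample attributes at every node; all data in the example are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N, q, K, T = 200, 10, 3, 25
X = rng.normal(size=(N, q))
y = rng.integers(0, K, size=N)        # labels 0..K-1 here for convenience

trees, feature_subsets = [], []
n_feats = max(1, int(np.sqrt(q)))     # common heuristic for the subset size
for t in range(T):
    boot = rng.integers(0, N, size=N)                 # bootstrap sample
    feats = rng.choice(q, size=n_feats, replace=False)  # per-tree attribute subset
    clf = DecisionTreeClassifier(random_state=t).fit(X[boot][:, feats], y[boot])
    trees.append(clf)
    feature_subsets.append(feats)

def rf_predict(x):
    """Majority vote over the independently trained trees."""
    votes = [clf.predict(x[feats].reshape(1, -1))[0]
             for clf, feats in zip(trees, feature_subsets)]
    return np.bincount(votes, minlength=K).argmax()

print(rf_predict(rng.normal(size=q)))
```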

    2.4.Alternating Decision Forests

    2.4.1.The Description of ADFs

    ADFs aim to construct an ensemble prediction model of decision trees, such that the trees grow toward the accurate prediction of all training data and possess different features. In order to achieve this, ADFs assign a weight to each training data sample. By using these weights, ADFs can construct trees that predict all training data accurately, and each decision tree learns to focus on different training data. Similar to RF, this method assigns a portion of the attribute variables to each decision tree. ADFs are superior to other ensemble methods in terms of prediction accuracy because they achieve global optimization while maintaining randomness. Consequently, we focus on ADF models in this study.

    In the following section, we explain how to create decision trees in parallel and how to predict the category of a new data sample.

    2.4.2.Model Construction

    In this section, we explain the conventional method for constructing an ADF consisting of T decision trees. The depth of a decision tree is denoted as $d = 1, 2, 3, \dots, D$. The ensemble model of decision trees with depth d is denoted as $M_d$, and each decision tree is denoted as $m_{dt}$ $(1 \le t \le T)$.

    Here, the “purity” of a training data sample $(x_i, y_i)$ is defined as the ratio of training data samples with the same category $y_i$ among those allocated to the same leaf node. In this method, the purity is calculated for every training data sample $(x_i, y_i)$. Let the leaf node at which the training data sample $(x_i, y_i)$ arrives be $s_{m_{dt}}(x_i)$ and the purity of the data sample be $g_{m_{dt}}(x_i, y_i)$. This purity can be calculated as

    $$g_{m_{dt}}(x_i, y_i) = P(y_i \mid s_{m_{dt}}(x_i)) = \frac{\sum_{j=1}^{N} \delta(y_i, y_j) \times \eta_{m_{dt}}(x_i, x_j)}{\sum_{j=1}^{N} \eta_{m_{dt}}(x_i, x_j)} \tag{1}$$

    where $\delta(y_i, y_j)$ and $\eta_{m_{dt}}(x_i, x_j)$ are indicator functions. $\delta(y_i, y_j)$ is 1 if the category of $x_i$ is the same as that of $x_j$, and 0 otherwise. $\eta_{m_{dt}}(x_i, x_j)$ is 1 if the leaf node at which $x_j$ arrives in decision tree $m_{dt}$ is the same as that at which $x_i$ arrives, and 0 otherwise.
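    A small sketch of the purity computation of equation (1) is given below; `leaf_ids` plays the role of the leaf assignment $s_{m_{dt}}(x_i)$ and is supplied directly rather than produced by a real tree.

```python
import numpy as np

def purity(leaf_ids: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Purity g_{m_dt}(x_i, y_i) of every training sample under one tree,
    following equation (1): the fraction of samples in sample i's leaf that
    share its label.  leaf_ids[i] is the leaf that x_i reaches in the tree."""
    g = np.empty(len(y), dtype=float)
    for i in range(len(y)):
        same_leaf = leaf_ids == leaf_ids[i]          # eta = 1 for these j
        g[i] = np.mean(y[same_leaf] == y[i])         # delta-weighted fraction
    return g

# Toy example: two leaves, labels mostly consistent within each leaf.
leaf_ids = np.array([0, 0, 0, 1, 1, 1])
y        = np.array([1, 1, 2, 3, 3, 3])
print(purity(leaf_ids, y))   # approx. [0.67 0.67 0.33 1. 1. 1.]
```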

    This method assumes that the prediction model will work well when the purity of all training data is relatively high. Therefore, ADFs let decision trees grow such that the purity of all training data becomes high. In order to achieve this, ADFs introduce weights for the training data, which are calculated according to the purity of the training data. This method starts with T root nodes and allows them to grow in parallel using the weights. ADFs choose split functions such that the purity of all training data increases. After assigning split functions to all nodes at the same depth, ADFs reassign a weight value to each training data sample. By repeating the assignment of split functions and weights, this method constructs decision trees that achieve high purity for all training data. The attribute variable used for a split function is chosen from a randomly restricted subset of attribute variables so that all decision trees have different features.

    The weight of training data sample $(x_i, y_i)$ is denoted as $w_d(x_i, y_i)$, and is determined using the exponential loss function (Friedman et al., 2000), the logit function (Freund and Schapire, 1996), or other appropriate functions. For example, if the exponential loss function is used, the weight is given as

    $$w_d(x_i, y_i) = \exp\left\{ -\frac{1}{T} \sum_{t=1}^{T} g_{m_{dt}}(x_i, y_i) \right\} \tag{2}$$

    This weight becomes high when the purity of the training data sample is relatively low. ADFs choose a split function using a criterion such as the Gini index or entropy. For example, when using entropy as a criterion, it chooses the split function based on information gain. Information gain ΔInfo is calculated as

    $$\Delta Info = Info(S) - P(S_R) \times Info(S_R) - P(S_L) \times Info(S_L) \tag{3}$$

    where

    $$Info(S) = -\sum_{k=1}^{K} P(k \mid S) \log P(k \mid S) \tag{4}$$

    where each term of equation (3) is defined as follows:

    • $Info(S)$: Entropy of node S.

    • $S_R$, $S_L$: The child nodes produced by splitting node S.

    • $P(S_R)$, $P(S_L)$: The proportions of the data at node S that are sent to $S_R$ and $S_L$, respectively.

    As mentioned previously, a subset of attribute variables is randomly assigned to each decision tree. Thus, ADFs calculate the information gain of all candidate split functions defined over the chosen attribute variables, and the candidate with the largest information gain is used as the split function. $P(k \mid S)$ is calculated by using the weights. For decision tree $m_{dt}$, $P(k \mid S)$ is calculated as

    $$P(k \mid S) = \frac{\sum_{i=1}^{N} w_d(x_i, y_i) \times \delta(y_i, k) \times \gamma_{m_{dt}}(x_i, S)}{\sum_{i=1}^{N} w_d(x_i, y_i) \times \gamma_{m_{dt}}(x_i, S)} \tag{5}$$

    where $\gamma_{m_{dt}}(x_i, S)$ is an indicator function with a value of 1 if $x_i$ arrives at leaf node S, and 0 otherwise.

    This score tends to be influenced by the training data samples with large weights. The higher the score of one category, the lower the value of equation (4). Therefore, the split function that improves the purity of the data with large weights tends to be selected. In other words, training a model with these weights allows data samples that would otherwise be difficult to classify to be predicted accurately.
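    The weighted split-selection criterion of equations (3)-(5) can be sketched as follows; the candidate split is represented simply as a boolean mask, and the toy data at the end are made up for illustration (a base-2 logarithm is used for the entropy).

```python
import numpy as np

def weighted_entropy(y: np.ndarray, w: np.ndarray, K: int) -> float:
    """Info(S) of equation (4), with class probabilities P(k|S) computed from
    the sample weights as in equation (5)."""
    p = np.array([w[y == k].sum() for k in range(1, K + 1)]) / w.sum()
    p = p[p > 0]                                  # avoid log(0)
    return -(p * np.log2(p)).sum()

def information_gain(y: np.ndarray, w: np.ndarray, go_left: np.ndarray, K: int) -> float:
    """Delta Info of equation (3) for one candidate split; go_left is a
    boolean mask saying which samples the split sends to the left child."""
    wL, wR = w[go_left].sum(), w[~go_left].sum()
    pL, pR = wL / w.sum(), wR / w.sum()           # P(S_L), P(S_R)
    return (weighted_entropy(y, w, K)
            - pL * weighted_entropy(y[go_left], w[go_left], K)
            - pR * weighted_entropy(y[~go_left], w[~go_left], K))

# Toy check: a split that separates the two classes perfectly has maximal gain.
y = np.array([1, 1, 2, 2])
w = np.ones(4)
print(information_gain(y, w, np.array([True, True, False, False]), K=2))  # -> 1.0
```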

    2.4.3.Prediction Phase

    ADFs predict the category of a new data sample using the ensemble of outputs from the decision trees. Each decision tree provides a degree of certainty for each of the K categories. In the prediction phase, the path of a new data sample is traced from the root node until a leaf node is reached. The rate of occurrence of every category among the training data allocated to that leaf node is calculated, and these rates become the output of decision tree $m_{Dt}$. The k-th output is denoted as $f_{m_{Dt}}(x_{N+1}, k)$ and can be calculated as

    $$f_{m_{Dt}}(x_{N+1}, k) = P(k \mid s_{m_{Dt}}(x_{N+1})) = \frac{\sum_{i=1}^{N} \delta(y_i, k) \times \eta_{m_{Dt}}(x_{N+1}, x_i)}{\sum_{i=1}^{N} \eta_{m_{Dt}}(x_{N+1}, x_i)} \tag{6}$$

    When predicting the category of a new input data sample using the ensemble of decision trees, ADFs calculate the average of the outputs from all decision trees for each category. The averaged output, denoted as $F_D(x_{N+1}, k)$, can be derived by

    $$F_D(x_{N+1}, k) = \frac{1}{T} \sum_{t=1}^{T} f_{m_{Dt}}(x_{N+1}, k) \tag{7}$$

    The category of an input data sample $x_{N+1}$ is predicted so as to maximize equation (7) with respect to k. Thus, the predicted category $\hat{k}$ is given as

    $$\hat{k} = \arg\max_{k} F_D(x_{N+1}, k) \tag{8}$$

    An example of the prediction phase of ADFs is presented in Figure 3. The outputs of the decision trees are calculated by the leaf nodes corresponding to the input data samples. After calculating all the outputs, they are averaged for each category. Finally, the category with the maximum average is generated as the predicted category.
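    A minimal sketch of this aggregation step (equations (6)-(8)) is shown below; the per-tree certainty scores in the example are made-up values rather than outputs of real trees.

```python
import numpy as np

def adf_predict(per_tree_outputs: np.ndarray) -> int:
    """Combine the per-tree certainty scores f_{m_Dt}(x, k) of equation (6):
    average them over the T trees (equation (7)) and return the arg max over
    categories (equation (8)).  per_tree_outputs has shape (T, K)."""
    F = per_tree_outputs.mean(axis=0)        # F_D(x, k) for k = 1, ..., K
    return int(F.argmax()) + 1               # +1 because labels are 1-based

# Toy example with T = 3 trees and K = 3 categories (made-up leaf frequencies):
outputs = np.array([[0.7, 0.2, 0.1],
                    [0.5, 0.4, 0.1],
                    [0.2, 0.6, 0.2]])
print(adf_predict(outputs))   # -> 1 (average scores: 0.467, 0.400, 0.133)
```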

    2.4.4.The Algorithm of the ADFs

    ADFs have separate training and prediction phases. In this section, we summarize the algorithm for both phases; a brief illustrative sketch of the training loop follows the steps below.

    [Training phase]

    • Step 1) Assign equal weights to all training data and set all decision trees to a depth of 1. Let d = 1.

    • Step 2) Assign split functions to all nodes at depth d using specific attribute variables and considering their weights.

    • Step 3) Calculate the weights of all training data using equation (2).

    • Step 4) If d +1 becomes D, the predefined maximum depth, then the algorithm ends. If not, return to Step 2 and increment d.

    [Prediction phase]

    • Step 1) The path for a new input data sample is traced from the root node to a leaf node using the split functions. All outputs from each decision tree are calculated using equation (6).

    • Step 2) The averages of all outputs from the decision trees are calculated for each category.

    • Step 3) The category of a new input data sample is predicted using equation (8).
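    The following sketch gives a coarse approximation of the training loop above, assuming scikit-learn is available. Instead of growing all trees breadth-first one level at a time as the true ADF algorithm does, it simply re-fits each tree with the current sample weights at increasing depths; it is only meant to illustrate the alternation between split assignment (Step 2) and the weight update of equation (2) (Step 3), not to reproduce the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N, q, K, T, D = 300, 8, 3, 20, 6
X = rng.normal(size=(N, q))
y = rng.integers(0, K, size=N)

feature_subsets = [rng.choice(q, size=3, replace=False) for _ in range(T)]
trees = [None] * T
weights = np.ones(N)                                   # Step 1: equal weights

for d in range(1, D):                                  # Step 4: loop over depths
    purity_sum = np.zeros(N)
    for t in range(T):                                 # Step 2: (re)fit up to depth d
        clf = DecisionTreeClassifier(max_depth=d, random_state=t)
        clf.fit(X[:, feature_subsets[t]], y, sample_weight=weights)
        trees[t] = clf
        leaves = clf.apply(X[:, feature_subsets[t]])   # leaf reached by each sample
        for leaf in np.unique(leaves):
            in_leaf = leaves == leaf
            counts = np.bincount(y[in_leaf], minlength=K)
            # purity g_{m_dt}(x_i, y_i): fraction of the leaf sharing sample i's label
            purity_sum[in_leaf] += counts[y[in_leaf]] / in_leaf.sum()
    weights = np.exp(-purity_sum / T)                  # Step 3: equation (2)

print(np.round(weights[:5], 3))
```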

    3.PROPOSED METHOD

    In this section, we explain our proposed method for introducing new indicators for each training data sample and decision tree. Our method improves prediction accuracy by utilizing these indicators. Although the conventional method only considers prediction of the categories of the training data, our method accounts for the accurate prediction of new input data samples.

    3.1.Description of the Proposed Method

    First, our method divides the training dataset into two subsets for each decision tree before constructing the decision trees. One subset is used for constructing the decision tree. Thus, each decision tree has access to out-of-bag (OOB) data, which was not used for building the decision tree. In this method, each decision tree predicts the category of each OOB data sample after assigning split functions at the same depth. These prediction results are used to calculate two indicators: the data indicator and tree indicator. The data indicator is used to indicate whether or not the data sample can contribute to the prediction of other training data samples. The tree indicator is used to indicate how accurately the decision tree predicts the category of a new input data sample. Using these indicators helps improve the generalization ability of our method.

    3.2.Utilizing Bootstrapped Samples for Constructing Decision Trees

    Our method utilizes a bootstrap technique prior to constructing the decision trees in order to divide the training dataset into two subsets for each decision tree. Bootstrapping is a random sampling technique that is mainly used in the RF method. Each decision tree trains on a subset of the training dataset determined by the bootstrap technique. Furthermore, each decision tree is assigned training data not used for tree construction.

    After introducing the bootstrap technique, the weights must be changed. ADFs construct decision trees that increase the purity of the training data, and the purity indicates the level of success in learning the training data. Therefore, when using the bootstrap technique, the weights should be calculated from the purity in the decision trees that use the given training data sample. The purity of $(x_i, y_i)$ in decision tree $m_{dt}$, computed from the bootstrap sample used to construct that tree, is denoted by $g^{b}_{m_{dt}}(x_i, y_i)$ and can be derived using

    $$g^{b}_{m_{dt}}(x_i, y_i) = \frac{\sum_{j \in A_t} \delta(y_i, y_j) \times \eta_{m_{dt}}(x_i, x_j)}{\sum_{j \in A_t} \eta_{m_{dt}}(x_i, x_j)} \tag{9}$$

    where $A_t$ is the set of training data samples used for constructing decision tree $m_{dt}$.

    By averaging this purity over the trees that use the data sample, the new weight of $(x_i, y_i)$, denoted by $w^{b}_{d}(x_i, y_i)$, is calculated as

    $$w^{b}_{d}(x_i, y_i) = \exp\left\{ -\frac{1}{|U_i|} \sum_{t \in U_i} g^{b}_{m_{dt}}(x_i, y_i) \right\} \tag{10}$$

    where $U_i$ is the set of decision trees that use the training data sample $(x_i, y_i)$ for construction, and $|\cdot|$ denotes the number of elements in a set.
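    The bootstrap split into in-bag sets $A_t$, OOB sets $B_t$, and the per-sample tree sets $U_i$, together with the weight of equation (10), can be sketched as follows; the purity values are placeholders standing in for equation (9) rather than quantities computed from real trees.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 12, 4                                   # tiny sizes for illustration

# Bootstrap sample A_t for each tree and the complementary OOB set B_t.
in_bag = [np.unique(rng.integers(0, N, size=N)) for _ in range(T)]   # A_t (membership only)
oob    = [np.setdiff1d(np.arange(N), a) for a in in_bag]             # B_t

# U_i: the trees whose bootstrap sample contains training sample i.
U = [[t for t in range(T) if i in in_bag[t]] for i in range(N)]

def bootstrap_weights(purity: np.ndarray) -> np.ndarray:
    """Weights of equation (10): for each sample, exp(-average purity) over
    the trees that actually used it, where purity[t, i] stands for the
    in-bag purity g^b_{m_dt}(x_i, y_i) of equation (9)."""
    w = np.empty(N)
    for i in range(N):
        w[i] = np.exp(-np.mean([purity[t, i] for t in U[i]])) if U[i] else 1.0
    return w

purity = rng.uniform(size=(T, N))              # placeholder purity values
print(bootstrap_weights(purity))
```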

    3.3.Changing the Weighting Method for the Training Data

    As mentioned previously, the conventional ADF method assigns weights to each training data sample and attempts to accurately predict the categories of all training data samples. In other words, this method treats all training data equally. However, if there are outliers in the training dataset and certain decision trees focus on these outliers, those decision trees will tend to overfit the outliers and the generalization ability of the entire model will be negatively impacted. Our method changes the way in which weights are assigned to the training data in order to prevent such overfitting. To achieve this, the data indicator is introduced. This indicator represents the degree to which a data sample can contribute to the prediction of the categories of other samples, and it is computed for every data sample. It is derived from the prediction results for the data sample in the decision trees that treat it as OOB data, and it is used when assigning weights to the training data. Here, the average OOB purity of training data sample $(x_i, y_i)$ is denoted as $g_o(x_i, y_i)$ and is calculated using

    $$g_o(x_i, y_i) = \frac{1}{|O_i|} \sum_{t \in O_i} \frac{\sum_{j \in A_t} \delta(y_i, y_j) \times \eta_{m_{dt}}(x_i, x_j)}{\sum_{j \in A_t} \eta_{m_{dt}}(x_i, x_j)} \tag{11}$$

    where $O_i$ is the set of decision trees that do not use the training data sample $(x_i, y_i)$ for construction.

    This value represents how well the category of a data sample can be predicted by a model constructed from the other training data. If this value is small, there is a possibility that the training data sample is an outlier. Here, the data indicator for $(x_i, y_i)$ is denoted as $N_d(x_i, y_i)$. It is calculated as

    $$N_d(x_i, y_i) = \exp\{ -\{ 1 - g_o(x_i, y_i) \} \} \tag{12}$$

    This indicator has a large value when $g_o(x_i, y_i)$ is large. $w^{b}_{d}(x_i, y_i)$ has a large value when the average purity of $(x_i, y_i)$ in the decision trees that use the data sample is small. Therefore, this method emphasizes data samples for which both $w^{b}_{d}(x_i, y_i)$ and the data indicator are large. The method assigns a weight to each training data sample by combining $w^{b}_{d}(x_i, y_i)$ and the data indicator. Furthermore, it introduces a tuning parameter that changes the degree to which the two terms are mixed for each decision tree. When the exponential loss is used, the weight used by decision tree $m_{dt}$ can be expressed as

    $$w_{dt}(x_i, y_i) = w^{b}_{d}(x_i, y_i) + C_{dt} \times N_d(x_i, y_i) \tag{13}$$

    where $C_{dt}$ is a tuning parameter for decision tree $m_{dt}$.

    This method then sets $C_{dt}$ by using $PUR_{dt}$ and $OOBP_{dt}$, as shown in

    $$C_{dt} = PUR_{dt} - OOBP_{dt} \tag{14}$$

    where

    $$PUR_{dt} = \frac{1}{|A_t|} \sum_{j \in A_t} \frac{\sum_{i \in A_t} \delta(y_i, y_j) \times \eta_{m_{dt}}(x_i, x_j)}{\sum_{i \in A_t} \eta_{m_{dt}}(x_i, x_j)} \tag{15}$$

    $$OOBP_{dt} = \frac{1}{|B_t|} \sum_{j \in B_t} \frac{\sum_{i \in A_t} \delta(y_i, y_j) \times \eta_{m_{dt}}(x_i, x_j)}{\sum_{i \in A_t} \eta_{m_{dt}}(x_i, x_j)} \tag{16}$$

    where $B_t$ is the set of training data samples that are not used for constructing decision tree $m_{dt}$.

    Here, $PUR_{dt}$ is the average purity of the data samples used when constructing decision tree $m_{dt}$, and $OOBP_{dt}$ is the average purity of the data samples not used when constructing it. By using $C_{dt}$, decision trees with low generalization ability mainly focus on non-outlier data samples, and decision trees with high generalization ability mainly focus on the data samples with low purity. Therefore, this method effectively improves purity and generalization ability while allowing each decision tree to have different features.
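    The per-tree quantities of equations (14)-(16) and the combined weight of equation (13) can be sketched for a single tree as follows; the leaf assignments, the in-bag/OOB split, and the stand-ins for $g_o$ and $w^{b}_{d}$ are all synthetic and only illustrate the bookkeeping, not the authors' implementation.

```python
import numpy as np

def leaf_class_fractions(leaves_inbag, y_inbag, K):
    """Per-leaf class frequencies, estimated from the in-bag data."""
    frac = {}
    for leaf in np.unique(leaves_inbag):
        counts = np.bincount(y_inbag[leaves_inbag == leaf], minlength=K)
        frac[leaf] = counts / counts.sum()
    return frac

def tree_statistics(leaves_all, y, in_bag, oob, K):
    """PUR_dt and OOBP_dt (equations (15), (16)) plus the per-sample purity
    needed for g_o in equation (11), for a single tree.
    leaves_all[i] is the leaf that sample i reaches in this tree."""
    frac = leaf_class_fractions(leaves_all[in_bag], y[in_bag], K)
    # purity of a sample = frequency of its own label among in-bag data in its leaf
    per_sample = np.array([frac.get(leaves_all[i], np.zeros(K))[y[i]]
                           for i in range(len(y))])
    PUR  = per_sample[in_bag].mean()
    OOBP = per_sample[oob].mean()
    return PUR, OOBP, per_sample

# --- toy usage with made-up leaf assignments -------------------------------
rng = np.random.default_rng(0)
N, K = 20, 3
y = rng.integers(0, K, size=N)
leaves_all = rng.integers(0, 4, size=N)          # pretend tree with 4 leaves
in_bag = np.sort(rng.choice(N, size=14, replace=False))
oob = np.setdiff1d(np.arange(N), in_bag)

PUR, OOBP, per_sample = tree_statistics(leaves_all, y, in_bag, oob, K)
C = PUR - OOBP                                   # equation (14)

g_o = per_sample                                 # single-tree stand-in for eq. (11)
N_d = np.exp(-(1.0 - g_o))                       # equation (12)
w_b = np.exp(-per_sample)                        # stand-in for equation (10)
w   = w_b + C * N_d                              # equation (13)
print(round(PUR, 3), round(OOBP, 3), np.round(w[:5], 3))
```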

    3.4.Reassigning Attribute Variables Assigned to Decision Trees with Large OOB Errors

    Similar to Section 3.3, this method utilizes the results of category prediction for the OOB data. Specifically, it calculates the error rate of category prediction on the OOB data for each decision tree. These error rates are then introduced as a new indicator, the tree indicator. The error rate of decision tree $m_{dt}$, denoted by $OOBE_{dt}$, can be expressed as

    $$OOBE_{dt} = \frac{1}{|B_t|} \sum_{j \in B_t} \frac{\sum_{i \in A_t} \{1 - \delta(y_i, y_j)\} \times \eta_{m_{dt}}(x_i, x_j)}{\sum_{i \in A_t} \eta_{m_{dt}}(x_i, x_j)} \tag{17}$$

    This tree indicator represents the decision tree’s error rate for new data samples. If $OOBE_{dt}$ is high, it can be said that the generalization ability of the decision tree is low. The proposed method addresses the problem of low generalization ability by exploiting the fact that the way attribute variables are selected affects prediction accuracy (Genuer et al., 2010). As mentioned before, each decision tree is allocated only a subset of attribute variables so that all decision trees have different features. Therefore, we can utilize the tree indicator as a measure of the importance of attribute variables from the perspective of improving generalization ability during the learning phase. Specifically, this method changes the attribute variables allocated to decision trees with a higher $OOBE_{dt}$ when assigning split functions. For example, the attribute variables of decision trees whose $OOBE_{dt}$ scores fall within the top β% may be changed. When reallocating attribute variables to a decision tree, this method selects attribute variables that have not previously been assigned to that tree. Thus, each decision tree is given different features and the ensemble model can consider many different interactions between attribute variables. Consequently, this method achieves excellent diversity between trees while considering the importance of various attribute variables. All of these advantages lead to improved prediction accuracy.
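    One simplified reading of this reallocation rule is sketched below: trees whose OOB error falls in the top β% receive a fresh attribute subset drawn from attributes they have not used before. The function name and the exact redraw policy are assumptions made for illustration, since the text does not specify them in code form.

```python
import numpy as np

def reassign_worst_trees(oob_errors, feature_subsets, q, beta, rng):
    """Reassign attribute subsets for the trees whose OOB error OOBE_dt
    (equation (17)) falls in the top beta percent, drawing new attributes
    from those the tree has not used before (simplified, illustrative rule)."""
    oob_errors = np.asarray(oob_errors)
    n_worst = max(1, int(np.ceil(len(oob_errors) * beta / 100.0)))
    worst = np.argsort(oob_errors)[-n_worst:]           # largest errors
    for t in worst:
        used = set(feature_subsets[t])
        unused = [f for f in range(q) if f not in used]
        k = len(feature_subsets[t])
        if len(unused) >= k:                            # draw a fresh subset
            feature_subsets[t] = rng.choice(unused, size=k, replace=False)
    return feature_subsets

# Toy usage: 6 trees over q = 10 attributes, 3 attributes per tree, beta = 33%.
rng = np.random.default_rng(0)
q, T, beta = 10, 6, 33
subsets = [rng.choice(q, size=3, replace=False) for _ in range(T)]
oob_errors = rng.uniform(size=T)                        # pretend OOBE_dt values
print(reassign_worst_trees(oob_errors, subsets, q, beta, rng))
```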

    3.5.Algorithm for the Proposed Method

    The prediction phase is the same as in conventional ADFs. Thus, we summarize only the steps of the training phase of the proposed method in this section.

    [Training phase]

    • Step 1) Assign a subset of the training dataset using the bootstrap technique and allocate a portion of the attribute variables to each decision tree.

    • Step 2) Assign equal weights to all training data samples and make trees of depth 1. Let d = 1.

    • Step 3) Assign split functions to all nodes at depth d by using the specific attribute variables and considering the weights.

    • Step 4) Calculate new weights for all training data samples using equation (13).

    • Step 5) Reassign the set of attribute variables for decision trees whose $OOBE_{dt}$ values are relatively high.

    • Step 6) If d +1 becomes D, the predefined maximum depth, the algorithm ends. If not, increment d and return to Step 3.

    4.EXPERIMENT

    4.1.Experimental Conditions

    In order to verify the effectiveness of our proposed method, we conducted simulation experiments using datasets from the UCI machine learning repository (Blake and Merz, 1998). In these experiments, we compared the prediction accuracy of conventional ADFs to that of the proposed method. We measured prediction accuracy by the error rate defined as

    $$\text{error rate}\,(\%) = \frac{err}{test} \times 100 \tag{18}$$

    where err is the number of test data samples predicted incorrectly and test is the total number of test data samples.

    The simulation was executed 50 times. The test data and training data were randomized for each simulation and the results were averaged. We used four datasets from the UCI machine learning repository for the simulations.

    Detailed simulation information is presented in Table 1.

    Prior to comparison, we performed preliminary experiments to determine the appropriate simulation parameters. The parameter β was determined by performing cross-validation on the training data of each dataset, as sketched below. The remaining parameters were set to sufficiently large values, because increasing them monotonically improved prediction accuracy up to a specific threshold without causing overfitting. The parameters used are listed in Table 2.
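    A hedged sketch of such a cross-validation search for β is given below; `choose_beta` and `train_and_score` are hypothetical helpers (the latter standing in for training the proposed model and returning a validation error rate), and the dummy scorer exists only so the example runs end to end.

```python
import numpy as np
from sklearn.model_selection import KFold

def choose_beta(X, y, candidate_betas, train_and_score, n_splits=5, seed=0):
    """Pick beta by k-fold cross-validation on the training data.
    train_and_score(Xtr, ytr, Xva, yva, beta) is assumed to train the model
    with the given beta and return its validation error rate."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mean_errors = []
    for beta in candidate_betas:
        errs = [train_and_score(X[tr], y[tr], X[va], y[va], beta)
                for tr, va in kf.split(X)]
        mean_errors.append(np.mean(errs))
    return candidate_betas[int(np.argmin(mean_errors))]

# Example call with a dummy scorer standing in for the real training routine:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 4)), rng.integers(0, 2, size=60)
dummy = lambda Xtr, ytr, Xva, yva, beta: abs(beta - 20) / 100 + rng.uniform(0, 0.01)
print(choose_beta(X, y, [10, 20, 30, 40], dummy))   # -> 20 with this dummy scorer
```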

    4.2.Experiments Results for Prediction Accuracy

    The results of the error rate comparison are presented in Table 3.

    As seen in Table 3, the proposed method is superior to conventional ADFs and RF except on the musk dataset. In particular, the effectiveness of the proposed method is remarkable for the “CNAE-9” and “LSVT” datasets. This is because the total number of attribute variables is relatively large, while many of them are unimportant for category prediction. The proposed method can utilize important attribute variables many times and eliminate unimportant ones. In recent years, the number of attribute variables and the size of datasets have increased owing to the plummeting cost of information storage. Thus, the proposed method is likely to become even more effective for real-world problems in the future.

    Moreover, the proposed method outperforms ADFs and RF even when the number of attribute variables is small, e.g., on the hill, diabetic, ionosphere, index, and vehicle datasets. This shows that the proposed method performs well even if the number of attribute variables is small.

    In contrast, the proposed method performed relatively poorly on the musk dataset, possibly because this dataset contains few outliers, as suggested by the extremely low error rates of all methods. In this case, the data indicator, which is designed to prevent overfitting to outliers, appears instead to have slightly decreased prediction accuracy.

    However, datasets with such a small number of outliers may be rare. Therefore, this method is effective when applied to real-world problems because it can reduce the influence of outliers.

    4.3.Experiment to Determine Robustness of the Proposed Method

    In order to verify the robustness of the proposed method, its generalization ability was compared to that of the conventional method in an experiment with a dataset containing artificially added outliers. The “ionosphere” dataset was used as the basis for this experiment. In order to add outliers to the training data, the categories of randomly selected training data samples were changed; this was repeated 50 times to create different rates of outliers. The error rates of the proposed and conventional methods were then compared. All other conditions were the same as in the previous experiments. The results of the experiment are shown in Figure 4.

    As seen in Figure 4, when the number of outliers is small, the error rates of the two methods are similar. As the number of outliers increases, the error rate of conventional ADFs becomes greater than that of the proposed method. Therefore, it can be said that the proposed method is robust because reallocating attribute variables improved prediction accuracy and the new weighting method prevented overfitting. From these results, we can say that the proposed method is superior to the conventional method when using a dataset with many outliers.

    5.CONCLUSIONS AND FUTURE WORK

    We proposed a new decision tree ensemble method based on ADFs that introduces two new indicators. The proposed method changes the technique for calculating the weights and reallocates the sets of attribute variables assigned to decision trees. The results of simulation experiments showed that our proposed method is superior to conventional ADFs.

    In the future, we plan to research additional ways to utilize the two indicators and improve prediction accuracy. Additionally, a method to automatically determine the value of parameter β is needed.

    ACKNOWLEDGEMENT

    The authors would like to thank Mr. Gendo Kumoi, Dr. Haruka Yamashita, and all members of the Goto Laboratory for their support of our research. A portion of this study was supported by JSPS KAKENHI Grant Numbers 26282090 and 26560167.

    Figure

    Figure 1. An example decision tree.

    Figure 2. An example of prediction using an ensemble of trees.

    Figure 3. Example of the prediction method of ADFs.

    Figure 4. Rate of outliers and error rate.

    Table

    Table 1. Features of datasets used in the simulations

    Table 2. Experimental parameters

    Table 3. Error rates of each dataset

    REFERENCES

    1. Blake, C. and Merz, C. (1998), UCI repository of machine learning databases, Available from: http://www.ics.uci.edu/-mlearn/mlrepository.html
    2. Breiman, L. (2001), Random forests, Mach. Learn., 45(1), 5-32.
    3. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Wadsworth.
    4. Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. (2017), AdaNet: Adaptive structural learning of artificial neural networks.
    5. Friedman, J., Hastie, T., and Tibshirani, R. (2000), Additive logistic regression: A statistical view of boosting, Ann. Stat., 28(2), 337-407.
    6. Freund, Y. and Schapire, R. (1996), Experiments with a new boosting algorithm, Proceedings of the Thirteenth International Conference on Machine Learning, 148-156.
    7. Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010), Variable selection using random forests, Pattern Recognit. Lett., 31(14), 2225-2236.
    8. Opitz, D. and Maclin, R. (1999), Popular ensemble methods: An empirical study, J. Artif. Intell. Res., 11, 169-198.
    9. Ren, S., Cao, X., Wei, Y., and Sun, J. (2015), Global refinement of random forest, IEEE Conference on Computer Vision and Pattern Recognition, 723-730.
    10. Schulter, S., Leistner, C., Wohlhart, P., Roth, P. M., and Bischof, H. (2013), Alternating regression forests for object detection and pose estimation, IEEE International Conference on Computer Vision.
    11. Schulter, S., Wohlhart, P., Leistner, C., Saffari, A., Roth, P. M., and Bischof, H. (2013), Alternating decision forests, IEEE Conference on Computer Vision and Pattern Recognition, 508-515.
    12. Wolpert, D. (1992), Stacked generalization, Neural Netw., 5(2), 241-259.