1. INTRODUCTION
With the rapid development of the information society, technology for automatic document classification has become increasingly important. Document classification is now an important tool of business analytics, even in the field of industrial management. In document classification, machine learning techniques are usually used to construct an appropriate classifier. After a classification rule is learned from pre-labeled training data (supervised data), a new input datum can be classified into an appropriate category according to its content. In the usual setting, it is assumed that a new input datum can be classified into one of the categories observed in the training data (Ishida, 2006). However, if a new datum belonging to an unobserved category, one that does not exist in the training data, is input to the learned classifier, it cannot be classified correctly.
Reject rules in classification have been studied as a method to discriminate whether a datum belongs to an unobserved category (Chow, 1970). This approach first requires setting a threshold on some measurement, and then detects data exceeding the threshold as outliers. The threshold should therefore be determined according to the number of data to be detected. In practice, however, it can be difficult to determine an optimal threshold.
To solve the above problem, Arakawa et al. (2013) proposed a method that models the generative probabilities of documents with a mixture of Polya distributions and estimates the optimal category among all observed and unobserved categories, where it is assumed that the documents in each category are generated from a single Polya distribution. This method is based on the idea of partially supervised classification (Amini and Usunier, 2015; Baluja, 1998; Bishop, 2010; Castelli and Cover, 1996; Liu et al., 2002; Miller and Uyar, 1997; Nigam et al., 2000; Seeger, 2001; Szummer, 2002; Szummer and Jaakkola, 2002a, 2002b). By introducing this idea (Nigam et al., 2000), Arakawa's method solved the problem of unobserved categories. Here, a mixture of Polya distributions is the model obtained by placing Dirichlet priors (Minka, 2003) on the unigram mixture, which is based on the multinomial distribution (Masada et al., 2007; Sadamitsu et al., 2005; Ushiama et al., 2010). In contrast to multi-topic models such as LDA (Latent Dirichlet Allocation), which represent the plural latent topics of a document by assuming that each word in the document has its own topic (Blei et al., 2003), a mixture of Polya distributions is known as a single-topic model, which assumes that each document has a single topic. A single-topic model can assign categories to unlabeled object data more explicitly, and is therefore better suited to detecting unobserved categories.
However, the statistical characteristics of document categories are generally more complicated in practice than the situation assumed by Arakawa's conventional method. For example, the documents assigned to the category “sports” can be considered a mixture of various underlying latent topics such as “baseball,” “soccer,” “tennis,” and so on. In Arakawa's method, each category is modeled by a single Polya distribution, so the method cannot represent the variation in word frequency caused by unobserved latent topics. In other words, Arakawa's conventional method cannot account for plural latent topics within a category.
In this paper, we propose a method that models the generative probabilities of the documents in a category with a mixture of Polya distributions (the mixed Polya distribution) by letting a single Polya distribution correspond to each latent topic. We show that the proposed method outperforms Arakawa's conventional one in terms of classification accuracy through classification experiments on a set of English newspaper articles.
2. PRELIMINARIES
In this section, we first give the definitions used in the document classification problem, and then review the previous study of the mixture of Polya distributions, in which a single Polya distribution is assumed for the generative probability of the documents in a category.
2.1. Document Classification
First, we denote by $D={\left\{\left({x}_{n},\,{y}_{n}\right)\right\}}_{n=1}^{N}$ the training dataset for learning and by C = {c_{1}, c_{2}, …, c_{K}} a set of categories, where y_{n} ∈ C. Here, N is the number of training data and K is the number of categories. The training data are labeled data used to build a classification model. The statistical characteristics of each document x_{n} are expressed by bag-of-words. Letting W = {w_{1}, w_{2}, …, w_{V}} be the set of words that appear in all training documents, the nth document x_{n} is represented by the discrete vector x_{n} = (x_{n1}, x_{n2}, …, x_{nV}), where w_{v} is the vth word in the word set W and x_{nv} is the frequency of the vth word in the nth document. In the document classification problem, after a set of training data is given, a new unlabeled input datum x_{n+1} is classified into a category in C. More precisely, an estimate of the appropriate category of the new input datum, ${\widehat{y}}_{n+1}$, is determined from the posterior probability as follows:

$${\widehat{y}}_{n+1}=\underset{{c}_{k}\in C}{\arg\max}\;P\left({y}_{n+1}={c}_{k}\mid {x}_{n+1}\right).\tag{1}$$
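The bag-of-words representation described above can be sketched as follows. The toy corpus, whitespace tokenization, and all variable names are our own illustrative assumptions, not part of the original method.

```python
# Sketch: each document x_n becomes a vector of word frequencies over the
# vocabulary W built from all training documents (bag-of-words).
from collections import Counter

docs = [
    "the pitcher threw the ball",
    "the striker kicked the ball",
]

# W: every word appearing in the training documents, in first-seen order.
vocab = []
for doc in docs:
    for w in doc.split():
        if w not in vocab:
            vocab.append(w)

def bag_of_words(doc, vocab):
    """Return x_n = (x_n1, ..., x_nV): the frequency of each vocab word."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

X = [bag_of_words(d, vocab) for d in docs]
```

Each row of `X` is one frequency vector x_n; a new document is vectorized against the same vocabulary before classification.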
However, it is usually assumed that a new input datum x_{n+1} is classified into one of the observed categories that exist in the training data, C = {c_{1}, c_{2}, …, c_{K}}. Consequently, if a new input datum x_{n+1} belongs to an unobserved category c_{K+1}, which does not exist in the training data, it cannot be classified correctly. Arakawa et al. (2013) proposed a method to solve this problem. We describe Arakawa's method as the conventional one in Section 3.
2.2. A Mixture of Polya Distributions
On the assumption that the documents in a category are generated from a single Polya distribution, a classification method that models all documents with a mixture of Polya distributions has been proposed (Yamamoto and Sadamitsu, 2005). Letting the number of mixture components be M, the mixture ratio be λ = (λ_{1}, λ_{2}, …, λ_{M}), and the parameter of the mth Polya distribution P_{polya}(x_{n}; α_{m}) be α_{m} = (α_{m1}, α_{m2}, …, α_{mV}), the mixed Polya distribution is defined as follows:

$$P\left({x}_{n};\,\lambda,\,\alpha\right)={\displaystyle \sum _{m=1}^{M}{\lambda}_{m}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{m}\right)},\tag{2}$$
where α = (α_{1}, α_{2}, …, α_{M}) is the parameter vector consisting of the M component vectors, ${\alpha}_{m}={\displaystyle {\sum}_{v=1}^{V}{\alpha}_{mv}}$ is the sum of the parameters α_{mv}, and x_{n} is the total word frequency in the nth document x_{n}, that is, ${x}_{n}={\displaystyle {\sum}_{v=1}^{V}{x}_{nv}}$.
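The Polya density and the mixture in equation (2) can be sketched as follows. `log_polya` uses the standard closed form of the Dirichlet-multinomial (Polya) distribution, written with `math.lgamma` so the sketch stays self-contained; the function names are ours.

```python
import math

def log_polya(x, alpha):
    """Log-probability of the count vector x under a Polya
    (Dirichlet-multinomial) distribution with parameter vector alpha."""
    n, a = sum(x), sum(alpha)
    # Multinomial coefficient n! / prod_v x_v!
    out = math.lgamma(n + 1) - sum(math.lgamma(xv + 1) for xv in x)
    # Gamma(a) / Gamma(a + n)
    out += math.lgamma(a) - math.lgamma(a + n)
    # prod_v Gamma(alpha_v + x_v) / Gamma(alpha_v)
    out += sum(math.lgamma(av + xv) - math.lgamma(av)
               for xv, av in zip(x, alpha))
    return out

def log_mixed_polya(x, lam, alphas):
    """Equation (2): log sum_m lambda_m P_polya(x; alpha_m), via log-sum-exp."""
    logs = [math.log(l) + log_polya(x, a) for l, a in zip(lam, alphas)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(v - mx) for v in logs))
```

A quick sanity check is that, for a fixed total word count, the probabilities of all possible count vectors sum to one, both for a single component and for the mixture.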
3. CONVENTIONAL METHOD
In this section, we review the conventional method by Arakawa et al. (2013). This method is based on the idea of partially supervised classification (Liu et al., 2002; Nigam et al., 2000). Usually, a classifier is first learned from the training dataset and then used to classify new input data. In the partially supervised setting, by contrast, the training data (labeled data) and the new input data (unlabeled data) are used together by the learning algorithm. For some specific purposes, this approach is useful in practice. By introducing the idea of partially supervised classification, Arakawa's method solved the problem of unobserved categories.
The method by Arakawa et al. (2013) is based on a mixture of Polya distributions, where it is assumed that the documents in each category are generated from a single Polya distribution. In order to handle data with unobserved categories, they set the number of mixed Polya distributions to M (M ≥ K + 1), where each of K Polya distributions models one of the observed categories, while the remaining M − K Polya distributions model “a set of unobserved categories.”
3.1. Estimating Parameters
In Arakawa's conventional method, a classification rule is learned not only from the pre-labeled training data but also from the unlabeled classification object data, under the framework of semi-supervised learning. Usually, unlabeled classification object data are input to a classifier obtained by learning from the pre-labeled training data. In Arakawa's method, however, both the pre-labeled training data and the unlabeled classification object data are used as input for semi-supervised learning. Here, we redefine C as a set of categories {c_{1}, c_{2}, …, c_{K}, c_{K+1}}, where c_{1}, c_{2}, …, c_{K} are observed in the training data while c_{K+1} is unobserved. The training data (labeled data) of size N_{L} are denoted by ${D}_{L}={\left\{\left({x}_{n},\,{y}_{n}\right)\right\}}_{n=1}^{{N}_{L}}$, where y_{n} ∈ {c_{1}, c_{2}, …, c_{K}}. The classification object data (unlabeled data) of size N_{T} are denoted by ${D}_{T}={\left\{\left({x}_{n},\,{y}_{n}\right)\right\}}_{n={N}_{L}+1}^{{N}_{L}+{N}_{T}}$, where y_{n} ∈ {c_{1}, c_{2}, …, c_{K}, c_{K+1}} and y_{n} for n = N_{L} + 1, …, N_{L} + N_{T} have not yet been observed. Assuming that D_{L} and D_{T} are generated independently, the log likelihood function is given by the following equation:

$$L\left(\lambda,\,\alpha\right)={\displaystyle \sum _{n=1}^{{N}_{L}}{\displaystyle \sum _{k=1}^{K}{\sigma}_{nk}\log\left[{\lambda}_{k}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{k}\right)\right]}}+{\displaystyle \sum _{n={N}_{L}+1}^{{N}_{L}+{N}_{T}}\log{\displaystyle \sum _{m=1}^{M}{\lambda}_{m}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{m}\right)}},\tag{3}$$
where σ_{nk} is an indicator that takes the value 1 if the nth document x_{n} belongs to category c_{k}, and 0 otherwise. Since the categories of the training data are observed, the first term of equation (3) can be expressed with the indicator σ_{nk}. The second term of equation (3), on the other hand, is expressed by the mixed probabilities because its categories have not been observed.
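As a numerical sketch of this log likelihood, the following computes the first (labeled) term with the indicator fixed by y_n and the second (unlabeled) term as a log-sum-exp over the M components. The toy data, parameter values, function names, and the convention that observed category k corresponds to mixture component k are our own illustrative assumptions.

```python
import math

def log_polya(x, alpha):
    """Log Dirichlet-multinomial (Polya) probability of count vector x."""
    n, a = sum(x), sum(alpha)
    out = math.lgamma(n + 1) - sum(math.lgamma(xv + 1) for xv in x)
    out += math.lgamma(a) - math.lgamma(a + n)
    out += sum(math.lgamma(av + xv) - math.lgamma(av)
               for xv, av in zip(x, alpha))
    return out

def log_likelihood(X_L, y_L, X_T, lam, alphas):
    """Labeled term: sum_n log[lam_k P_polya(x_n; alpha_k)] with k = y_n
    (assumes observed category k is modeled by component k).
    Unlabeled term: sum_n log sum_m lam_m P_polya(x_n; alpha_m)."""
    ll = sum(math.log(lam[y]) + log_polya(x, alphas[y])
             for x, y in zip(X_L, y_L))
    for x in X_T:
        logs = [math.log(l) + log_polya(x, a) for l, a in zip(lam, alphas)]
        mx = max(logs)
        ll += mx + math.log(sum(math.exp(v - mx) for v in logs))
    return ll
```

Because every factor inside the logs is a probability, the value is always negative, and adding unlabeled documents can only decrease it.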
In order to estimate locally optimal values of the parameters λ_{m} and α_{mv}, the EM algorithm is adopted (Dempster et al., 1977). The EM algorithm is an iterative method for estimating suboptimal parameters that locally maximize the above log likelihood function. It alternates the E-step, which calculates the expectation of the log likelihood, and the M-step, which updates the parameter values to maximize it. Here, we denote by z_{n} the latent variable representing the latent category from which the nth document x_{n} is generated, that is, z_{n} ∈ C. In the EM algorithm for the estimation of the mixed Polya distribution, the E-step reduces to calculating the posterior probabilities of the latent variables.
In the E-step, the posterior probability P_{nm} that the latent variable z_{n} corresponds to the mth component of the mixed Polya distributions is calculated. In the M-step, the parameters λ_{m} and α_{mv} are updated so as to maximize the log likelihood function. The steps of the EM algorithm are given by the following equations. First, the E-step is as follows:

[E-step]

$${P}_{nm}=\frac{{\lambda}_{m}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{m}\right)}{{\displaystyle \sum _{{m}^{\prime}=1}^{M}{\lambda}_{{m}^{\prime}}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{{m}^{\prime}}\right)}},\tag{4}$$

where λ_{m} and α_{m} are the current parameters in the EM algorithm, and P_{nm} is fixed to the indicator σ_{nm} for the labeled data (n = 1, …, N_{L}). Then, the M-step is as follows:

[M-step]

$${\lambda}_{m}=\frac{{\displaystyle \sum _{n=1}^{{N}_{L}}{\sigma}_{nm}}+{\displaystyle \sum _{n={N}_{L}+1}^{{N}_{L}+{N}_{T}}{P}_{nm}}}{{N}_{L}+{N}_{T}},\tag{5}$$

$${\alpha}_{mv}\leftarrow {\alpha}_{mv}\,\frac{{\displaystyle \sum _{n}{P}_{nm}{\beta}_{nmv}}}{{\displaystyle \sum _{n}{P}_{nm}{\gamma}_{nm}}},\tag{6}$$

where β_{nmv} and γ_{nm} are given by

$${\beta}_{nmv}=\Psi\left({x}_{nv}+{\alpha}_{mv}\right)-\Psi\left({\alpha}_{mv}\right),\tag{7}$$

$${\gamma}_{nm}=\Psi\left({x}_{n}+{\alpha}_{m}\right)-\Psi\left({\alpha}_{m}\right),\tag{8}$$

and Ψ(·) denotes the digamma function.
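A minimal sketch of one EM iteration is given below, under two assumptions of ours: labeled documents keep fixed indicator responsibilities, and α is updated by a Minka-style fixed-point ratio built from β_nmv and γ_nm. Because word counts are integers, the digamma differences reduce to the exact finite sums Ψ(α + c) − Ψ(α) = Σ_{j<c} 1/(α + j), so no digamma routine is needed. All names are ours.

```python
import math

def log_polya(x, alpha):
    """Log Dirichlet-multinomial (Polya) probability of count vector x."""
    n, a = sum(x), sum(alpha)
    out = math.lgamma(n + 1) - sum(math.lgamma(xv + 1) for xv in x)
    out += math.lgamma(a) - math.lgamma(a + n)
    out += sum(math.lgamma(av + xv) - math.lgamma(av)
               for xv, av in zip(x, alpha))
    return out

def em_step(X_L, y_L, X_T, lam, alphas):
    """One EM iteration for the semi-supervised Polya mixture (a sketch).
    Assumes observed category k is modeled by component k."""
    M, V = len(lam), len(alphas[0])
    # E-step: responsibilities of the unlabeled documents, via log-sum-exp.
    R = []
    for x in X_T:
        logs = [math.log(lam[m]) + log_polya(x, alphas[m]) for m in range(M)]
        mx = max(logs)
        w = [math.exp(v - mx) for v in logs]
        Z = sum(w)
        R.append([wi / Z for wi in w])
    # Labeled documents keep the fixed indicator sigma_nm.
    resp = [[1.0 if y == m else 0.0 for m in range(M)] for y in y_L] + R
    X = X_L + X_T
    # M-step: mixture weights.
    lam_new = [sum(r[m] for r in resp) / len(X) for m in range(M)]
    # M-step: fixed-point update of alpha, using the exact identity
    # psi(a + c) - psi(a) = sum_{j<c} 1/(a + j) for integer counts c.
    alphas_new = []
    for m in range(M):
        a, abar = alphas[m], sum(alphas[m])
        num, den = [0.0] * V, 0.0
        for x, r in zip(X, resp):
            for v in range(V):
                num[v] += r[m] * sum(1.0 / (a[v] + j) for j in range(x[v]))
            den += r[m] * sum(1.0 / (abar + j) for j in range(sum(x)))
        if den > 0.0:
            alphas_new.append([a[v] * num[v] / den for v in range(V)])
        else:
            alphas_new.append(list(a))  # leave an empty component untouched
    return lam_new, alphas_new
```

Iterating `em_step` until the log likelihood stops increasing yields locally optimal parameters.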
3.2. Classification
When the EM algorithm has converged, the posterior probability for each test datum can be calculated. The posterior probability $P\left({z}_{n}\mid {x}_{n};\,\overline{\lambda},\,\overline{\alpha}\right)$ of each component for the nth document x_{n} is given by the following equation:

$$P\left({z}_{n}={c}_{m}\mid {x}_{n};\,\overline{\lambda},\,\overline{\alpha}\right)=\frac{{\overline{\lambda}}_{m}{P}_{\text{polya}}\left({x}_{n};\,{\overline{\alpha}}_{m}\right)}{{\displaystyle \sum _{{m}^{\prime}=1}^{M}{\overline{\lambda}}_{{m}^{\prime}}{P}_{\text{polya}}\left({x}_{n};\,{\overline{\alpha}}_{{m}^{\prime}}\right)}},\tag{9}$$

where ${\overline{\lambda}}_{m}$ and ${\overline{\alpha}}_{m}$ are the converged parameters of the EM algorithm. The first through Kth posterior probabilities designate the assignment probabilities to the observed categories c_{1}, c_{2}, …, c_{K}, which appear in the training dataset. The sum of the (K + 1)th through Mth posterior probabilities, on the other hand, designates the assignment probability to “a set of unobserved categories,” c_{K+1}. Namely, the assignment probability to each category is given as follows:

$$P\left({y}_{n}={c}_{k}\mid {x}_{n};\,\overline{\lambda},\,\overline{\alpha}\right)=\left\{\begin{array}{ll}P\left({z}_{n}={c}_{k}\mid {x}_{n};\,\overline{\lambda},\,\overline{\alpha}\right)& \left(k=1,\,\dots,\,K\right)\\ {\displaystyle \sum _{m=K+1}^{M}P\left({z}_{n}={c}_{m}\mid {x}_{n};\,\overline{\lambda},\,\overline{\alpha}\right)}& \left(k=K+1\right)\end{array}\right.\tag{10}$$

Finally, each unlabeled test datum is classified into the category ${\widehat{y}}_{n}$ that maximizes equation (10), that is,

$${\widehat{y}}_{n}=\underset{{c}_{k}\in C}{\arg\max}\;P\left({y}_{n}={c}_{k}\mid {x}_{n};\,\overline{\lambda},\,\overline{\alpha}\right).\tag{11}$$
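The classification rule of equations (10) and (11) can be sketched as follows, given the vector of component posteriors from equation (9). The helper name and the zero-based convention that index K denotes c_{K+1} are our own.

```python
def classify(post, K):
    """post = (P_n1, ..., P_nM): component posteriors of one document.
    Categories 0..K-1 (i.e. c_1..c_K) take their own component's
    posterior; the unobserved category (index K, i.e. c_{K+1}) takes
    the summed posterior of the remaining M - K components.  Returns
    the argmax category index and the per-category probabilities."""
    probs = list(post[:K]) + [sum(post[K:])]
    y_hat = max(range(K + 1), key=lambda k: probs[k])
    return y_hat, probs
```

For example, with K = 3 and M = 5, a document whose mass is concentrated in the two extra components is assigned to the unobserved category.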
4. PROPOSED METHOD
In Arakawa's conventional method, a single Polya distribution models the generative probability of the documents in each category. However, the statistical characteristics of document categories are generally more complicated than the situation assumed in that study. For example, the documents assigned to the “sports” category can be considered a mixture of various underlying latent topics: “baseball,” “soccer,” “tennis,” and so on. Moreover, the structure of document categories is supposed to be hierarchical: a category with broad information exists in an upper layer, while various latent topics split under it in a lower layer. For example, we can assume that there is a latent topic “baseball” in the category “sports,” and it is further reasonable to assume subtopics such as “Major League Baseball,” “Japanese professional baseball,” “high school baseball,” “university baseball,” etc., within the latent topic “baseball.” Because a single Polya distribution is assigned to each category, the conventional method does not take account of the variation in word frequency caused by unobserved latent topics.
In this section, we explain our proposed method. Let T_{k} = {t_{k1}, t_{k2}, …, t_{kS}} be the set of S latent topics in category c_{k}. By letting a single Polya distribution correspond to each latent topic t_{ks}, the proposed method models the generative probabilities of the documents in a category with a mixture of Polya distributions. The difference between the proposed and conventional models is shown in Figure 1.
In Arakawa's conventional model, the categories of the training dataset are observed and used for learning the model parameters, while the categories of new input data are not observed but are to be predicted; the categories are therefore treated as latent variables and estimated by the EM algorithm. In contrast, the proposed model assumes latent topics in each category, and these plural latent topics cannot be observed even in the training data.
4.1. Estimating Parameters
In the proposed method, the log likelihood function for both the pre-labeled training data and the unlabeled classification object data is given by the following equation:

$$L\left(\lambda,\,\pi,\,\alpha\right)={\displaystyle \sum _{n=1}^{{N}_{L}}{\displaystyle \sum _{k=1}^{K}{\sigma}_{nk}\log\left[{\lambda}_{k}{\displaystyle \sum _{s=1}^{S}{\pi}_{ks}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{ks}\right)}\right]}}+{\displaystyle \sum _{n={N}_{L}+1}^{{N}_{L}+{N}_{T}}\log{\displaystyle \sum _{m=1}^{M}{\lambda}_{m}{\displaystyle \sum _{s=1}^{S}{\pi}_{ms}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{ms}\right)}}},\tag{12}$$

where π_{k} = (π_{k1}, π_{k2}, …, π_{kS}) is the mixture ratio of the Polya distributions corresponding to the latent topics. Locally optimal values of the parameters λ_{m}, π_{ms}, and α_{msv} can be estimated with the EM algorithm. Here, we denote by z_{nm} the new latent variable representing the latent topic within the mth category from which the nth document x_{n} is generated, that is, z_{nm} ∈ T_{m}. The E-step of the proposed method therefore calculates the posterior probabilities of two types of latent variables, z_{nm} and z_{n}.
In the E-step, letting P_{nms} be the posterior probability that the latent variable z_{nm} corresponds to the sth component of the mth mixed Polya distribution, the updating formula for P_{nms} is obtained. After P_{nms} is calculated, letting P_{nm} be the posterior probability that the latent variable z_{n} corresponds to the mth mixed Polya distribution, the updating formula for P_{nm} is also obtained. In the M-step, the parameters λ_{m}, π_{ms}, and α_{msv} are updated so as to maximize the log likelihood function. The steps of the EM algorithm are formulated as follows. First, the E-step is:

[E-step]

$${P}_{nms}=\frac{{\pi}_{ms}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{ms}\right)}{{\displaystyle \sum _{{s}^{\prime}=1}^{S}{\pi}_{m{s}^{\prime}}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{m{s}^{\prime}}\right)}},\tag{13}$$

$${P}_{nm}=\frac{{\lambda}_{m}{\displaystyle \sum _{s=1}^{S}{\pi}_{ms}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{ms}\right)}}{{\displaystyle \sum _{{m}^{\prime}=1}^{M}{\lambda}_{{m}^{\prime}}{\displaystyle \sum _{s=1}^{S}{\pi}_{{m}^{\prime}s}{P}_{\text{polya}}\left({x}_{n};\,{\alpha}_{{m}^{\prime}s}\right)}}},\tag{14}$$

where λ_{m}, π_{ms}, and α_{ms} are the current parameters in the EM algorithm, and P_{nm} is fixed to the indicator σ_{nm} for the labeled data (n = 1, …, N_{L}). Then, the M-step is:

[M-step]

$${\lambda}_{m}=\frac{{\displaystyle \sum _{n=1}^{{N}_{L}}{\sigma}_{nm}}+{\displaystyle \sum _{n={N}_{L}+1}^{{N}_{L}+{N}_{T}}{P}_{nm}}}{{N}_{L}+{N}_{T}},\tag{15}$$

$${\pi}_{ms}=\frac{{\displaystyle \sum _{n}{P}_{nm}{P}_{nms}}}{{\displaystyle \sum _{n}{P}_{nm}}},\tag{16}$$

$${\alpha}_{msv}\leftarrow {\alpha}_{msv}\,\frac{{\displaystyle \sum _{n}{P}_{nm}{P}_{nms}{\beta}_{nmsv}}}{{\displaystyle \sum _{n}{P}_{nm}{P}_{nms}{\gamma}_{nms}}},\tag{17}$$

where β_{nmsv} and γ_{nms} are given by

$${\beta}_{nmsv}=\Psi\left({x}_{nv}+{\alpha}_{msv}\right)-\Psi\left({\alpha}_{msv}\right),\tag{18}$$

$${\gamma}_{nms}=\Psi\left({x}_{n}+{\alpha}_{ms}\right)-\Psi\left({\alpha}_{ms}\right),\tag{19}$$

and Ψ(·) denotes the digamma function.
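The two posteriors of equations (13) and (14) can be sketched for a single document as follows, reusing the Dirichlet-multinomial density from Section 2.2. The function names, toy shapes, and parameter values are our own illustrative assumptions.

```python
import math

def log_polya(x, alpha):
    """Log Dirichlet-multinomial (Polya) probability of count vector x."""
    n, a = sum(x), sum(alpha)
    out = math.lgamma(n + 1) - sum(math.lgamma(xv + 1) for xv in x)
    out += math.lgamma(a) - math.lgamma(a + n)
    out += sum(math.lgamma(av + xv) - math.lgamma(av)
               for xv, av in zip(x, alpha))
    return out

def e_step_nested(x, lam, pis, alphas):
    """Posteriors for one document under the proposed model.
    P_nms: topic posterior within component m (equation (13)).
    P_nm:  component posterior, where each component is itself a
    mixture over S topic-level Polya distributions (equation (14))."""
    M, S = len(lam), len(pis[0])
    log_comp, P_nms = [], []
    for m in range(M):
        logs = [math.log(pis[m][s]) + log_polya(x, alphas[m][s])
                for s in range(S)]
        mx = max(logs)
        w = [math.exp(v - mx) for v in logs]
        Z = sum(w)
        P_nms.append([wi / Z for wi in w])       # normalize within component m
        log_comp.append(mx + math.log(Z))        # log sum_s pi_ms P_polya
    logs = [math.log(lam[m]) + log_comp[m] for m in range(M)]
    mx = max(logs)
    w = [math.exp(v - mx) for v in logs]
    Z = sum(w)
    P_nm = [wi / Z for wi in w]
    return P_nm, P_nms
```

Both outputs are proper probability distributions: P_nm over the M components, and each row of P_nms over the S topics of one component.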
4.2. Classification
As in Arakawa's conventional method, the component posterior probability $P\left({z}_{n}\mid {x}_{n};\,\overline{\lambda},\,\overline{\pi},\,\overline{\alpha}\right)$ of the nth document x_{n} is given by the following equation:

$$P\left({z}_{n}={c}_{m}\mid {x}_{n};\,\overline{\lambda},\,\overline{\pi},\,\overline{\alpha}\right)=\frac{{\overline{\lambda}}_{m}{\displaystyle \sum _{s=1}^{S}{\overline{\pi}}_{ms}{P}_{\text{polya}}\left({x}_{n};\,{\overline{\alpha}}_{ms}\right)}}{{\displaystyle \sum _{{m}^{\prime}=1}^{M}{\overline{\lambda}}_{{m}^{\prime}}{\displaystyle \sum _{s=1}^{S}{\overline{\pi}}_{{m}^{\prime}s}{P}_{\text{polya}}\left({x}_{n};\,{\overline{\alpha}}_{{m}^{\prime}s}\right)}}}.\tag{20}$$

Using the component posteriors given by equation (20), the assignment probability to each category is calculated as in equation (10), and the appropriate category ${\widehat{y}}_{n}$ is then obtained by equation (11).
5. EXPERIMENTS
In this section, we show that the proposed method outperforms the conventional method in terms of classification accuracy through simulation experiments using a set of English newspaper articles.
5.1. Experimental Condition
In the experiments, we used the 20 Newsgroups dataset (Lang, 1995), consisting of approximately 20,000 documents in the 20 categories shown in Table 1. In preprocessing this dataset, we randomly selected 600 documents from each category and removed words appearing fewer than 5 times as unnecessary words. As a result, the selected 12,000 documents comprised 22,993 unique words. We conducted three different experiments by changing the number of observed categories K (K = 19, 15, and 10), where K is the number of categories observed in the training dataset. Because the total number of categories is 20, the number of unobserved categories in the training dataset is given by 20 − K. From each observed category, we randomly selected 300 documents as training data and let the rest be test data. From each unobserved category, we randomly selected 300 documents as test data. We repeated this selection ten times, and the experimental results are given as averages over the ten runs. We conducted the classification experiments while changing the number of latent topics S from 1 to 5, where the case S = 1 makes the proposed method identical to Arakawa's conventional one. Moreover, changing the number of mixture components from M = K + 1 to M = K + 10, we evaluated the classification accuracy for both observed and unobserved categories, where the classification accuracy is calculated by the following equation:

$$\text{Accuracy}=\frac{\text{the number of correctly classified test documents}}{\text{the total number of test documents}}.\tag{21}$$
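The accuracy measure of equation (21) amounts to the following one-liner (the function name is ours):

```python
def accuracy(y_true, y_pred):
    """Equation (21): the fraction of test documents whose estimated
    category matches the correct one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In the experiments this is evaluated separately over the observed-category test documents, the unobserved-category test documents, and all test documents together.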
5.2. Results of Experiments
5.2.1. Experiment 1: The Case of K = 19
In the first experiment, we select one category at random from the 20 categories as the unobserved one in the training dataset and let the rest be observed. The results of Experiment 1 are shown in Tables 2 through 4.
Tables 2 and 3 clarify the trade-off in classification accuracy between observed and unobserved categories. Table 3 shows that the proposed method drastically outperforms the conventional one in classification accuracy for the unobserved categories: by modeling latent topics, the proposed method can express the unobserved categories accurately. On the other hand, the classification accuracy of the proposed method for the observed categories is lower than that of the conventional one.
Regarding the observed categories, because the number of test documents with observed categories is not very large, the proposed method overfits them. This is why the accuracy for the observed categories obtained by the proposed method decreases markedly as the number of latent topics increases. Therefore, the proposed method cannot improve the classification accuracy compared with the conventional method in this case.
5.2.2. Experiment 2: The Case of K = 15
In the second experiment, we select 5 categories at random from all 20 categories and let them be unobserved in the training dataset; the rest are observed. The results of Experiment 2 are shown in Figure 2 and Table 5.
As the proportion of test documents with unobserved categories relative to those with observed categories is larger than in Experiment 1, the proposed method achieves better accuracy over all categories. On the other hand, although the conventional method achieves high accuracy for the observed categories, it cannot express the statistical characteristics of the unobserved ones because it does not assume various latent topics within a category. Consequently, we conclude that the proposed method has better generalization ability for both observed and unobserved categories, although it slightly sacrifices accuracy for the observed ones.
5.2.3. Experiment 3: The Case of K = 10
In the third experiment, we select 10 categories at random from all 20 categories and let them be unobserved in the training dataset, with the rest observed. The results of Experiment 3 are shown in Figure 3 and Table 6.
In this experiment, the number of test documents with unobserved categories is the same as that with observed categories. While the conventional method achieves lower accuracy for the unobserved categories than for the observed ones, the proposed method shows the same or better accuracy for the unobserved categories. As a result, the proposed method outperforms the conventional method in accuracy over all categories, and thus has better generalization ability for both observed and unobserved categories.
5.3. Discussion
First, we discuss the relationship between the classification accuracy and the number of observed categories K. In the simulation experiments, the total accuracy of the proposed method is composed of the accuracies for both the observed and the unobserved categories. These two accuracies depend on the number of observed categories K and on the sizes of the training and test data determined by K.
In the proposed method, under the framework of semi-supervised learning, the model is learned not only from the training data but also from the test data. For example, when K = 19, the characteristics of the observed categories are learned from 300 × 19 = 5,700 training documents and 300 × 19 = 5,700 test documents. On the other hand, the characteristics of “a set of unobserved categories” are learned from only 300 test documents, so accurate classification is more difficult than for the observed categories (as can be seen from Table 3).
Moreover, let us discuss the relationship between the classification accuracy and the number of latent topics S. As discussed above, the classification accuracy is mainly determined by two factors:

1) the number of mixed distributions

2) the proportion of the sizes of the training and test datasets
Regarding the number of mixed distributions, the expressive capabilities of our model are the same for both observed and unobserved categories when K = 19 and M = 20. However, the classification accuracy also depends on the numbers of training and test data.
In the case of S = 1, K = 19, and M = 20, where the setting S = 1 corresponds to the conventional model, the numbers of training and test data for the observed categories are the same, 300 × 19 = 5,700. However, the number of training data for the unobserved category is 0, while that of test data is 300. Therefore, accurate classification of the test data of the unobserved category is difficult, even though accurate classification of the observed categories is possible.
If the number of latent topics S increases, the number of distributions attached to each category also increases. Because the expressive capability for the unobserved categories is enhanced, the classification accuracy can be improved (see the row with M = 20 in Table 3). The expressive capability for the observed categories is also enhanced. However, sufficient learning is already achieved at S = 1 for the observed categories, so a larger S causes overfitting to the training data of the observed categories, which decreases the classification accuracy. This explains why the classification accuracy for the observed categories worsens as S increases (see the row with M = 20 in Table 2).
Finally, we discuss how to determine the optimal values of the parameters M and S. From the experimental results, a trade-off in classification accuracy between observed and unobserved categories can be seen: by increasing the parameter M, we lose classification accuracy for the observed categories in return for improving that for the unobserved ones. The same behavior can be seen with regard to the parameter S. If the number of latent topics S increases, the classification accuracy for the observed categories decreases because of overfitting, while that for the unobserved categories improves because of the increased expressive capability. This trade-off appears in every experiment. However, compared with the parameter M, changing the parameter S causes a much larger variation in classification accuracy, so the parameter S is more effective for controlling the classification accuracy. Moreover, a preliminary experiment with a small amount of document data showed that if S is set larger than its optimal value, the proposed model overfits and the classification accuracy cannot be improved further. In conclusion, because the document data are divided into M × S parts in the proposed model, the statistics become insufficient when S is extremely large.
Therefore, in practice, it is preferable to first obtain the optimal value of S empirically, adjusting it according to the data size and the stochastic characteristics of the document data, and then to perform fine adjustment of the parameter M.
Additionally, in order to compare the proposed method with the conventional method for larger M, we conducted additional experiments with the conventional method for every setting. The additional results of the conventional method for K = 19, 15, and 10 are shown in Figures 4, 5, and 6, respectively. In these experiments, the number of mixed Polya distributions is set as M = K + 1, K + … Each dot in these figures represents the result for one setting of M.
Figures 4 through 6 show that the accuracy of the conventional method for the unobserved categories grows higher as the number of observed categories decreases. However, Figure 6 shows that even in the K = 10 situation, where the conventional method is most competitive with the proposed method, it cannot improve the accuracy for the unobserved categories beyond that of the proposed method. Moreover, the conventional method shows a strong trade-off in accuracy between observed and unobserved categories. When M is as large as 100, the conventional method shows partly high classification accuracy for the unobserved categories, while showing the same level of performance for the observed categories as the proposed method.
If we set M extremely large (M ≫ 100), the conventional method would show high performance for the unobserved categories, whereas it would sacrifice accuracy for the observed categories. In addition to this drawback, when M ≫ 100, the number of topics assumed for the unobserved categories is excessive, and the amount of calculation also grows considerably. It is therefore not preferable to set M larger than 100 for the conventional method.
The above discussion shows the effectiveness of the proposed method, which achieves stable performance for both kinds of categories with comparatively small S and M.
6. CONCLUSION AND FUTURE WORK
In this paper, we proposed a new classification method for documents with unobserved categories. While a single Polya distribution is assigned to each category in the conventional method, the proposed method assumes that a single Polya distribution represents each latent topic within a category. Through the classification experiments, we showed that the proposed method outperforms the conventional method in classification accuracy, mainly for the unobserved categories, and thus has better generalization ability than the conventional one.
On the other hand, the optimal number of latent topics in a category can differ depending on the stochastic characteristics of the document data. In the proposed method, we cannot set a different optimal number of latent topics for each category. Investigating the effectiveness of setting different numbers of latent topics per category is left for future work. For further improvement, it could be effective to construct a model that automatically determines the optimal number of latent topics in each category, for example, with the Dirichlet process (Teh et al., 2004) or infinite Dirichlet mixture models (Mochihashi and Kikui, 2006).