ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.16 No.2 pp.165-174
DOI : https://doi.org/10.7232/iems.2017.16.2.165

# A Proposal for Classification of Document Data with Unobserved Categories Considering Latent Topics

Yusei Yamamoto*, Kenta Mikawa, Masayuki Goto
Graduate School of Creative Science and Engineering Waseda University, Tokyo, JAPAN
Department of Information Science, Shonan Institute of Technology, Kanagawa, JAPAN
Department of Industrial and Management Systems Engineering Waseda University, Tokyo, JAPAN
Corresponding Author, yusei0809@ruri.waseda.jp
Received April 14, 2016; revised February 21, 2017; accepted March 21, 2017

## ABSTRACT

With the rapid development of the information society, automatic document classification by machine learning has become increasingly important. In document classification, it is usually assumed that a new input document can be classified into one of the categories observed in the training data. Therefore, if a new input document belongs to an unobserved category that does not exist in the training data, it cannot be classified correctly. To solve this problem, Arakawa et al. proposed a method that models the generative probabilities of documents with a mixture of Polya distributions and estimates the optimal category among all observed and unobserved categories, under the assumption that the documents in each category are generated from a single Polya distribution. However, the statistical characteristics of document categories are generally more complicated, and various latent topics underlie a category. Because the conventional approach models each category with a single Polya distribution, it cannot represent the variation of word frequency that depends on plural unobserved latent topics. This paper proposes a new model that assumes a mixture of Polya distributions for the generative probabilities of the documents in a category in order to represent plural latent topics. To verify the effectiveness of the proposed method, we conduct simulation experiments on document classification using a set of English newspaper articles.

## 1. INTRODUCTION

With the rapid development of the information society, technology for automatic document classification has become increasingly important. Document classification is now an important tool of business analytics, even in the field of industrial management. In document classification, machine learning techniques are usually used to construct an appropriate classifier. After a classification rule is learned from pre-labeled training data (supervised data), a new input document can be classified into an appropriate category according to its content. In the usual setting, it is assumed that a new input document can be classified into one of the categories observed in the training data (Ishida, 2006). However, if a new document belonging to an unobserved category that does not exist in the training data is input to the learned classifier, it cannot be classified correctly.

As methods that can discriminate whether or not a document belongs to an unobserved category, reject rules in classification have been studied (Chow, 1970). This approach first requires setting a threshold on some measurement and then detects the data exceeding the threshold as outliers. Therefore, the threshold should be determined according to the number of data to be detected. However, it can be difficult to determine an optimal threshold in practice.

To solve the above problem, Arakawa et al. (2013) proposed a method that models the generative probabilities of documents with a mixture of Polya distributions and estimates the optimal category among all observed and unobserved categories, where it is assumed that the documents in each category are generated from a single Polya distribution (Arakawa et al., 2013). This method is based on the idea of partially supervised classification (Amini and Usunier, 2015; Baluja, 1998; Bishop, 2010; Castelli and Cover, 1996; Liu et al., 2002; Miller and Uyar, 1997; Nigam et al., 2000; Seeger, 2001; Szummer, 2002; Szummer and Jaakkola, 2002a, 2002b). By introducing the idea of partially supervised classification (Nigam et al., 2000), the problem of unobserved categories was solved in Arakawa's method. Here, a mixture of Polya distributions is the model derived by setting Dirichlet distributions (Minka, 2003) as priors on the unigram mixture, which is based on the multinomial distribution (Masada et al., 2007; Sadamitsu et al., 2005; Ushiama et al., 2010). Unlike multi-topic models such as LDA (Latent Dirichlet Allocation), which represent plural latent topics of a document by assuming that each word in the document has its own topic (Blei et al., 2003), a mixture of Polya distributions is known as a single-topic model, which assumes that each document has a single topic. This single-topic model can assign categories to unlabeled object data more explicitly. Therefore, the single-topic model can be better suited for detecting unobserved categories.

However, the statistical characteristics of document categories are generally more complicated in practice than the situation assumed in Arakawa's conventional method. For example, the documents assigned to the category "sports" can be regarded as a mixture of various underlying latent topics such as "baseball," "soccer," "tennis," and so on. In Arakawa's method, each category is modeled by a single Polya distribution, so the method cannot represent the variation of word frequency that depends on unobserved latent topics. In other words, Arakawa's conventional method cannot take plural latent topics in a category into account.

In this paper, we propose a method that models the generative probabilities of the documents in a category with a mixture of Polya distributions (the mixed Polya distribution) by letting a single Polya distribution correspond to each latent topic. We show that the proposed method outperforms Arakawa's conventional one in classification accuracy through classification experiments using a set of English newspaper articles.

## 2. PRELIMINARIES

In this section, we first give the definitions used in the document classification problem, and then review the previous study on the mixture of Polya distributions, in which a single Polya distribution is assumed for the generative probability of the documents in a category.

### 2.1. Document Classification

First, we denote by $D = \{(x_n, y_n)\}_{n=1}^{N}$ the training dataset for learning and by $C = \{c_1, c_2, \ldots, c_K\}$ a set of categories, where $y_n \in C$. Here, $N$ is the number of training data, and $K$ is the number of categories. The training data are labeled data used to build a classification model in a supervised manner. The statistical characteristics of each document $x_n$ are expressed by bag-of-words. Letting $W = \{w_1, w_2, \ldots, w_V\}$ be the set of words appearing in all training documents, the $n$-th document is represented by the discrete vector $x_n = (x_{n1}, x_{n2}, \ldots, x_{nV})$, where $w_v$ is the $v$-th word in the word set $W$ and $x_{nv}$ is the frequency of the $v$-th word in the $n$-th document. In the document classification problem, after a set of training data is given, a new unlabeled input document $x_{n+1}$ is classified into a category in $C$. More precisely, an estimate of the appropriate category of the new input, $\hat{y}_{n+1}$, is determined by the posterior probability as follows:

$\hat{y}_{n+1} = \arg\max_{y_{n+1} \in C} P(y_{n+1} \,|\, x_{n+1})$
(1)

However, it is usually assumed that a new input document $x_{n+1}$ can be classified into one of the observed categories that exist in the training data, $C = \{c_1, c_2, \ldots, c_K\}$. Consequently, if a new input document $x_{n+1}$ belongs to an unobserved category $c_{K+1}$, which does not exist in the training data, it cannot be classified correctly. Arakawa et al. (2013) proposed a method to solve this problem. We describe Arakawa's method as the conventional one in Section 3.
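The bag-of-words representation described in Section 2.1 can be sketched as follows; this is an illustrative helper of our own (the function name and the whitespace tokenization are assumptions, not part of the paper):

```python
def bag_of_words(docs):
    """Build the word set W and the count vectors x_n of Section 2.1.

    Returns (vocab, vectors), where vocab is the ordered word set
    W = {w_1, ..., w_V} and vectors[n][v] is the frequency x_nv of
    word w_v in the n-th document.
    """
    vocab = sorted({w for doc in docs for w in doc.split()})
    index = {w: v for v, w in enumerate(vocab)}
    vectors = []
    for doc in docs:
        x = [0] * len(vocab)
        for w in doc.split():
            x[index[w]] += 1
        vectors.append(x)
    return vocab, vectors
```

A real preprocessing pipeline would also lowercase, strip punctuation, and prune rare words, as described in Section 5.1.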

### 2.2. A Mixture of Polya Distributions

Under the assumption that the documents in a category are generated from a single Polya distribution, a classification method for documents based on a mixture of Polya distributions has been proposed (Yamamoto and Sadamitsu, 2005). Letting the number of mixture components be $M$, the mixture ratio be $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_M)$, and the parameter of the $m$-th Polya distribution $P_{\mathrm{polya}}(x_n; \alpha_m)$ be $\alpha_m = (\alpha_{m1}, \alpha_{m2}, \ldots, \alpha_{mV})$, the mixed Polya distribution is defined as follows:

$P(x_n; \lambda, \alpha) = \sum_{m=1}^{M} \lambda_m P_{\mathrm{polya}}(x_n; \alpha_m) = \sum_{m=1}^{M} \lambda_m \frac{\Gamma(\alpha_m)}{\Gamma(\alpha_m + x_n)} \prod_{v=1}^{V} \frac{\Gamma(\alpha_{mv} + x_{nv})}{\Gamma(\alpha_{mv})}$
(2)

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_M)$ is the parameter vector consisting of the $M$ component vectors, $\alpha_m = \sum_{v=1}^{V} \alpha_{mv}$ is the sum of the parameters $\alpha_{mv}$, and $x_n = \sum_{v=1}^{V} x_{nv}$ is the total word frequency in the $n$-th document $x_n$.
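As a concrete illustration of equation (2), the log-probability of a count vector under a single Polya (Dirichlet-multinomial) component, and of the mixture over $M$ components, can be sketched in Python. The function names are our own, and the multinomial coefficient is omitted as in equation (2):

```python
from math import lgamma, log, exp

def polya_logpmf(x, alpha):
    """log P_polya(x; alpha) as in equation (2), omitting the
    multinomial coefficient: one Gamma ratio over the whole
    document and one per word."""
    a0 = sum(alpha)   # alpha_m = sum_v alpha_mv
    n = sum(x)        # x_n = total word count of the document
    logp = lgamma(a0) - lgamma(a0 + n)
    for xv, av in zip(x, alpha):
        logp += lgamma(av + xv) - lgamma(av)
    return logp

def mixture_logpmf(x, lam, alphas):
    """log of sum_m lam_m * P_polya(x; alpha_m), computed with
    log-sum-exp for numerical stability."""
    logs = [log(lm) + polya_logpmf(x, am) for lm, am in zip(lam, alphas)]
    mx = max(logs)
    return mx + log(sum(exp(l - mx) for l in logs))
```

For example, with $V = 2$ and $\alpha = (1, 1)$, `polya_logpmf([1, 1], [1.0, 1.0])` evaluates to $\log(1/6)$, since $\Gamma(2)/\Gamma(4) = 1/6$ and the per-word factors are 1.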

## 3. CONVENTIONAL METHOD

In this section, we describe the conventional method by Arakawa et al. (2013). This method is based on the idea of partially supervised classification (Liu et al., 2002; Nigam et al., 2000). Usually, a classifier is learned in advance from the training dataset and then used to classify new input data. In the setting of partially supervised classification, the training data (labeled data) and the new input data (unlabeled data) are used together by the learning algorithm. For some specific purposes, this approach can be useful in practice. By introducing the idea of partially supervised classification, the problem of unobserved categories was solved in Arakawa's method.

The method by Arakawa et al. (2013) is based on a mixture of Polya distributions, where it is assumed that the documents in each category are generated from a single Polya distribution. In order to adapt to data with unobserved categories, they set the number of mixed Polya distributions to $M$ $(M \geq K + 1)$, where each of $K$ Polya components models one of the observed categories, while the remaining $M - K$ Polya components model "a set of unobserved categories."

### 3.1. Estimating Parameters

In Arakawa's conventional method, a classification rule is learned not only from the pre-labeled training data but also from the unlabeled classification object data, under the framework of semi-supervised learning. Usually, an unlabeled classification object datum is input to the classifier obtained by learning from the pre-labeled training data. In Arakawa's method, however, both the pre-labeled training data and the unlabeled classification object data are input for semi-supervised learning. Here, we redefine $C$ as the set of categories $\{c_1, c_2, \ldots, c_K, c_{K+1}\}$, where $c_1, c_2, \ldots, c_K$ are observed in the training data while $c_{K+1}$ is unobserved. The training data (labeled data) of size $N_L$ are denoted by $D_L = \{(x_n, y_n)\}_{n=1}^{N_L}$, where $y_n \in \{c_1, c_2, \ldots, c_K\}$. The classification object data (unlabeled data) of size $N_T$ are denoted by $D_T = \{(x_n, y_n)\}_{n=N_L+1}^{N_L+N_T}$, where $y_n \in \{c_1, c_2, \ldots, c_K, c_{K+1}\}$ and $y_n$ for $n = N_L + 1, \ldots, N_L + N_T$ have not been observed yet. Assuming that $D_L$ and $D_T$ are generated independently, the log likelihood function is given by the following equation:

$\log L(D_L, D_T; \lambda, \alpha) = \sum_{n=1}^{N_L} \log \sum_{k=1}^{K} \sigma_{nk} \lambda_k P_{\mathrm{polya}}(x_n; \alpha_k) + \sum_{n=N_L+1}^{N_L+N_T} \log \sum_{m=1}^{M} \lambda_m P_{\mathrm{polya}}(x_n; \alpha_m)$
(3)

where $\sigma_{nk}$ is an indicator that takes the value 1 if the $n$-th document $x_n$ belongs to category $c_k$ and 0 otherwise. Since the categories of the training data are observed, the first term of equation (3) can be expressed with the indicator $\sigma_{nk}$. On the other hand, the second term of equation (3) is expressed by the mixed probabilities because those categories have not been observed yet.

In order to estimate a local optimum of the parameters $\lambda_m$ and $\alpha_{mv}$, the EM algorithm is adopted (Dempster et al., 1977). The EM algorithm is an iterative method for estimating suboptimal parameters that locally maximize the above log likelihood function. It alternates the E-Step, which calculates the expectation of the log likelihood, and the M-Step, which updates the values of the parameters to maximize it. Here, we denote by the latent variable $z_n$ the latent category from which the $n$-th document $x_n$ is generated, that is, $z_n \in C$. In the EM algorithm for the estimation of the mixed Polya distribution, the E-Step reduces to calculating the expected probabilities of the latent variables.

In the E-Step, the posterior probability $P_{nm}$ of the latent variable $z_n$, which represents the $m$-th component of the mixed Polya distributions, is calculated.

In the M-Step, the parameters $\lambda_m$ and $\alpha_{mv}$ are updated in order to maximize the log likelihood function. Each step of the EM algorithm is denoted by the following equations. First, the E-Step is denoted as follows:

[E-Step]

$P_{nm} = \frac{\bar{\lambda}_m P_{\mathrm{polya}}(x_n; \bar{\alpha}_m)}{\sum_{m'=1}^{M} \bar{\lambda}_{m'} P_{\mathrm{polya}}(x_n; \bar{\alpha}_{m'})}$
(4)

where $\bar{\lambda}_m$ and $\bar{\alpha}_m$ are the current values of the parameters in the EM algorithm. Then, the M-Step is denoted as follows:

[M-Step]

$\lambda_m = \frac{1}{N_L + N_T}\left( \sum_{n=1}^{N_L} \sigma_{nm} + \sum_{n=N_L+1}^{N_L+N_T} P_{nm} \right)$
(5)

$\alpha_{mv} = \bar{\alpha}_{mv}\, \frac{\sum_{n=1}^{N_L} \sigma_{nm} \beta_{nmv} + \sum_{n=N_L+1}^{N_L+N_T} P_{nm} \beta_{nmv}}{\sum_{n=1}^{N_L} \sigma_{nm} \gamma_{nm} + \sum_{n=N_L+1}^{N_L+N_T} P_{nm} \gamma_{nm}}$
(6)

where $\beta_{nmv}$ and $\gamma_{nm}$ are given by

$\beta_{nmv} = \frac{x_{nv}}{x_{nv} - 1 + \bar{\alpha}_{mv}}$
(7)

$\gamma_{nm} = \frac{x_n}{x_n - 1 + \bar{\alpha}_m}$
(8)
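One EM iteration of equations (4)-(8) can be sketched as follows. This is a minimal illustration under our own naming and data layout (`XL`, `XT`, `sigma`, `em_step` are assumptions), not the authors' implementation:

```python
from math import lgamma, log, exp

def polya_logpmf(x, alpha):
    """log P_polya(x; alpha) of equation (2), without the multinomial coefficient."""
    a0, n = sum(alpha), sum(x)
    p = lgamma(a0) - lgamma(a0 + n)
    for xv, av in zip(x, alpha):
        p += lgamma(av + xv) - lgamma(av)
    return p

def em_step(XL, sigma, XT, lam, alphas):
    """One iteration of equations (4)-(8).

    XL: labeled count vectors; sigma[n][m]: indicator sigma_nm;
    XT: unlabeled count vectors; lam, alphas: current lambda_m, alpha_m.
    Components m = 0..K-1 model observed categories; the rest model
    'a set of unobserved categories'.
    """
    M, NL, NT = len(lam), len(XL), len(XT)
    # E-Step, equation (4): posterior P_nm for each unlabeled document
    P = []
    for x in XT:
        logs = [log(lam[m]) + polya_logpmf(x, alphas[m]) for m in range(M)]
        mx = max(logs)
        w = [exp(l - mx) for l in logs]
        Z = sum(w)
        P.append([wi / Z for wi in w])
    # M-Step, equation (5): mixing weights
    new_lam = [(sum(sigma[n][m] for n in range(NL)) +
                sum(P[n][m] for n in range(NT))) / (NL + NT)
               for m in range(M)]
    # M-Step, equations (6)-(8): fixed-point update of each alpha_mv
    new_alphas = []
    for m in range(M):
        am = sum(alphas[m])
        num, den = [0.0] * len(alphas[m]), 0.0
        for weights, X in ((sigma, XL), (P, XT)):
            for n, x in enumerate(X):
                wnm = weights[n][m]
                xn = sum(x)
                den += wnm * xn / (xn - 1 + am)           # gamma_nm, eq (8)
                for v, xv in enumerate(x):
                    if xv > 0:                            # beta_nmv, eq (7)
                        num[v] += wnm * xv / (xv - 1 + alphas[m][v])
        new_alphas.append([alphas[m][v] * num[v] / den
                           for v in range(len(alphas[m]))])
    return new_lam, new_alphas, P
```

In practice the iteration is repeated until the log likelihood of equation (3) stops increasing.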

### 3.2. Classification

When the EM algorithm has converged, the posterior probability of each test datum can be calculated. The posterior probability $P(z_n \,|\, x_n; \bar{\lambda}, \bar{\alpha})$ of each category $c_k$ for the $n$-th document $x_n$ is given by the following equation:

$P(z_n = c_k \,|\, x_n; \bar{\lambda}, \bar{\alpha}) = \frac{P(z_n = c_k)\, P(x_n \,|\, z_n = c_k; \bar{\alpha}_k)}{\sum_{m=1}^{M} P(z_n = c_m)\, P(x_n \,|\, z_n = c_m; \bar{\alpha}_m)}$
(9)

where $\bar{\lambda}_m$ and $\bar{\alpha}_m$ are the converged parameters of the EM algorithm. The first to the $K$-th posterior probabilities designate the assignment probabilities to the observed categories $c_1, c_2, \ldots, c_K$, which appear in the training dataset. On the other hand, the sum of the $(K+1)$-th to the $M$-th posterior probabilities designates the assignment probability to "a set of unobserved categories," $c_{K+1}$. Namely, the assignment probability to each category is given as follows:

$P(y_n = c_k \,|\, x_n) = \begin{cases} P(z_n = c_k \,|\, x_n) & k = 1, \ldots, K \\ \sum_{m=K+1}^{M} P(z_n = c_m \,|\, x_n) & k = K+1 \end{cases}$
(10)

Lastly, each unlabeled test datum is classified into the category $\hat{y}_n$ that maximizes equation (10), that is,

$\hat{y}_n = \arg\max_{y_n \in C} P(y_n \,|\, x_n)$
(11)
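Equations (10) and (11) amount to pooling the posterior mass of the components beyond the $K$-th into one "unobserved" category before taking the arg max. A small sketch (the function name is our own):

```python
def classify(posterior, K):
    """Apply equations (10)-(11) to one document.

    posterior: (P(z_n = c_1 | x_n), ..., P(z_n = c_M | x_n)) from eq. (9).
    The first K entries map to the observed categories; the mass of
    the remaining M - K components is pooled into c_{K+1}.
    Returns the index of the estimated category (index K means the
    'unobserved' category).
    """
    probs = list(posterior[:K]) + [sum(posterior[K:])]
    return max(range(K + 1), key=lambda k: probs[k])
```

For instance, with $K = 2$ and component posteriors $(0.2, 0.1, 0.3, 0.4)$, the pooled unobserved mass is $0.7$, so the document is rejected to $c_{K+1}$.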

## 4. PROPOSED METHOD

In Arakawa's conventional method, a single Polya distribution models the generative probability of the documents in each category. However, the statistical characteristics of document categories are generally more complicated than the situation assumed in Arakawa's conventional study. For example, the documents assigned to the "sports" category can be regarded as a mixture of various underlying latent topics such as "baseball," "soccer," "tennis," and so on. Moreover, the structure of document categories can be assumed to be hierarchical. More specifically, a category with broad information exists in the upper layer, while various latent topics split under that category in the lower layer. For example, we can assume that there is a latent topic "baseball" in the category "sports." Besides, it is reasonable to assume subtopics such as "Major League baseball," "Japanese Professional baseball," "High school baseball," "University baseball," and so on within the latent topic "baseball." Because it assumes a single Polya distribution per category, the conventional method does not take into account the variation of word frequency that depends on unobserved latent topics.

In this section, we explain our proposed method. Here, let $T_k = \{t_{k1}, t_{k2}, \ldots, t_{kS}\}$ be the set of $S$ latent topics in a category $c_k$. Letting a single Polya distribution correspond to each latent topic $t_{ks}$, our proposed method models the generative probabilities of the documents in a category with a mixture of Polya distributions. The difference between the proposed and the conventional models is shown in Figure 1.

In Arakawa's conventional model, the categories of the training dataset are observed and used for learning the parameters of the model. On the other hand, the categories of new input data are not observed but can be predicted. Therefore, those categories are treated as latent variables and estimated by the EM algorithm. In contrast, the proposed model assumes latent topics in each category, and these plural latent topics cannot be observed even in the training data.

### 4.1. Estimating Parameters

In the proposed method, the log likelihood function for both the pre-labeled training data and the unlabeled classification object data is given by the following equation:

$\log L(D_L, D_T; \lambda, \pi, \alpha) = \sum_{n=1}^{N_L} \log \sum_{k=1}^{K} \sigma_{nk} \lambda_k \sum_{s=1}^{S} \pi_{ks} P_{\mathrm{polya}}(x_n; \alpha_{ks}) + \sum_{n=N_L+1}^{N_L+N_T} \log \sum_{m=1}^{M} \lambda_m \sum_{s=1}^{S} \pi_{ms} P_{\mathrm{polya}}(x_n; \alpha_{ms})$
(12)

where $\pi_k = (\pi_{k1}, \pi_{k2}, \ldots, \pi_{kS})$ is the mixture ratio of the Polya distributions corresponding to the latent topics. A local optimum of the parameters $\lambda_m$, $\pi_{ms}$, and $\alpha_{msv}$ can be estimated by the EM algorithm. Here, we denote by the new latent variable $z_{nm}$ the latent topic within the $m$-th category from which the $n$-th document $x_n$ is generated, that is, $z_{nm} \in T_m$. Therefore, the E-Step of our proposed method calculates the posterior probabilities of two types of latent variables, $z_{nm}$ and $z_n$.

In the E-Step, the posterior probability $P_{nms}$ of the latent variable $z_{nm}$, which represents the $s$-th component of the $m$-th mixed Polya distribution, is first obtained. After $P_{nms}$ is calculated, the posterior probability $P_{nm}$ of the latent variable $z_n$, which represents the $m$-th mixed Polya distribution, is also obtained.

In the M-Step, the parameters $\lambda_m$, $\pi_{ms}$, and $\alpha_{msv}$ are updated in order to maximize the log likelihood function. Each step of the EM algorithm is formulated in the following equations. First, the E-Step is denoted as follows:

[E-Step]

$P_{nms} = \frac{\bar{\pi}_{ms} P_{\mathrm{polya}}(x_n; \bar{\alpha}_{ms})}{\sum_{s'=1}^{S} \bar{\pi}_{ms'} P_{\mathrm{polya}}(x_n; \bar{\alpha}_{ms'})}$
(13)

$P_{nm} = \frac{\bar{\lambda}_m \sum_{s=1}^{S} \bar{\pi}_{ms} P_{\mathrm{polya}}(x_n; \bar{\alpha}_{ms})}{\sum_{m'=1}^{M} \bar{\lambda}_{m'} \sum_{s=1}^{S} \bar{\pi}_{m's} P_{\mathrm{polya}}(x_n; \bar{\alpha}_{m's})}$
(14)

where $\bar{\lambda}_m$, $\bar{\pi}_{ms}$, and $\bar{\alpha}_{msv}$ are the current values of the parameters in the EM algorithm. Then, the M-Step is denoted by the following equations:

[M-Step]

$\lambda_m = \frac{1}{N_L + N_T}\left( \sum_{n=1}^{N_L} \sigma_{nm} + \sum_{n=N_L+1}^{N_L+N_T} P_{nm} \right)$
(15)

$\pi_{ms} = \frac{1}{\lambda_m (N_L + N_T)}\left( \sum_{n=1}^{N_L} \sigma_{nm} P_{nms} + \sum_{n=N_L+1}^{N_L+N_T} P_{nm} P_{nms} \right)$
(16)

$\alpha_{msv} = \bar{\alpha}_{msv}\, \frac{\sum_{n=1}^{N_L} \sigma_{nm} P_{nms} \beta_{nmsv} + \sum_{n=N_L+1}^{N_L+N_T} P_{nm} P_{nms} \beta_{nmsv}}{\sum_{n=1}^{N_L} \sigma_{nm} P_{nms} \gamma_{nms} + \sum_{n=N_L+1}^{N_L+N_T} P_{nm} P_{nms} \gamma_{nms}}$
(17)

where $\beta_{nmsv}$ and $\gamma_{nms}$ are given by

$\beta_{nmsv} = \frac{x_{nv}}{x_{nv} - 1 + \bar{\alpha}_{msv}}$
(18)

$\gamma_{nms} = \frac{x_n}{x_n - 1 + \bar{\alpha}_{ms}}$
(19)
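The E-Step of equations (13) and (14) for a single document can be sketched as follows. The helper names are our own assumptions; `alphas[m][s]` holds the topic parameter vector $\alpha_{ms}$ and `pi[m][s]` the topic mixing weight $\pi_{ms}$:

```python
from math import lgamma, log, exp

def polya_logpmf(x, alpha):
    """log P_polya(x; alpha) of equation (2), without the multinomial coefficient."""
    a0, n = sum(alpha), sum(x)
    p = lgamma(a0) - lgamma(a0 + n)
    for xv, av in zip(x, alpha):
        p += lgamma(av + xv) - lgamma(av)
    return p

def e_step_proposed(x, lam, pi, alphas):
    """Posteriors P_nms (eq. 13) and P_nm (eq. 14) for one document x."""
    M, S = len(lam), len(pi[0])
    # per-topic log terms: log(pi_ms) + log P_polya(x; alpha_ms)
    lp = [[log(pi[m][s]) + polya_logpmf(x, alphas[m][s]) for s in range(S)]
          for m in range(M)]
    # equation (13): normalize over topics s within each component m
    Pms, logmix = [], []
    for m in range(M):
        mx = max(lp[m])
        w = [exp(l - mx) for l in lp[m]]
        Z = sum(w)
        Pms.append([wi / Z for wi in w])
        # log of sum_s pi_ms P_polya(x; alpha_ms), reused in eq. (14)
        logmix.append(mx + log(Z))
    # equation (14): normalize lam_m * (topic mixture) over components m
    logs = [log(lam[m]) + logmix[m] for m in range(M)]
    mx = max(logs)
    w = [exp(l - mx) for l in logs]
    Z = sum(w)
    Pm = [wi / Z for wi in w]
    return Pm, Pms
```

The M-Step then reweights each topic's Polya parameters by the products $\sigma_{nm} P_{nms}$ and $P_{nm} P_{nms}$, exactly as in equations (15)-(19).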

### 4.2. Classification

As in Arakawa's conventional method, the assignment probability $P(z_n \,|\, x_n; \bar{\lambda}, \bar{\pi}, \bar{\alpha})$ of each category for the $n$-th document $x_n$ is given by the following equation:

$P(z_n = c_k \,|\, x_n; \bar{\lambda}, \bar{\pi}, \bar{\alpha}) = \frac{P(z_n = c_k) \sum_{s=1}^{S} P(z_{nk} = t_{ks})\, P(x_n \,|\, z_n = c_k, z_{nk} = t_{ks}; \bar{\alpha}_{ks})}{\sum_{m=1}^{M} P(z_n = c_m) \sum_{s=1}^{S} P(z_{nm} = t_{ms})\, P(x_n \,|\, z_n = c_m, z_{nm} = t_{ms}; \bar{\alpha}_{ms})}$
(20)

Using the assignment probability calculated by equation (20) in place of equation (9), the appropriate category $\hat{y}_n$ can be obtained through equations (10) and (11) as designated above.

## 5. EXPERIMENTS

In this section, we show that our proposed method outperforms the conventional method in classification accuracy through simulation experiments using a set of English newspaper articles.

### 5.1. Experimental Condition

In the experiments, we used the 20 Newsgroups dataset of English documents (Lang, 1995), consisting of approximately 20,000 documents in the 20 categories shown in Table 1. In preprocessing this dataset, we randomly selected 600 documents from each category and removed words appearing fewer than 5 times as unnecessary words. As a result, the selected 12,000 documents comprised 22,993 unique words. We conducted three experiments by changing the number of observed categories, K = 19, K = 15, and K = 10, where K is the number of categories observed in the training dataset. Because the total number of categories is 20, the number of unobserved categories in the training dataset is given by 20 − K. From each observed category, we randomly selected 300 documents as the training data and let the rest be test data. From each unobserved category, we randomly selected 300 documents as test data. We repeated this selection ten times, and the results of our experiments are given as averages over the ten trials. We conducted the classification experiments while changing the number of latent topics from S = 1 to S = 5, where in the case of S = 1 our proposed method is identical to Arakawa's conventional one. Moreover, also changing the number of mixture components from M = K + 1 to M = K + 10, we evaluated the classification accuracy for both observed and unobserved categories, where the classification accuracy is calculated by the following equation:

$\mathrm{Accuracy} = \frac{\text{number of test data classified correctly}}{\text{number of test data}}$
(21)
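Equation (21) is simply the proportion of correctly labeled test documents; a minimal helper (the function name is our own):

```python
def accuracy(y_true, y_pred):
    """Equation (21): number of test data classified correctly,
    divided by the total number of test data."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

In the experiments below this value is averaged over the ten random train/test selections.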

### 5.2. Results of Experiments

#### 5.2.1. Experiment 1: The Case of K = 19

In the first experiment, we selected one category at random from the 20 categories as the unobserved one in the training dataset and let the rest be observed. The results of experiment 1 are shown in Table 2 to Table 4.

From Table 2 and Table 3, we can see the trade-off in classification accuracy between observed and unobserved categories. Table 3 shows that our proposed method drastically outperforms the conventional one in classification accuracy for the unobserved categories. Our proposed method can express the unobserved categories accurately by modeling latent topics. On the other hand, the classification accuracy of our proposed method for the observed categories is lower than that of the conventional one.

Regarding the observed categories, because the number of test documents with observed categories is not very large, our proposed method overfits them. That is why the accuracy for the observed categories of our proposed method markedly decreases as the number of latent topics increases. Therefore, our proposed method cannot improve the classification accuracy compared with the conventional method in this case.

#### 5.2.2. Experiment 2: The Case of K = 15

In the second experiment, we selected 5 categories at random from all 20 categories and let them be unobserved in the training dataset, letting the rest be observed. The results of experiment 2 are shown in Figure 2 and Table 5.

Since the proportion of test documents with unobserved categories to those with observed ones is larger than in experiment 1, our proposed method achieves better accuracy over all categories. On the other hand, although the conventional method achieves high accuracy for the observed categories, it cannot express the statistical characteristics of the unobserved ones because various latent topics are not assumed within a category. Consequently, we conclude that our proposed method has better generalization ability for both observed and unobserved categories, although it slightly sacrifices the accuracy for the observed ones.

#### 5.2.3. Experiment 3: The Case of K = 10

In the third experiment, we selected 10 categories at random from all 20 categories, let them be unobserved in the training dataset, and let the rest be observed. The results of experiment 3 are shown in Figure 3 and Table 6.

In this experiment, the number of test documents with unobserved categories is the same as that with observed ones. While the conventional method achieves lower accuracy for the unobserved categories than for the observed ones, our proposed method shows the same or better accuracy for the unobserved categories. As a result, the proposed method outperforms the conventional method in accuracy over all categories, so it has better generalization ability for both observed and unobserved categories.

### 5.3. Discussion

First, we discuss the relationship between the classification accuracy and the number of observed categories K. In the simulation experiments, the total accuracy of our proposed method is determined by the accuracies for both observed and unobserved categories. These two accuracies depend on the number of observed categories K and on the sizes of the training and test data, which are determined by K.

In the proposed method, under the framework of semi-supervised learning, the model is learned not only from the training data but also from the test data. For example, when K = 19, the characteristics of the observed categories are learned from 300 × 19 = 5,700 training documents and 300 × 19 = 5,700 test documents. On the other hand, the characteristics of "a set of unobserved categories" are learned from only 300 test documents, so accurate classification is more difficult than for the observed categories (this can be seen in Table 3).

Next, let us discuss the relationship between the classification accuracy and the number of latent topics S. As we discussed above, the classification accuracy is mainly determined by two factors:

• 1) the number of mixed distributions

• 2) the proportion of sizes between training data set and test data set

Regarding the number of mixed distributions, the expressive capability of our model is the same for both observed and unobserved categories when K = 19 and M = 20. However, the classification accuracy also depends on the numbers of training and test data.

In the case of S = 1, K = 19, and M = 20, where the setting S = 1 corresponds to the conventional model, the numbers of training and test data for the observed categories are the same, 300 × 19 = 5,700. However, the number of training data for the unobserved categories is 0, while that of the test data is 300. Therefore, accurate classification of the test data of the unobserved category is difficult, although accurate classification of the observed categories is possible.

If the number of latent topics S increases, the number of distributions attached to each category also increases. Because the expressive capability for the unobserved categories is enhanced, the classification accuracy can be improved (see the row with M = 20 in Table 3). For the observed categories, the expressive capability is also enhanced. However, sufficient learning is already achieved at S = 1 for the observed categories. Therefore, a larger S causes overfitting to the training data of the observed categories, which decreases the classification accuracy. This would be the reason why the classification accuracy for the observed categories worsens as S increases (see the row with M = 20 in Table 2).

Finally, we discuss how to determine the optimal values of the parameters M and S. From the experimental results, a trade-off in classification accuracy between observed and unobserved categories can be seen: we lose classification accuracy for the observed categories by increasing the parameter M in return for improving that for the unobserved ones. With regard to the parameter S, the same behavior of the classification accuracy can be seen for both observed and unobserved categories. If the number of latent topics S increases, the classification accuracy for the observed categories decreases because of overfitting, while that for the unobserved categories improves because of the increased expressive capability. This trade-off can be seen in every experimental result. However, in comparison with the parameter M, the variation of the classification accuracy caused by changing the parameter S is much larger. Therefore, the parameter S is more effective for controlling the classification accuracy. Besides, a preliminary experiment showed that, with a small amount of document data, if the parameter S is set larger than the optimal value, our proposed model overfits, and the classification accuracy cannot be improved further. In conclusion, because the document data are divided into M × S parts in our proposed model, the statistics are insufficient when the parameter S is extremely large.

Therefore, in practice it is preferable to first obtain the optimal value of the parameter S empirically by adjusting it according to the data size and the stochastic characteristics of the document data, and then to perform fine adjustment of the parameter M.

Additionally, in order to compare the proposed method with the conventional method for larger M, we conducted additional experiments with the conventional method and show the results for every experiment. The additional results of the conventional method for the cases K = 19, 15, and 10 are shown in Figures 4, 5, and 6. In these experiments, the numbers of mixed Polya distributions are set as M = K + 1, K + … Each dot in these figures represents a result for one setting of M.

From Figures 4 to 6, it can be seen that the accuracy of the conventional method for the unobserved categories grows higher as the number of observed categories decreases. However, Figure 6 shows that, even in the situation K = 10, where the conventional method shows its most competitive result, it cannot improve the accuracy for the unobserved categories beyond that of the proposed method. Moreover, the conventional method shows a strong trade-off in accuracy between observed and unobserved categories. Therefore, when M is as large as 100, the conventional method shows partly high classification accuracy for the unobserved categories, while it shows the same level of performance for the observed categories as the proposed method.

If we set M extremely large, as in M ≫ 100, the conventional method would show high performance for the unobserved categories, whereas it would sacrifice the accuracy for the observed ones. In addition to this drawback, when M ≫ 100, the number of topics assigned to the unobserved categories is excessive, so the computational cost would also grow substantially. It is therefore not preferable to set M larger than 100 for the conventional method.

From the above discussion, the effectiveness of the proposed method, which shows stable performance for both kinds of categories with comparatively small S and M, is demonstrated.

## 6. CONCLUSION AND FUTURE WORK

In this paper, we proposed a new classification method for documents with unobserved categories. While a single Polya distribution is assigned to each category in the conventional method, the proposed method assumes that a single Polya distribution represents each latent topic in a category. Through the classification experiments, we showed that our proposed method outperforms the conventional method in classification accuracy, mainly for the unobserved categories, so our proposal has better generalization ability than the conventional one.

On the other hand, the optimal number of latent topics in a category can differ depending on the stochastic characteristics of the document data. In the proposed method, we cannot set a different optimal number of latent topics for each category. It is future work to investigate the effectiveness of setting different optimal numbers of latent topics depending on each category. For further improvement, it could be effective to construct a model that automatically determines the optimal number of latent topics in the categories, for example, with the Dirichlet process (Teh et al., 2004) or infinite Dirichlet mixture models (Mochihashi and Kikui, 2006).

## ACKNOWLEDGEMENT

The authors would like to thank the anonymous referees for their insightful comments. The authors would also like to thank Mr. Kumoi, Dr. Haruka Yamashita, and all members of the Goto laboratory, Waseda University, for their support of our research. This study was partly supported by JSPS KAKENHI Grant Numbers 26282090 and 26560167.

## Figure

The difference between the conventional method (left) and the proposed method (right).

The result of classification accuracy for observed and unobserved categories in experiment 2.

The result of classification accuracy for observed and unobserved categories in experiment 3.

Classification accuracy of the conventional method in K = 19.

Classification accuracy of the conventional method in K = 15.

Classification accuracy of the conventional method in K = 10.

## Table

20 Newsgroups data set

The result of classification accuracy for observed categories in experiment 1

The result of classification accuracy for unobserved categories in experiment 1

The result of classification accuracy for all categories in experiment 1

The result of classification accuracy for all categories in experiment 2

The result of classification accuracy for all categories in experiment 3

## REFERENCES

1. Amini, M. R. and Usunier, N. (2015), Learning with Partially Labeled and Interdependent Data, Springer International Publishing.
2. Arakawa, T., Mikawa, K., and Goto, M. (2013), A study on document classification method with containing unknown categories, IEICE Transactions on Information and Systems, Vol. J96-D, pp. 1956-1959.
3. Baluja, S. (1998), Probabilistic modeling for face orientation discrimination: Learning from labeled and unlabeled data, Advances in Neural Information Processing Systems, pp. 854-860.
4. Bishop, C. (2010), Pattern Recognition and Machine Learning, Springer-Verlag.
5. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003), Latent Dirichlet allocation, Journal of Machine Learning Research, Vol. 3, pp. 993-1022.
6. Castelli, V. and Cover, T. (1996), The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter, IEEE Transactions on Information Theory, Vol. 42, pp. 2102-2117.
7. Chow, C. K. (1970), On optimum recognition error and reject tradeoff, IEEE Transactions on Information Theory, Vol. 16, pp. 41-46.
8. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Vol. 39, pp. 1-38.
9. Ishida, E. (2006), An overview of text categorization, Information Science and Technology Association, Vol. 56, pp. 469-474.
10. Lang, K. (1995), NewsWeeder: Learning to filter netnews, Proceedings of the 12th International Machine Learning Conference (ICML), pp. 331-339.
11. Li, J., Liu, C., and Liu, B. (2013), Large scale sequential learning from partially labeled data, The 7th International Conference on Semantic Computing (IEEE ICSC 2013), pp. 176-183.
12. Liu, B., Lee, W. S., Yu, P. S., and Li, X. (2002), Partially supervised classification of text documents, Proceedings of the Nineteenth International Conference on Machine Learning (ICML '02), pp. 387-394.
13. Masada, T., Takasu, A., and Adachi, J. (2007), Accuracy of document classification with Dirichlet mixtures, IPSJ Journal, Vol. 48, pp. 14-26.
14. Miller, D. J. and Uyar, H. S. (1997), A mixture of experts classifier with learning based on both labeled and unlabelled data, Advances in Neural Information Processing Systems, pp. 571-577.
15. Minka, T. (2003), Estimating a Dirichlet Distribution, Available from: http://research.microsoft.com/~minka/papers/dirichlet/.
16. Mochihashi, D. and Kikui, G. (2006), Infinite Dirichlet mixtures in text modeling, Technical Report of IPSJ Natural Language Processing, Vol. 36 (2006-NL-172), pp. 47-53.
17. Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. (2000), Text classification from labeled and unlabeled documents using EM, Machine Learning, Vol. 39(2), pp. 103-134.
18. Sadamitsu, K., Mishina, T., and Yamamoto, M. (2005), Topic-based language models using Dirichlet mixtures, IEICE Transactions on Information and Systems, Vol. J88-D-II, pp. 1771-1779.
19. Seeger, M. (2001), Learning with labeled and unlabeled data (Technical Report), University of Edinburgh.
20. Szummer, M. O. (2002), Learning from Partially Labeled Data, Doctoral dissertation, Massachusetts Institute of Technology.
21. Szummer, M. and Jaakkola, T. (2002a), Information regularization with partially labeled data, Advances in Neural Information Processing Systems, pp. 1025-1032.
22. Szummer, M. and Jaakkola, T. (2002b), Partially labeled classification with Markov random walks, Advances in Neural Information Processing Systems, pp. 945-952.
23. Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2004), Hierarchical Dirichlet processes, Journal of the American Statistical Association, Vol. 101, pp. 1566-1581.
24. Ushiama, N., Kumoi, G., Ishida, T., and Goto, M. (2010), Document classification by sub-topic matching based on Polya mixture distribution, IEICE Technical Report, Vol. 110 (IT2010 11-33), pp. 13-18.
25. Yamamoto, M. and Sadamitsu, K. (2005), Dirichlet mixtures in text modeling, Technical Report CS-TR-05-1, University of Tsukuba.