## 1.INTRODUCTION

The semiconductor industry is often characterized by several distinctive attributes, such as high product quality requirements, short product life cycles, reduced lead times, declining costs, market volatility, and increasing device complexity (Arisha and Young, 2005). The processes involved in semiconductor manufacturing are complex, costly, and lengthy, comprising hundreds of sequential steps that build the functional circuitry of the chip (Huang, 2007; Wang, 2008). Maintaining process quality and effective process control is critical to improving yield rates. A wafer is the elementary unit in semiconductor manufacturing, and several hundred integrated circuits (ICs) are fabricated simultaneously on a single wafer (Fenner *et al*., 2005). After fabrication processes such as etching and deposition, each chip undergoes a series of functional quality checks and is classified as either functional or defective. The most important step in semiconductor process monitoring is the collection of chip quality data after fabrication. Semiconductor wafer maps, which graphically illustrate the locations of defective chips on a wafer, provide valuable information for this purpose. A wafer map consists of binary codes, ‘1’ or ‘0’, which mark the locations of defective and functional chips on the wafer, respectively.

Defective chips on the wafer map can occur in a random pattern or display systematic defect patterns such as edge ring, linear scratch, zone type, and mixed shapes (Wang *et al*., 2006; Hansen *et al*., 1997). Such defect patterns contain useful information for identifying the root causes of an out-of-control process (Cunningham and McKinnon, 1998). Factors such as uneven temperature exposure during thermal annealing or chemical aging can lead to spatial clusters on the wafer map. Clusters can also result from crystalline nonuniformity, photo-mask misalignment, or particles caused by electro-mechanical vibrations. Stepper and/or probe malfunctions and sawing imperfections can lead to repetitive patterns. Material shipping and handling can also leave a scratch on the wafer map (Cunningham and McKinnon, 1998; Hansen *et al*., 1997; Hansen and Thyregod, 1998; Taam and Hamada, 1993). Defect pattern recognition from wafer maps has traditionally been performed through visual inspection with the aid of a scanning electron microscope (SEM). This leads to a heavy reliance on the knowledge, domain expertise, and judgment of quality engineers. There is therefore a strong need to develop and use automatic defect detection and classification methods that exploit the enormous amount of data arising from the various steps of semiconductor manufacturing.

Since the defect patterns on a wafer map contain important information for understanding the ongoing manufacturing processes, several methods have been developed for the automatic classification of defect patterns (Chao and Tong, 2009; Chen and Liu, 2000; Jeong *et al*., 2008; Li and Huang, 2009; Liu *et al*., 2002; Wang *et al*., 2006; Wang, 2008; Jeong *et al*., 2012; Yuan and Kuo, 2008). In general, defects on semiconductor wafers tend to be clustered rather than uniformly distributed. Recently, Jeong *et al*. (2008) proposed a spatial correlogram-based classification methodology, which combines the K-nearest neighbor classifier with the dynamic time warping (DTW) distance measure for automatic classification of defect patterns on semiconductor wafer maps. DTW defines the minimum distance between two time series by allowing a nonlinear mapping of one sequence onto the other.

However, the drawback of conventional DTW is that all points in a sequence are matched with equal weight, so outliers can distort the minimum distance. To overcome this drawback, Jeong *et al*. (2011) developed a distance measure called weighted dynamic time warping (WDTW), which penalizes the distance between points on the two sequences based on their phase difference. When defect patterns occur on wafer maps, defective chips are usually clustered at certain locations. Thus, comparing the values of two spatial correlograms at the *same lag* (or *neighboring lags*) is more meaningful for defect pattern classification. In other words, when the distance between two points on the two correlograms is calculated, their phase difference should be taken into account by penalizing points with a larger phase difference.

Therefore, the main objective of this paper is to employ a technique in data mining and optimization to classify defect patterns on wafer maps. The proposed technique is based on the support vector machines with weighted dynamic time warping kernel (SVM-WDTWK), which provides a flexible and robust matching algorithm for time series classification. We evaluate and assess the performance of the proposed approach on a wafer dataset with four types of defect patterns.

The rest of the paper is organized as follows. Section 2 introduces the basic concepts of the related methodology, namely the spatial correlogram and dynamic time warping. Section 3 presents a novel classification technique, SVM-WDTWK, for automatic classification of defect patterns on wafer maps. The experimental results are presented in Section 4. We present conclusions and some future research directions in Section 5.

## 2.RELATED METHODOLOGY

### 2.1.Measure of Spatial Dependence

Because we focus on the binary map, the wafer map data are binary, where ‘1’ indicates a defective chip and ‘0’ indicates a functional chip. Thus, the spatial dependence among chips can be measured using join-count statistics. A join is formed when two chips are located in the neighborhood of each other. The number of possible joins is given by

$$jc = \frac{1}{2}\sum_{i}\sum_{j}\delta_{ij} \tag{1}$$

where

$$\delta_{ij} = \begin{cases} 1, & (i, j) \in N \\ 0, & \text{otherwise} \end{cases}$$

and (*i*, *j*)∈*N* implies that two chips *i* and *j* are neighbors.

Let *y*_{i} represent an indicator variable: if *y*_{i} = 1, then chip *i* is defective; conversely, *y*_{i} = 0 means the chip is functional. By using *y*_{i}, three types of joins can be calculated as follows:

$$jc_{00} = \frac{1}{2}\sum_{i}\sum_{j}\delta_{ij}(1-y_i)(1-y_j), \quad jc_{01} = \frac{1}{2}\sum_{i}\sum_{j}\delta_{ij}(y_i-y_j)^2, \quad jc_{11} = \frac{1}{2}\sum_{i}\sum_{j}\delta_{ij}\,y_i y_j \tag{2}$$

where *jc*_{00} := the number of joins among neighbors that connect two functional chips, *jc*_{01} := the number of joins among neighbors that connect a functional and a defective chip, and *jc*_{11} := the number of joins among neighbors that connect two defective chips. By the definition of *jc*_{00}, *jc*_{01}, and *jc*_{11},

$$jc = jc_{00} + jc_{01} + jc_{11} \tag{3}$$
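The join counts described above can be illustrated with a minimal sketch. It assumes rook (up, down, left, right) adjacency as the neighborhood *N* and a small rectangular toy map; the function name `join_counts` and the example map are hypothetical and are not taken from the paper.

```python
def join_counts(wafer):
    """Count joins on a binary wafer map (1 = defective, 0 = functional).

    Assumption: two chips are neighbors when they share an edge (rook adjacency).
    """
    rows, cols = len(wafer), len(wafer[0])
    jc00 = jc01 = jc11 = 0
    for r in range(rows):
        for c in range(cols):
            # Visit only the right and lower neighbor so each join is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    a, b = wafer[r][c], wafer[rr][cc]
                    if a == 1 and b == 1:
                        jc11 += 1
                    elif a == 0 and b == 0:
                        jc00 += 1
                    else:
                        jc01 += 1
    return jc00, jc01, jc11


# Toy 3-by-3 map, for illustration only.
wafer = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]
jc00, jc01, jc11 = join_counts(wafer)
total_joins = jc00 + jc01 + jc11
print(jc00, jc01, jc11, total_joins)
```

Counting each join once through the right and lower neighbors is equivalent to halving a symmetric double sum over all neighbor pairs.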

Several existing techniques for identifying defect patterns on semiconductor wafer maps use spatial statistics (Cunningham and McKinnon, 1998; Hansen *et al*., 1997; Taam and Hamada, 1993). However, as pointed out by Hansen and Thyregod (1998), a single monitoring statistic is not sufficient to represent the wide variety of patterns that can appear across a wafer map. To overcome this drawback, Jeong *et al*. (2008) proposed a spatial correlogram, which is built from join-count statistics at multiple spatial lags, for the analysis of spatial defect patterns on semiconductor wafer maps. To create the spatial correlogram, they developed a generalized join-count-based statistic *T*(*d*) with *d*^{th}-order neighbors as follows: (4)

where *p* is the defective rate. In addition, *jc*_{00}(*d*) and *jc*_{11}(*d*) are the numbers of *d*^{th}-order neighbor joins among functional chips and among defective chips, respectively. The mean and variance of the statistic *T*(*d*) are given by (5)

and the standardized form of *T*(*d*), denoted *Z*_{T}(*d*), can be approximated by a normal distribution as follows (Jeong *et al*., 2008): (6)

where *jc(d)* = *jc*_{00}(*d*) + *jc*_{11} (*d*) + *jc*_{01} (*d*).

### 2.2.Dynamic Time Warping

Dynamic time warping (DTW), which has been popular in speech and signature recognition applications, finds an optimal match between two time series by allowing a nonlinear mapping of one sequence onto the other that minimizes the distance between them (Keogh and Ratanamahatana, 2005). DTW permits nonlinear alignments, whereas the Euclidean distance aligns points strictly one to one. Figure 1 illustrates the optimal warping path between two sequences determined by DTW.

Suppose a sequence *S* of length *m*, $S = s_1, s_2, \cdots, s_i, \cdots, s_m$, and a sequence *R* of length *n*, $R = r_1, r_2, \cdots, r_j, \cdots, r_n$. We create an *m*-by-*n* path matrix where the (*i*^{th}, *j*^{th}) element of the matrix contains the distance between the two points *s*_{i} and *r*_{j}, $d(s_i, r_j) = \Vert s_i - r_j \Vert_p$, where $\Vert \cdot \Vert_p$ represents the *l*_{p} norm. The best match between these two sequences is the path with the lowest cumulative distance aligning one sequence to the other. Therefore, the optimal warping path can be found by using the recursive formula given by

$$DTW_p(S, R) = \sqrt[p]{\gamma(m, n)} \tag{7}$$

where *γ*(*i*, *j*) is the cumulative distance described by

$$\gamma(i, j) = \left| s_i - r_j \right|^p + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\} \tag{8}$$

Thus, *DTW*_{p} can be seen as the minimization of the *l*_{p} distance under warping.

In addition, as a new distance measure for time series classification, Jeong *et al*. (2011) proposed a penalty-based DTW, called weighted dynamic time warping (WDTW), which favors nearer neighbors by penalizing points according to the phase difference between a reference point and a testing point. Because WDTW considers the relative importance of the phase difference between two points, it prevents a point in one sequence from being mapped to distant points in the other, which in turn prevents outliers from distorting the minimum distance.
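The DTW recursion described in this subsection can be sketched in a few lines. This is a minimal, unoptimized illustration of standard DTW with the *l*_{p} norm, not the authors' implementation; the function name `dtw_distance` and the toy sequences are hypothetical.

```python
def dtw_distance(s, r, p=2):
    """DTW distance between two numeric sequences under the l_p norm."""
    m, n = len(s), len(r)
    inf = float("inf")
    # gamma[i][j] is the cumulative cost of the best warping path ending at (s_i, r_j).
    gamma = [[inf] * (n + 1) for _ in range(m + 1)]
    gamma[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(s[i - 1] - r[j - 1]) ** p
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1],  # diagonal step: match both points
                gamma[i - 1][j],      # repeat r_j against the next s_i
                gamma[i][j - 1],      # repeat s_i against the next r_j
            )
    return gamma[m][n] ** (1.0 / p)


# The second sequence is a delayed copy of the first, so warping absorbs the shift.
a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1, 0]
print(dtw_distance(a, b))
```

Because the two toy sequences differ only by a repeated leading value, the warping path aligns them at zero cost even though their lengths differ, which a one-to-one Euclidean comparison cannot do.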

## 3.SVM WITH WEIGHTED DYNAMIC TIME WARPING KERNEL FUNCTION

The goal of the support vector machine (SVM) classifier is to make the margin as large as possible while keeping the number of misclassified points as small as possible: (9)

where *C* (> 0) is the trade-off parameter between minimizing the training error and maximizing the margin, and the slack variables *ξ*_{i} correspond to the deviation size of misclassified samples. By adding Lagrangian multipliers *α* and using the appropriate Karush-Kuhn-Tucker (KKT) conditions, the primal formulation of the optimization problem yields the following dual form: (10)
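For reference, the textbook soft-margin SVM that this description corresponds to is usually written as below. This is a sketch of the standard formulation rather than a transcription of the paper's displays: the weight vector $\mathbf{w}$, the bias $b$, the number of training samples $n$, and the class labels $t_i \in \{-1, +1\}$ are symbols assumed here (with $t_i$ chosen to avoid a clash with the defect indicator *y*_{i} of Section 2.1).

```latex
% Primal problem: maximize the margin while penalizing slack
\begin{aligned}
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
  & \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad
  & t_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1 - \xi_i,
    \qquad \xi_i \ge 0, \qquad i = 1, \dots, n .
\end{aligned}

% Dual problem: obtained through the Lagrangian and the KKT conditions
\begin{aligned}
\max_{\boldsymbol{\alpha}} \quad
  & \sum_{i=1}^{n} \alpha_i
    - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
      \alpha_i \alpha_j \, t_i t_j \, \mathbf{x}_i^{\top} \mathbf{x}_j \\
\text{s.t.} \quad
  & \sum_{i=1}^{n} \alpha_i t_i = 0,
    \qquad 0 \le \alpha_i \le C, \qquad i = 1, \dots, n .
\end{aligned}
```

In the dual, the inputs appear only through the dot product $\mathbf{x}_i^{\top}\mathbf{x}_j$, which is the term a kernel function replaces.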

For non-linear applications, SVM can replace the dot product of input vectors with an appropriate kernel function, *K*(**x**_{i}, **x**_{j}). The key idea of kernel functions is to transform non-linear operations in the input space of **x**_{i} into linear operations in a higher-dimensional feature space. A kernel function can be considered a similarity measure in the input space (Scholkopf, 2000). For example, the most commonly used kernel function is the radial basis function (RBF), given by (11)

The similarity of two samples under the RBF kernel is determined by their Euclidean distance. In other words, standard SVM assumes that all samples have the same dimension and are aligned one to one. This property of standard kernels can be a critical drawback, especially for time series classification. To overcome this drawback, SVM with a dynamic time warping kernel (SVM-DTWK) has recently been proposed for time series classification (Bahlmann *et al*., 2002; Lei and Sun, 2008; Shimodaira *et al*., 2001). In SVM-DTWK, the kernel function is modified as (12)

where *D*_{p}(⋅,⋅) indicates the DTW distance with the *l*_{p} norm between two sequences **u** and **v**. In this paper, the two sequences **u** = (*u*_{1}, …, *u*_{38}) and **v** = (*v*_{1}, …, *v*_{38}) are correlograms of length 38 generated from wafer maps. In the DTW kernel, time series data are “warped” nonlinearly to determine their similarity independent of any non-linear variations in the time dimension. However, standard DTW does not account for the relative importance of the phase difference between a reference point and a testing point. This may lead to misclassification, especially in applications where the shape similarity between two sequences is a major consideration for accurate recognition, so that neighboring points between two sequences are more important than others. In other words, the relative significance of points should depend on the phase difference between them.

Therefore, in this study, we present a support vector machine with weighted dynamic time warping kernel (SVM-WDTWK), which is based on the penalized DTW distance measure proposed by Jeong *et al*. (2011). The WDTW kernel can be expressed as (13)

where *WD*_{p}(⋅,⋅) indicates the weighted DTW distance with the *l*_{p} norm. In the WDTW distance, a different weight value is imposed depending on the phase difference $\left|i-j\right|$ between two points *u*_{i} and *v*_{j}. Thus, the optimal distance between the two sequences is defined as the minimum path over all possible paths as follows:

$$WD_p(\mathbf{u}, \mathbf{v}) = \sqrt[p]{\gamma_w(m, m)} \tag{14}$$

where *m* is the length of each sequence and *γ*_{w}(*i*, *j*) is the cumulative weighted distance described by:

$$\gamma_w(i, j) = \left| w_{\left|i-j\right|} \left( u_i - v_j \right) \right|^p + \min\{\gamma_w(i-1, j-1),\ \gamma_w(i-1, j),\ \gamma_w(i, j-1)\} \tag{15}$$

where ${w}_{\left|i-j\right|}$ is a positive weight value between the two points *u*_{i} and *v*_{j}.

In addition, in order to systematically assign weights as a function of the phase difference between two points, we use a modified logistic weight function (Jeong *et al*., 2011), which is defined as

$$w_{(i)} = \frac{w_{max}}{1 + \exp\left(-g\,(i - m_c)\right)} \tag{16}$$

where *i* = 1, …, *m*, *m* is the length of a sequence, and *m*_{c} is the midpoint of a sequence. *w*_{max} is the desired upper bound for the weight parameter, and *g* is an empirical constant that controls the curvature (slope) of the function; that is, *g* controls the level of penalization for points with a larger phase difference. For example, when *g* = 0.25, the weight function follows a sigmoid pattern. When *g* = 0, all weight values are the same. When *g* = 3, the first half of the sequence is given one weight and the second half another. Even though several weight functions exist, the logistic form has shown better performance in diverse applications (Omitaomu, 2006). A logistic weight function assigns heavier weights to recent observations because recent data are more significant than past ones.
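A minimal sketch of the weighting scheme and the resulting WDTW distance is given below. It is an illustration under stated assumptions rather than the authors' code: the weight is indexed directly by the phase difference |*i* − *j*| starting from 0, the midpoint is taken as half of the longer sequence length, and the names `mlwf_weights` and `wdtw_distance` are hypothetical.

```python
import math


def mlwf_weights(m, g, w_max=1.0):
    """Logistic weights for phase differences 0 .. m-1 (midpoint assumed at m / 2)."""
    m_c = m / 2.0
    return [w_max / (1.0 + math.exp(-g * (k - m_c))) for k in range(m)]


def wdtw_distance(u, v, g=0.25, w_max=1.0, p=2):
    """Weighted DTW: each local cost is scaled by the weight of its phase difference."""
    m, n = len(u), len(v)
    w = mlwf_weights(max(m, n), g, w_max)
    inf = float("inf")
    gamma = [[inf] * (n + 1) for _ in range(m + 1)]
    gamma[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # A larger |i - j| receives a larger weight, i.e. a heavier penalty.
            cost = (w[abs(i - j)] * abs(u[i - 1] - v[j - 1])) ** p
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1]
            )
    return gamma[m][n] ** (1.0 / p)


flat = mlwf_weights(38, g=0.0)    # g = 0: every phase difference gets the same weight
steep = mlwf_weights(38, g=3.0)   # g = 3: close to a step at the midpoint
print(flat[0], flat[-1], steep[0], steep[-1])
```

With *g* = 0 every weight equals *w*_{max}/2, so the measure reduces to a rescaled standard DTW, while a large *g* makes alignments across more than about half the sequence length far more expensive than nearby ones.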

The dual formulation of SVM-WDTWK by adding Lagrangian multiplier *β* is expressed by

Note that the same algorithms to solve standard SVM can be used to solve SVM-WDTWK as well.
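One practical way to reuse a standard SVM solver is to precompute the kernel (Gram) matrix from pairwise elastic distances. The sketch below assumes a Gaussian-type kernel of the form exp(−*D*²/σ²); this exact form, the parameter `sigma`, and the toy sequences are assumptions made here for illustration. Plain DTW stands in for the distance, and a weighted variant could be passed in its place.

```python
import math


def dtw(u, v, p=2):
    """Plain DTW distance with the l_p norm (stand-in for a weighted variant)."""
    m, n = len(u), len(v)
    inf = float("inf")
    gamma = [[inf] * (n + 1) for _ in range(m + 1)]
    gamma[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(u[i - 1] - v[j - 1]) ** p
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1]
            )
    return gamma[m][n] ** (1.0 / p)


def gram_matrix(sequences, distance, sigma=1.0):
    """Assumed kernel form: K[a][b] = exp(-distance(a, b)^2 / sigma^2)."""
    k = len(sequences)
    K = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            d = distance(sequences[a], sequences[b])
            # The distance is symmetric, so fill both triangles at once.
            K[a][b] = K[b][a] = math.exp(-(d ** 2) / sigma ** 2)
    return K


# Toy sequences standing in for correlograms (hypothetical values).
correlograms = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
    [2.0, 2.0, 2.0, 2.0, 2.0],
]
K = gram_matrix(correlograms, dtw, sigma=2.0)
for row in K:
    print([round(x, 4) for x in row])
```

A matrix built this way can be handed to any solver that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. Kernels built from elastic distances are not guaranteed to be positive semidefinite, so the behavior of the solver should be checked on the data at hand.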

## 4.EXPERIMENTAL RESULTS

For the experiments, we generated a total of 640 wafer maps of size 20 by 20 with four patterns: circle, cluster, repetition, and spot (160 wafer maps of each pattern). The generation procedure followed DeNicolao *et al*. (2003). In addition, we added eight levels of random noise: 0.05, 0.1, 0.15, …, 0.4. For example, Dataset {1} consisted of wafer maps with a noise level of 0.05, Dataset {2} of wafer maps with a noise level of 0.1, and so on. Figure 2 presents typical examples of the four classes of defect patterns and their corresponding spatial correlograms.

Four-fold cross validation (CV) was implemented to compare the classification accuracy of the following procedures: one-nearest-neighbor classifier with Euclidean distance (1-NN-ED), one-nearest-neighbor classifier with DTW (1-NN-DTW), one-nearest-neighbor classifier with WDTW (1-NN-WDTW), SVM with Euclidean distance kernel (SVM-EDK), SVM with DTW kernel (SVM-DTWK), and SVM with WDTW kernel (SVM-WDTWK). All SVM parameters and the value of *g* in the weighting function were optimized using a validation dataset. Because there are more than two classes of defect patterns, this study utilized a multiclass SVM for wafer defect pattern recognition. Two approaches have been developed for applying SVM to multiclass classification problems: the one-against-all strategy, which classifies between each class and all the remaining classes, and the one-against-one strategy, which classifies between each pair of classes. For details on applying SVMs to multiclass classification, see Chao and Tong (2009), Hsu and Lin (2002), and Li and Huang (2009). In this study, we utilized the one-against-one strategy because it produced better (or comparable) performance in previous research (Hsu and Lin, 2002; Li and Huang, 2009). In addition, the parameter values for the modified logistic weight function (MLWF) were set as follows: *w*_{min} and *w*_{max} were set to 0 and 1, with *m* = 38 and *m*_{c} = 19, because the maximum number of spatial lags of a 20 by 20-sized wafer map is 38.
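The one-against-one strategy can be sketched as majority voting over all class pairs; with four defect classes this means 4 × 3 / 2 = 6 binary decisions per sample. In the sketch below, `pairwise_decide` is a hypothetical stand-in for a trained binary SVM, and the scalar prototypes are toy values used only to make the example runnable.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, pairwise_decide, sample):
    """Return the class with the most pairwise wins, plus the vote tally."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        # Each binary decision returns either a or b for this sample.
        votes[pairwise_decide(a, b, sample)] += 1
    return votes.most_common(1)[0][0], votes


classes = ["circle", "cluster", "repetition", "spot"]

# Toy decision rule: pick the class whose scalar prototype is closer to the sample.
prototypes = {"circle": 0.0, "cluster": 1.0, "repetition": 2.0, "spot": 3.0}


def pairwise_decide(a, b, sample):
    return a if abs(sample - prototypes[a]) <= abs(sample - prototypes[b]) else b


label, votes = one_vs_one_predict(classes, pairwise_decide, sample=1.2)
print(label, dict(votes))
```

Ties in the vote tally are resolved here by whichever class received a vote first, which is an arbitrary choice of this sketch.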

Table 1 shows the accuracy of the six techniques, both for each fold of the four-fold CV datasets and on average. In this work, the accuracy was calculated as follows:
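A common definition, assumed in the sketch below, takes accuracy as the percentage of wafer maps whose predicted class matches the true class, and then summarizes the per-fold values by their mean and standard deviation. The labels and per-fold numbers are hypothetical and are not results from this study.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Percentage of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


# Hypothetical labels for a single fold, for illustration only.
y_true = ["circle", "cluster", "spot", "repetition"]
y_pred = ["circle", "cluster", "spot", "cluster"]
fold_accuracy = accuracy(y_true, y_pred)

# Hypothetical per-fold accuracies of a four-fold cross validation.
fold_accuracies = [75.0, 100.0, 50.0, 75.0]
average = mean(fold_accuracies)
spread = stdev(fold_accuracies)
print(fold_accuracy, average, round(spread, 2))
```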

The experimental results indicate that SVM-WDTWK outperformed the other methods, with an average accuracy of 93.1%. Comparing the NN classifiers with the SVM classifiers, the SVM classifiers yielded more consistent accuracy across folds. In addition, the DTW kernel-based SVM demonstrated better accuracy than the standard kernel-based SVM, showing that the DTW kernel is a promising method for defect pattern classification using spatial correlograms of semiconductor wafers. Because the standard deviation of each method was relatively high, SVM-WDTWK cannot be said to be statistically preferable to SVM-EDK and SVM-DTWK, but the accuracy trend of the proposed method was promising in each dataset. Thus, the experimental results suggest that SVM-WDTWK is an effective algorithm for automatic defect classification on wafer maps using spatial correlograms.

In addition, we give the rationale for why the proposed WDTW performs better than conventional DTW. Figure 3 presents a group of circle patterns and the corresponding spatial correlograms. In this figure, the X axis represents the spatial lag (*d*) and the Y axis indicates the corresponding statistic *Z*_{T}(*d*), which was described in Section 2. As shown in Figure 3, the shapes of the spatial correlograms are similar, but not exactly the same. Thus, to classify those wafers into the same class, comparing the statistic values at the same lag (or neighboring lags) between two correlograms is more meaningful for defect pattern classification. In the proposed approach, the higher the *g* value, the more heavily points with a higher phase difference are penalized when the optimal weights are determined.

## 5.CONCLUSIONS

Defect patterns on semiconductor wafer maps have been used for monitoring process status in the semiconductor industry. However, defect detection in practice is still largely manual, error-prone, and based on subjective criteria rather than automated methods. There is a strong need to develop automated methodologies that help process engineers quickly recognize process problems and track the root causes of an out-of-control process. We presented a novel classification technique called support vector machines with weighted dynamic time warping kernel (SVM-WDTWK) to classify defect patterns on wafers through the spatial correlogram of a binary wafer map. With the presented approach, a classification accuracy of more than 93% was achieved. Although the results are promising and superior to those of several existing methods, our defect pattern classification approach needs further investigation and improvement. We postulate that classification accuracy could be increased with novel features. In particular, a wafer bin map, which is more informative than the binary map, may be an improved feature for this problem and is being investigated.