## 1. INTRODUCTION

The failure rate of software projects is very high compared with that of other projects, even though substantial investments have been made in new information systems (Altuwaijri and Khorsheed, 2012). Industry surveys show that only one-quarter of software projects succeed outright and that billions of dollars are lost annually due to software project failures (Charette, 2005). Many factors lead to software project failures. For example, customers often change their requests, the specifications of software projects are more ambiguous than those of other projects, and budgets are insufficient (Lehtinen *et al*., 2014).

Because of these difficulties, numerous studies on software project risk have been conducted. Sharma *et al*. (2011) explored the risk dimensions of software projects in India and identified software requirement specification variability, team composition, control processes, and dependability. Liu *et al*. (2011) found that unstable requirements lead to potential interpersonal conflict that negatively affects the performance of software projects. Fu *et al*. (2012) analyzed the impact of requirement changes on software project risk and established a probabilistic model that uses a design structure matrix to evaluate the risk of requirement changes.

Quantitative methods can be used to reduce bias, and machine learning methods in particular have been applied to predict project risk. Hu *et al*. (2009) applied artificial neural networks (ANN) and support vector machines (SVM) to build an intelligent model that predicts and controls software project risk according to the combined outcomes of project quality, time, and cost. Neumann (2002) suggested a principal component analysis (PCA)-ANN technique to estimate software risk and improve the ability to identify high-risk software. Hu *et al*. (2015) constructed a software risk prediction model based on a classifier ensemble. The candidate classifiers composing the ensemble include decision tree, SVM, ANN, Bayesian network, and random forest; among them, SVM achieved the best accuracy, whereas the decision tree performed best in cost analysis. Hu *et al*. (2013) employed a Bayesian network with causality constraints to build a software project risk prediction model, which predicted more accurately than other machine learning models such as logistic regression and decision tree. Verner *et al*. (2007) employed a logistic regression model to predict software project success in different cultural contexts, including the United States and Australia. Reyes *et al*. (2011) developed a genetic algorithm-based model to predict the success probability of software projects.

Existing studies that adopt machine learning methods to predict project risk do not consider the processes of a project, so their findings may not generalize: it remains unknown which processes are the actual sources of risk. In addition, project risk during previous processes can affect risk during current or subsequent processes, and few previous studies have dealt effectively with this dependency. To resolve these problems, we present a method that predicts project risk at each stage of the process, given the status of the risk factors that are potential causes of risk. This is done by introducing a probabilistic graphical modeling method that has been widely applied to sequence labeling and has demonstrated good performance in various fields. Because predicting the degree of risk for each process can be regarded as a sequence labeling problem, we adopt this method to predict software development risk at each stage. The novelty of our research can therefore be summarized as follows: we develop a probabilistic graphical framework to predict software project risk that takes into account the facts that (1) a project consists of several processes, and (2) a linear relationship exists between the processes. Our model not only forecasts the risks but also finds the critical factors for analyzing project risk.

The suggested framework in the paper is composed of five steps: (1) defining software project processes, (2) identifying software project risk factors as potential causes of risk, (3) modeling processes as a linear chain probabilistic graphical model by introducing two feature functions, (4) learning the model to estimate parameters, and (5) predicting the risk probability of each process and determining significant factors at each stage of the project. More precisely, we refer to ISO/IEC 12207, an international standard for software life cycle processes, to define the software project process in step (1). In step (2), we also identify project risk factors by referring to existing studies of software project risk. In step (3), feature functions representing the relationships between processes and corresponding risk factors that consider dependency among the processes are introduced. In step (4), we train and infer the model designed in step (3). Finally, methods used to calculate risk probabilities of processes when risk factors are realized and determine critical factors for project risk at each stage are described in step (5).

The remainder of this paper is organized as follows. In section 2, we provide a framework to predict software project risk. In section 3, we present our model using an artificially constructed data set and compare our results with those of other models. Finally, section 4 concludes the paper.

## 2. RISK PREDICTION FRAMEWORK

In this section, a framework to predict software project risk is provided. Each subsection describes one step of the framework in detail.

### 2.1 Defining Software Project Risk Processes

ISO/IEC 12207 is an international standard for software lifecycle processes and aims to define all tasks required to develop or maintain software. It divides a software project into 12 processes along with their specific tasks: *process implementation; system requirements analysis; system architecture design; software requirements analysis; software architectural design; software coding and testing; software integration; software qualification testing; system integration; system qualification testing; software installation; and software acceptance support*. In this paper, we formulate a mathematical model by adopting these processes and determine risk factors for each process. Regarding the relationship between process and factor risk, a probabilistic graphical model of linear chain type is constructed in order to predict risk for each process (Figure 1). This hypothetical software project includes 12 processes and several risk factors that are related to each process.

### 2.2 Identifying Software Project Risk Factors

We identify risk factors that could affect a software project based on the existing literature (Wallace and Keil, 2004; Schmidt *et al*., 2001; Addison and Vallabh, 2002; Addison, 2003; Han and Huang, 2007; Paré *et al*., 2008). A total of 61 factors are identified after eliminating overlapping factors and regarded as potential causes of risk (Table 1). We examine all risk factors and allocate each factor to the process to which the factor is most closely related and potentially affects. For example, the risk factors “Users with negative attitudes toward the project,” “Change in organizational management during the project,” and “Late changes to requirement” are closely related to implementation activity of the software project, and therefore, assigned to *process implementation*. Note that risk factors may be allocated to more than one process. For example, the factor “Lack of cooperation from users” is assigned to *process implementation*, *software requirements analysis*, and *software design*.

Then, each risk factor is translated into a variable, the type of which is either binary or ordinal. Specifically, if the risk factor can be described as a binary variable, then its value is 1 in cases of adverse outcomes, and 0 otherwise. Likewise, if an ordinal variable is necessary to describe the risk factor, then a five-point measure is applied, ranging from 1 ( = “nothing happens”) to 5 ( = “very risky issue happens”). Table 1 shows a partial list of risk factors and their variable types.

These factor variables are considered as random variables that take values with some given probabilities. A detailed illustrative example is provided in Section 3.

### 2.3 Probabilistic Graphical Model

Let $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{12})$ be the set of variable vectors, where each $\mathbf{x}_i$, $i = 1, 2, \dots, 12$, is the vector of risk factors at process $i$, and let $\mathbf{y} = (y_1, y_2, \dots, y_{12})$ be the vector of risk degrees of the processes, where $y_i \in \{\mathrm{H}, \mathrm{M}, \mathrm{L}\}$, $i = 1, 2, \dots, 12$, is the degree of risk of process $i$. Here, H, M, and L stand for “highly risky,” “medium risky,” and “low risky,” respectively. It should be noted that the $\mathbf{x}_i$’s are not independent of each other because a risk factor may appear in more than one process, as we described earlier.

Furthermore, project risk for the previous process usually affects risk for the current process or subsequent processes, and therefore the values of the $y_i$’s may also depend upon each other. Because of these relationships among input variables (risk of factor) and among response variables (risk of process), project risk analysis often becomes intractable. If one uses a statistical model with many variables to estimate risk, for example, multicollinearity, a phenomenon in which two or more predictor variables are highly linearly related, might cause the estimation results to be unstable or unreliable. We resolve these problems by introducing two feature functions (state and transition feature functions) to build a probabilistic graphical model.

Let $(\boldsymbol{\lambda}, \boldsymbol{\mu}) = (\lambda_1, \lambda_2, \dots, \lambda_K, \mu_1, \mu_2, \dots, \mu_L)$ be the parameter set that should be learned from a given training data set of size $N$. Then, the probability of the process risks of a software project can be represented as the following conditional probability, a linear chain conditional random field (CRF):

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{i=1}^{12} \sum_{k=1}^{K} \lambda_k f_k(y_i, \mathbf{x}, i) + \sum_{i=2}^{12} \sum_{j=1}^{L} \mu_j g_j(y_{i-1}, y_i, \mathbf{x}, i) \right) \tag{1}$$

where $g_j(\cdot)$ is the $j$th transition feature function of the risk factors and processes $i$ and $i-1$. This set of transition feature functions captures the dependency structure of risks between two adjacent processes $i-1$ and $i$. $f_k(\cdot)$ is the $k$th state feature function of process $i$ and the risk factors. Any relationship that may exist between the risks of factors and the corresponding processes can be reflected in the model by means of these status feature functions. Lastly, $Z(\mathbf{x})$ is a normalization function that ensures $\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) = 1$.
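To make the role of the normalization function concrete, the following minimal sketch evaluates a linear chain CRF probability by brute-force enumeration over all label sequences. It is an illustration under stated assumptions, not the implementation used in this paper: the indicator features `state_feature` and `transition_feature`, the scalar weights `lam` and `mu`, and the toy input `x_demo` (which stores a per-process label guess instead of a risk factor vector) are all hypothetical.

```python
import itertools
import math

LABELS = ("H", "M", "L")  # risk degrees used in the text


def state_feature(y_i, x, i):
    # Hypothetical indicator: fires when the label equals the guess stored in x[i].
    return 1.0 if y_i == x[i] else 0.0


def transition_feature(y_prev, y_i, x, i):
    # Hypothetical indicator: fires when adjacent processes share a risk degree.
    return 1.0 if y_prev == y_i else 0.0


def score(y, x, lam, mu):
    # Weighted sum of state features plus weighted sum of transition features.
    s = sum(lam * state_feature(y[i], x, i) for i in range(len(y)))
    s += sum(mu * transition_feature(y[i - 1], y[i], x, i)
             for i in range(1, len(y)))
    return s


def crf_probability(y, x, lam=1.0, mu=0.5):
    # Z(x) sums exp(score) over every possible label sequence (3^n terms),
    # which is only feasible for short toy sequences.
    z = sum(math.exp(score(c, x, lam, mu))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(tuple(y), x, lam, mu)) / z


x_demo = ("H", "H", "M")  # illustrative per-process guesses, not real data
p_demo = crf_probability(("H", "H", "M"), x_demo)
```

For the 12 processes considered here, enumeration would require 3^12 terms per evaluation, so a practical implementation would typically compute the normalization with dynamic programming instead.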

CRF is a framework for building probabilistic models to segment and label sequential data (Lafferty *et al*., 2001). It is used to encode known relationships between observations and construct consistent interpretations and has been applied in fields such as text processing, computer vision, and bioinformatics. Detailed explanations of CRF and its applications can be found in Chen *et al*. (2015), Blunsom and Cohn (2006), McCallum (2003), and Sha and Pereira (2003) and references therein.

Here, for simplicity, we consider one status feature function and one transition feature function (that is, *K* = *L* =1). We utilize an artificial neural network (ANN) to capture the complex relationships between risk factors and process risk. ANN is a well-known machine learning method utilizing the properties of biological neural networks (Figure 2). The main advantage of this method is its nonlinearity, allowing better fit to data, and high parallelism. In particular, it can handle various types of data and obtain good results in complex areas, including project risk management. Han (2015) and López-Martín (2015) adopted ANN to predict risk in software projects. The ANN employed in the paper has one hidden layer and five hidden nodes, as seen in Figure 2.

Then the status feature function $f(y_i, \mathbf{x}, i)$ used in our CRF model is given as follows:

where ANN($\mathbf{x}_i$) is the response value of our ANN model when the input is $\mathbf{x}_i$ for process $i$.

In addition, because project risks for the processes of the same project do not fluctuate sharply and the processes interact with each other (PMBOK® Guide 5th Edition, 2013, p. 48), the following transition feature function is adopted in our CRF model:

**Note**. More than one feature function can be introduced. For example, if a project manager wants to use a decision tree to represent the complex relationship between risk factors and process risk, then one can add another feature function such as:

where DT($\mathbf{x}_i$) is the output of the decision tree given $\mathbf{x}_i$. Likewise, an additional transition feature function that may focus more on the serial processes of high risk can be introduced as follows:

### 2.4 Model Learning

Learning the model (that is, estimating its parameters) involves finding the parameter set $(\boldsymbol{\lambda}^{*}, \boldsymbol{\mu}^{*})$ that maximizes the log likelihood of the training data,

$$L(\boldsymbol{\lambda}, \boldsymbol{\mu}) = \sum_{n=1}^{N} \log p\left(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}\right) \tag{6}$$

where $\mathbf{x}^{(n)}$ and $\mathbf{y}^{(n)}$ denote the vectors of input variables (either binary or ordinal) and response variables (degree of the process risk) of the $n$th sample. $L(\boldsymbol{\lambda}, \boldsymbol{\mu})$ is a concave function, so every local optimum is a global optimum; therefore, methods such as matrix computation, dynamic programming, and gradient ascent can be used to find the global optimum. In this paper, we adopt the gradient ascent method (Roth and Yih, 2005), which starts from a randomly selected initial point and repeatedly takes steps proportional to the gradient of the function in Eq. (6) to find $\boldsymbol{\lambda}^{*}$ and $\boldsymbol{\mu}^{*}$.
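As a hedged illustration of what each gradient ascent step computes, the partial derivative of a linear chain CRF log likelihood with respect to a state weight takes the familiar "observed minus expected feature count" form. The step size $\eta$ is an assumed symbol that does not appear elsewhere in this paper, and no regularization term is included:

$$\frac{\partial L}{\partial \lambda_k} = \sum_{n=1}^{N} \left[ \sum_{i} f_k\left(y_i^{(n)}, \mathbf{x}^{(n)}, i\right) - \sum_{\mathbf{y}} p\left(\mathbf{y} \mid \mathbf{x}^{(n)}\right) \sum_{i} f_k\left(y_i, \mathbf{x}^{(n)}, i\right) \right], \qquad \lambda_k \leftarrow \lambda_k + \eta \, \frac{\partial L}{\partial \lambda_k}$$

The update for each transition weight $\mu_j$ is analogous, with $g_j$ in place of $f_k$. The gradient vanishes when the expected feature counts under the model match the counts observed in the training data.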

### 2.5 Inferencing and Determining Significant Factors at each Process of Project

Meanwhile, the inference task is to find the most likely sequence $\mathbf{y}^{*}$ of degrees of process risk given the observed risk factor values $\mathbf{x}$, as follows:

$$\mathbf{y}^{*} = \arg\max_{\mathbf{y}} \, p(\mathbf{y} \mid \mathbf{x})$$

That is, if a set of values $\tilde{\mathbf{x}}$ of all risk factors for a software project is given, then we can estimate the probabilities of the risk of each process, $p(\mathbf{y} = (y_1, y_2, \dots, y_{12}) \mid \tilde{\mathbf{x}})$, from which the risk probability of the project may be derived.
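Finding the most likely label sequence in a linear chain model is commonly done with the Viterbi dynamic programming algorithm. The sketch below shows that idea under stated assumptions and is not necessarily the inference procedure used in this paper: `state_score` and `trans_score` are hypothetical callables standing in for the weighted state and transition features, and the toy scores are illustrative only.

```python
def viterbi(n, labels, state_score, trans_score):
    """Return the highest-scoring label sequence of length n.

    state_score(i, y): score of label y at position i (hypothetical callable)
    trans_score(p, y): score of moving from label p to label y (hypothetical)
    """
    # best[i][y] = best total score of any prefix ending at position i with label y
    best = [{y: state_score(0, y) for y in labels}]
    back = []  # back[i - 1][y] = best previous label when position i has label y
    for i in range(1, n):
        row, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans_score(p, y))
            row[y] = best[-1][prev] + trans_score(prev, y) + state_score(i, y)
            ptr[y] = prev
        best.append(row)
        back.append(ptr)
    # Trace back from the best final label.
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


guesses = ("H", "H", "M")  # illustrative per-process guesses
demo_path = viterbi(
    3,
    ("H", "M", "L"),
    lambda i, y: 1.0 if y == guesses[i] else 0.0,  # toy state score
    lambda p, y: 0.5 if p == y else 0.0,           # toy transition score
)
```

The cost grows linearly with the number of processes and quadratically with the number of risk degrees, so decoding 12 processes with three labels is inexpensive.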

Since numerous factors affect project risk for each process, it is in general almost impossible to manage all risk factors. For efficient management, therefore, it is better to determine the significant factors that are critical for project risk. Olden and Jackson (2002) suggest a connection weight method that calculates the sum of the products of the raw weights of the connections from an input node to the hidden nodes and from the hidden nodes to the output node in an ANN. The larger the sum for a given input node is, the more important the corresponding input variable is. The relative importance, $R_I$, of a given input can be defined as:

$$R_I = \sum_{H=1}^{h} W_{IH} W_{HO} \tag{9}$$

where $h$ is the total number of hidden nodes, $W_{IH}$ is the weight of the connection between input node $I$ and hidden node $H$, and $W_{HO}$ is the weight of the connection between hidden node $H$ and output node $O$. In this study, we determine significant factors based on the values of $R_I$. A detailed explanation of the method is given in the next section with an illustrative example.
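The connection weight calculation can be sketched in a few lines. This is a minimal illustration assuming a network with one hidden layer and a single output node; the function name, the weight values, and the choice to rank inputs by absolute importance are assumptions made for the example, not details taken from this study.

```python
def connection_weight_importance(w_input_hidden, w_hidden_output):
    """Sum of products of input-to-hidden and hidden-to-output weights.

    w_input_hidden[I][H]: weight from input node I to hidden node H
    w_hidden_output[H]:   weight from hidden node H to the single output node
    """
    return [
        sum(w_ih * w_ho for w_ih, w_ho in zip(row, w_hidden_output))
        for row in w_input_hidden
    ]


# Hypothetical weights: three input nodes, two hidden nodes.
w_ih = [[0.5, -1.0], [2.0, 0.5], [0.0, 0.25]]
w_ho = [1.0, -2.0]

importance = connection_weight_importance(w_ih, w_ho)
# Assumption: rank inputs by the magnitude of their importance value.
ranking = sorted(range(len(importance)),
                 key=lambda i: abs(importance[i]), reverse=True)
```

With these toy weights the first input obtains the largest value, because both of its weight products are positive and reinforce each other.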

## 3. MODEL APPLICATION

This section illustrates and tests the model in a hypothetical project case to demonstrate its applicability and describe its methodology. For our model to be applied to a real project, data should be collected at the process level, and the risk of each process and the related risk factors should be recorded. Such real data, however, are not available because this kind of model has never been used in real project risk management situations. Therefore, we developed an illustrative test based on hypothetical software projects, in which the data set is produced based on reasonable assumptions.

We provide an overview of the artificially produced data set. As described in the subsections 2.1 and 2.2, risk factors are assigned to each process, where the variable representing each factor is either binary (0 or 1) or ordinal (5 point scale).

A total of 300 software projects and their corresponding processes are artificially produced, where one of the three risk degrees, H (high risk), M (medium risk), or L (low risk), is assigned with equal probability to the first process of each project. The risks of the subsequent processes are then determined as follows: the risk degree of process *i*+1 is the same as that of process *i* with probability 0.7 and takes each of the two other degrees with probability 0.15. For example, if the risk of process *i* is M, then that of process *i*+1 is M with probability 0.7, and H or L with probability 0.15 each.
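The generation rule for the process risk sequences can be sketched as a simple Markov chain. This is a minimal illustration of the stated rule, not the script actually used to produce the data set; the function names and the random seed are assumptions.

```python
import random

LABELS = ("H", "M", "L")


def next_risk(current, rng, stay=0.7):
    # Keep the same degree with probability 0.7; otherwise pick one of the
    # two other degrees uniformly, i.e. with probability 0.15 each.
    if rng.random() < stay:
        return current
    return rng.choice([label for label in LABELS if label != current])


def generate_project(rng, n_processes=12):
    # The first process receives H, M, or L with equal probability.
    sequence = [rng.choice(LABELS)]
    while len(sequence) < n_processes:
        sequence.append(next_risk(sequence[-1], rng))
    return sequence


rng = random.Random(0)  # fixed seed so the illustration is repeatable
projects = [generate_project(rng) for _ in range(300)]

# Fraction of adjacent process pairs that share a risk degree (close to 0.7).
same = sum(a == b for p in projects for a, b in zip(p, p[1:]))
stay_rate = same / (300 * 11)
```

Because 300 projects yield 3,300 transitions, the empirical share of unchanged risk degrees should lie close to the nominal 0.7.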

Next, we assign probabilities to the random variables of the risk factors according to variable type, as shown in Table 2. For example, if the risk degree of the process *System Requirement Analysis* is H, then an adverse issue related to a binary risk factor occurs (value = 1) with probability 60% and does not occur (value = 0) with probability 40%.

With the dataset produced according to these rules, we construct a probabilistic graphical model. We also build ANN, decision tree, naïve Bayes, and logistic regression models as representative classification models and compare them with the proposed model in terms of accuracy. Accuracies are calculated by 5-fold cross-validation. The results are summarized in Table 3.
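For readers unfamiliar with the procedure, 5-fold cross-validation partitions the projects into five folds, holds out each fold once for testing, and averages the resulting accuracies. The sketch below shows only the index bookkeeping; the function names are hypothetical, and whether the projects were shuffled before splitting is not stated in this paper, so no shuffling is performed here.

```python
def k_fold_indices(n_samples, k=5):
    """Split sample indices 0..n_samples-1 into k (train, test) pairs."""
    indices = list(range(n_samples))
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def mean_accuracy(fold_accuracies):
    # The reported accuracy is the average over the held-out folds.
    return sum(fold_accuracies) / len(fold_accuracies)


folds = k_fold_indices(300, k=5)  # 300 projects: 240 for training, 60 for testing
```

Each project appears in exactly one test fold, so every prediction used in the average is made on data the model did not see during training.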

Table 3 shows that the accuracies of our model are the highest among all five models. For example, our model could predict degrees of risk of the process *System Requirement Analysis* with an accuracy of 97.33%, while the other four machine learning methods show accuracies of 80.67~86.33% for our 300 software project case. On average, our model shows 92.81% accuracy, compared to 78.00~87.56% for other models. This may be a natural consequence because our model considers not only relationships between inputs and outputs using status feature functions, but also relationships among outputs by means of transition feature functions.

Now, we illustrate the inference process using an example. Let $\mathbf{x}_{1}^{(72)}$ and $\mathbf{x}_{2}^{(72)}$ be the vectors of risk factors of the first process (*process implementation*) and the second process (*System Requirement Analysis*) of the arbitrarily chosen 72nd software project, respectively. The output of the ANN given $\mathbf{x}_{1}^{(72)}$ is H, and therefore, the risk degree of the first process is H. Recalling that $Z(\mathbf{x})$ is independent of $\mathbf{y}$, the following three values are computed and compared:

Therefore, the most plausible risk degree of the second process is H. This inference procedure is repeated for all 12 processes.

Additionally, we calculate the relative importance *R _{I}* of risk factors of each process to identify critical factors for software project risk using Eq. (9). Table 4 lists the results for our dataset.

## 4. CONCLUSION

Project risk management is a major topic of interest in the field of project management. Many standards of project management adopt and apply procedural approaches. Project management is an integrative undertaking that requires each project process to be appropriately aligned and connected with other processes to facilitate coordination.

Since actions taken during one process typically affect that process and other related processes, we suggest a software project risk prediction model based on the probabilistic graphical model, which is designed to predict risks for each process. We applied an ANN to construct a feature defining the state of the process, expressing relationships between project risk factors and process risk. We also defined a transition feature function expressing relationships among process risks, under the assumption that project risks do not fluctuate sharply.

We artificially generated a dataset to validate our model while considering risk probability. As a first step, we compared the accuracy of our model with four other well-known machine-learning models and found that our model outperforms all others. In addition, critical factors of each process that mostly affect project risk could be identified by their relative importance, by which overall risk management can be accomplished effectively and efficiently.

As real data suitable for testing our model are not currently available, an illustrative test problem was developed based on hypothetical software projects. In the future, our model can be applied to real data when they are available by following the steps below:

(1) For all completed software projects, separate the software project into processes according to ISO/IEC 12207; (2) identify risk factors for each process; (3) assign a risk value to each risk factor by referring to Table 1; (4) make a data structure for each process using the values assigned in (3); (5) construct an ANN model to predict process risk level using the data for each process made in (4); (6) construct a CRF model referring to section 2.3 by using the ANN model as feature $f(y_i, \mathbf{x}, i)$ referring to Eq. (2); (7) assign a score to each risk factor for a software project and using this information, predict risk for each process; and (8) determine significant risk factors for each process by referring to section 2.5.

The managerial implications of the framework can be summarized as follows. First, risk management of a software project should be carried out at the process level. Through the literature review and the model application, we found that the previous process usually affects risk for the current process or subsequent processes. In addition, our model, which builds on this fact, shows better prediction performance than other machine learning methods that do not consider the relationship between two consecutive processes. Second, the degree of risk at each process should be measured in order to develop and employ the model in the real world. Finally, the model should be developed and updated with recent project data to identify critical factors at each process.