With the development of economic globalization and financial liberalization, credit assessment plays an important role in maintaining normal social and economic relationships. Personal credit assessment requires calibrated models built with statistical methods. Models based on a single method cannot simultaneously achieve robustness, interpretability and prediction accuracy. In this paper, a back-propagation neural network (BPNN) was used to generate a new comprehensive variable for logistic regression (LR), with the number of hidden nodes tuned. The optimal back-propagation neural network-logistic regression combination model (BPNN-LR) was established with 5 input nodes, 7 hidden nodes and 1 output node. The model performance improved slightly: the classification accuracy reached 86.33% for the training samples and 87.96% for the test samples, compared with 86.25% and 84.87% for the LR model alone. The results provide a technical reference for corporate decision making.

Credit underpins normal social and economic relationships, and personal credit is the basis of social credit. Because the personal credit system is imperfect, confused personal credit relationships, fraud, repudiation and other dishonest behaviors are widespread. This situation has not only become a major obstacle to the development of the market economy, but has also seriously affected the progress and development of society [

Statistical methods are widely used for personal credit assessment. Linear discriminant analysis and logistic regression (LR) are basic linear methods that are robust and interpretable. They generate a linear scorecard to evaluate a customer’s credit [

The records of customers’ behaviors (January-June 2015) were collected from the transaction data of China Unicom. The numerical variables were normalized and the classification variables were transformed to binary form. The contribution of each variable was evaluated by its Information Value (IV), and the most contributive variables were selected for model calibration and prediction. The selected transformed variables were the basic data for model optimization. First, LR models were established to calculate each customer’s probability of having bad credit, in order to discriminate whether the customer had good or bad credit. Then, a BPNN model with one hidden layer was used to generate a new comprehensive variable, with the number of hidden nodes tuned and selected. Thus, an optimal back-propagation neural network-logistic regression combination model (BPNN-LR) was determined for the evaluation of customers’ personal credit, so that the LR classification accuracy was improved. In our study, Python and Matlab were used together for sample classification, data transformation and model optimization.

A total of 4518 customer behavior records were collected from the transaction data of China Unicom. These records, covering January to June 2015, were scrambled and used as samples for establishing statistical models. Each sample included 24 variables (as shown in

Classification variable | Symbol | Numerical variable | Symbol | Numerical variable | Symbol |
---|---|---|---|---|---|
Customer credibility | Y | Times of payment arrearage | T1 | Total number of communicational friends | T9 |
IS converged with other service | C1 | Amount of payment arrearage | T2 | Number of intranet communicational friends | T10 |
IS real-name registered | C2 | Online duration | T3 | Number of outer communicational friends | T11 |
IS bound with bankcards | C3 | Accumulative days of communication | T4 | Accumulation of used data | T12 |
IS an attractive number | C4 | Times of accumulative calls | T5 | Monthly average of the on-bill amounts | T13 |
IS registered to APP’s | C5 | Monthly average of arrearages | T6 | Monthly average of payments | T14 |
IS a grouped number | C6 | Accumulative call duration | T7 | Accumulation of overdue payment arrearages | T15 |
On/off line | C7 | Times of being suspended | T8 | Times of international roaming | T16 |

The personal credit assessment model was established to predict the value of customer credibility (Y) from 23 characteristic variables, so that customers could be discriminated as “good” or “bad”. The 23 characteristic variables included 16 numerical variables (T) and 7 classification variables (C). It was necessary to normalize the numerical variables, because they were acquired using different measurement scales and took values in different numerical ranges. The seven classification variables had different numbers of categories. For example, the values of “IS converged with other service” (C1) included 0, 1, 2 and 3. This diversity of values was likely to result in high-dimensional calculation. To reduce the model’s complexity, they were transformed to binary classification variables, equal to either 0 or 1.
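The preprocessing described above might be sketched in Python as follows. The text does not state which normalization was used or how multi-valued classification variables were collapsed, so min-max scaling to [0, 1] and the rule "any nonzero value becomes 1" are assumptions made for illustration, as are the function names and the toy values.

```python
def min_max_normalize(values):
    """Scale a numerical variable to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def binarize(values):
    """Collapse a multi-valued classification variable to 0/1.

    Assumed rule: 0 stays 0, any other category becomes 1.
    """
    return [0 if v == 0 else 1 for v in values]


# Toy data (hypothetical): a numerical variable such as T3 and the
# four-valued classification variable C1.
t3 = [2, 14, 38, 26]
c1 = [0, 1, 2, 3]

t3_scaled = min_max_normalize(t3)
c1_binary = binarize(c1)
```

Other choices, such as z-score standardization or a different grouping of categories, would fit the description equally well.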

The contributions of the 23 characteristic variables differed. If all 23 variables were used for personal credit assessment, the calibration model could become overly complex and the prediction accuracy could suffer. Therefore, it was necessary to select characteristic variables according to their contribution. The contributions of variables were quantitatively measured by calculating the Weight of Evidence (WOE) for divided sample groups. WOE is used to measure the difference between variables and between different sample groups [

The WOE_{i} for the i-th group was calculated as follows,

$$\mathrm{WOE}_i = \ln\left(\frac{G_i/G}{B_i/B}\right)$$

where G_{i} and B_{i} are the numbers of “good” and “bad” customers in the i-th group, and G and B are the total numbers of “good” and “bad” customers. The Information Value (IV) of each variable is then defined as the weighted sum of the group WOE values,

$$\mathrm{IV} = \sum_{i=1}^{n} \mathrm{WOE}_i \times \left(\frac{G_i}{G} - \frac{B_i}{B}\right)$$

A larger IV indicates that the variable contributes more to the model [
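The WOE and IV formulas can be illustrated with a short Python sketch. The function name `woe_iv` and the toy data are hypothetical, and the sketch assumes that every group contains at least one “good” and one “bad” customer, since the logarithm is otherwise undefined.

```python
import math


def woe_iv(groups, labels):
    """Return ({group: WOE_i}, IV) for one grouped variable.

    groups[i] is the group of sample i; labels[i] is 0 ("good") or 1 ("bad").
    """
    total_good = sum(1 for y in labels if y == 0)  # G
    total_bad = sum(1 for y in labels if y == 1)   # B

    counts = {}  # group -> [G_i, B_i]
    for g, y in zip(groups, labels):
        counts.setdefault(g, [0, 0])[y] += 1

    woe, iv = {}, 0.0
    for g, (good_i, bad_i) in counts.items():
        # Assumes good_i > 0 and bad_i > 0; empty cells would need smoothing.
        p_good = good_i / total_good
        p_bad = bad_i / total_bad
        woe[g] = math.log(p_good / p_bad)
        iv += woe[g] * (p_good - p_bad)
    return woe, iv


# Toy example (hypothetical): a binary variable over eight customers.
groups = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
woe, iv = woe_iv(groups, labels)
```

Computing `iv` for each candidate variable and sorting in descending order gives the kind of ranking used for variable selection.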

Logistic regression is a common classification model in machine learning. Personal credit was estimated by establishing logistic regression models with the selected transformed variables. For calibration and prediction, the customer credibility (Y) was defined as the dependent variable, and the selected variables were defined as the independent variables (labeled X_{1}, X_{2}, …, X_{n}). The logistic regression formula was as follows,

$$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$$

where P was the probability of a customer being “bad”, i.e. P = P(Y = 1). The value of P predicted by the logistic regression model was used to distinguish “good” customers from “bad” customers.
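Solving the logistic regression formula for P gives the probability directly. This is a standard algebraic rearrangement rather than an additional modeling step:

```latex
P = \frac{1}{1 + \exp\left(-\left(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n\right)\right)}
```

P therefore always lies strictly between 0 and 1, and P = 0.5 exactly when the linear combination equals zero.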

The BPNN-LR model used a back-propagation neural network to train a new comprehensive variable for logistic regression. The neural network had one hidden layer with k hidden neurons. The parameters (the weights) of the whole network were trained iteratively.

Step 1: the weights between each input node and each hidden node (V_{ij}) were initialized. The selected characteristic variables X_{1}, X_{2}, …, X_{n} were input, and their weighted sum was passed through the logistic function to give the hidden layer output. The output of the hidden layer was defined as follows,

$$H_j = f\left(\sum_{i=0}^{n} V_{ij} X_i\right), \quad j = 1, 2, \cdots, k$$

Step 2: the weights between each hidden node and each output node (W_{jp}) were initialized. The outputs of the hidden layer were weighted and summed, then transferred to the output layer. The final output of the neural network was as follows,

$$Y_p = f\left(\sum_{j=0}^{k} W_{jp} H_j\right), \quad p = 1, 2, \cdots, m$$

where the logistic function was used as the transfer function f, V_{ij} was the weight between the i-th input node and the j-th hidden node, and W_{jp} was the weight between the j-th hidden node and the p-th output node. In each weighted sum, the index 0 denoted the threshold term: with X_{0} = 1 and H_{0} = 1, V_{0j} and W_{0p} played the role of the neuron thresholds.
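The two steps amount to a forward pass through a network with one hidden layer. A minimal Python sketch follows; the function names and the example weights are hypothetical, and only the forward computation is shown, not the back-propagation training itself.

```python
import math


def logistic(z):
    """Transfer function f."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, V, W):
    """Forward pass of a one-hidden-layer network.

    x: inputs [X_1, ..., X_n]
    V: (n + 1) rows by k columns; row 0 holds the thresholds V_0j
    W: (k + 1) rows by m columns; row 0 holds the thresholds W_0p
    """
    x_ext = [1.0] + list(x)  # X_0 = 1
    k = len(V[0])
    hidden = [
        logistic(sum(V[i][j] * x_ext[i] for i in range(len(x_ext))))
        for j in range(k)
    ]
    h_ext = [1.0] + hidden  # H_0 = 1
    m = len(W[0])
    return [
        logistic(sum(W[j][p] * h_ext[j] for j in range(len(h_ext))))
        for p in range(m)
    ]


# Hypothetical weights for n = 2 inputs, k = 2 hidden nodes, m = 1 output.
V = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.8]]
W = [[0.2], [1.0], [-1.5]]
y = forward([0.3, 0.7], V, W)
```

Storing the thresholds in row 0 of each weight matrix mirrors the convention X_{0} = 1 and H_{0} = 1 used in the sums.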

The output of the BP neural network was regarded as a new comprehensive variable and added to the logistic regression, which then predicted the customer credibility of all samples. According to the predictive results, the optimal number of hidden neurons was selected to improve the BPNN-LR model and identify the “good” and “bad” customers.
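One way to write the combined model is given below, assuming that the network output is appended to the selected variables as one extra regressor. The symbols Z and \beta_{n+1} are introduced here for illustration and do not appear in the original formulas.

```latex
\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + \beta_{n+1} Z,
\qquad Z = Y_1 = f\left(\sum_{j=0}^{k} W_{j1} H_j\right)
```

Under this reading the interpretable linear terms are kept, while Z carries the nonlinear combination of the inputs learned by the network.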

A total of 4518 samples were randomly divided into a training set and a test set at a ratio of about 3:1, giving 3388 training samples and 1130 test samples. Each sample included 24 variables. One of these variables was customer credibility (Y), by which the customers were labeled “good” (valued as 0) or “bad” (valued as 1). We defined the customer credibility as the dependent variable and the 23 characteristic variables as the independent variables to establish LR and BPNN-LR models for personal credit assessment.
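A random split of this kind can be sketched in Python as follows. The function name, the fixed seed and the use of integer stand-ins for the records are assumptions; the original does not say how the random division was implemented.

```python
import random


def split_samples(samples, train_fraction=0.75, seed=0):
    """Shuffle a copy of the samples and split it into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary choice
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Integer stand-ins for the 4518 customer records.
records = list(range(4518))
train_set, test_set = split_samples(records)
```

With 4518 records and a training fraction of 0.75, truncation to an integer yields 3388 training samples and 1130 test samples, the sizes reported above.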

To reduce the complexity of the model, variables were selected according to their contribution (i.e. their IV values); a larger IV indicates that the variable contributes more to the model. The variables were sorted in descending order of IV (shown in

The logistic regression method was used to establish the personal credit assessment model and to distinguish “good” customers from “bad” customers by the probability P resulting from the model. The model was fitted on the training samples using the most contributive characteristic variables and then used to predict P for the test samples. If P was greater than 0.5, the sample was classified as a “bad” customer (Y = 1); otherwise it was classified as a “good” customer (Y = 0).
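The decision rule can be sketched in Python as follows. The coefficients and the sample are hypothetical values chosen for illustration and are not the fitted coefficients of the study.

```python
import math


def bad_credit_probability(beta, x):
    """P(Y = 1) from the logistic regression formula.

    beta: [beta_0, beta_1, ..., beta_n]; x: [X_1, ..., X_n]
    """
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def classify(p, threshold=0.5):
    """Return 1 ("bad") if P exceeds the threshold, otherwise 0 ("good")."""
    return 1 if p > threshold else 0


# Hypothetical coefficients and one normalized sample.
beta = [-1.0, 2.5, 0.8]
sample = [0.9, 0.4]
p = bad_credit_probability(beta, sample)
label = classify(p)
```

Because the rule uses a strict inequality, a sample with P exactly equal to 0.5 is classified as “good”.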

Results of the logistic regression model were shown in

To improve the predictive performance of the logistic regression model, the BPNN method was used to generate a new comprehensive variable for the logistic regression model, and a BPNN-LR model was established. To select the number of hidden neurons (k), models were established with k tuned from 4 to 10. The weights between the input and the hidden layer (V_{ij}) and between the hidden and the output layer (W_{j1}) were initialized, and the logistic function was used as the transfer function. After iterative training of the parameters (the weights) of the whole network, the classification accuracies of the test samples for the different k values were shown in

The reported weights between the hidden and the output layer were W_{01} = −4.14, W_{11} = −3.48, W_{21} = −0.71, W_{31} = 0.34, W_{41} = 6.81, W_{51} = −3.04, W_{61} = −3.37 and W_{71} = 5.33. Results of the BPNN-LR model were shown in
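The eight reported weights are consistent with one threshold plus seven hidden-to-output weights. Assuming that reading, the output node can be evaluated as in the sketch below. The hidden activations are hypothetical, because the input-to-hidden weights V_{ij} are not listed and the actual H_{j} values cannot be reproduced.

```python
import math

# Hidden-to-output weights as reported in the text; W[0] is the threshold W_01.
W = [-4.14, -3.48, -0.71, 0.34, 6.81, -3.04, -3.37, 5.33]


def network_output(hidden):
    """Y_1 = f(sum_j W_j1 * H_j) with H_0 = 1 and the logistic function f."""
    if len(hidden) != len(W) - 1:
        raise ValueError("expected 7 hidden activations")
    z = W[0] + sum(w * h for w, h in zip(W[1:], hidden))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical hidden activations, each between 0 and 1.
y = network_output([0.1, 0.2, 0.5, 0.9, 0.1, 0.1, 0.8])
```

The two large positive weights (W_{41} and W_{71}) push the output toward 1 when their hidden nodes are active, while the negative threshold keeps it low otherwise.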

Classification results of the LR model.

Data set | Actual class | Predicted 0 | Predicted 1 | Accuracy (%) |
---|---|---|---|---|
Training set | 0 | 1784 | 129 | 93.25 |
Training set | 1 | 337 | 1138 | 77.15 |
Training set | Total | | | 86.25 |
Test set | 0 | 572 | 89 | 86.53 |
Test set | 1 | 82 | 387 | 84.00 |
Test set | Total | | | 84.87 |

Classification results of the BPNN-LR model.

Data set | Actual class | Predicted 0 | Predicted 1 | Accuracy (%) |
---|---|---|---|---|
Training set | 0 | 1789 | 124 | 93.52 |
Training set | 1 | 339 | 1136 | 77.02 |
Training set | Total | | | 86.33 |
Test set | 0 | 630 | 31 | 95.31 |
Test set | 1 | 105 | 364 | 77.61 |
Test set | Total | | | 87.96 |

In this paper, customer transaction data from China Unicom, covering January to June 2015, were used to establish a personal credit assessment model. The LR model and the BPNN-LR model were each established to discriminate between “good” and “bad” customers. First, the numerical variables were normalized and the classification variables were transformed to binary classification variables. Then, variables were selected according to their contribution, measured by the IV. The five variables with the largest IV were used to establish the model and make predictions. The probability P resulting from the logistic regression model was used to distinguish “good” customers from “bad” customers. The accuracy of the LR model was 86.25% for the training samples and 84.87% for the test samples. To optimize the model, the BPNN method was used to generate a new comprehensive variable for the logistic regression model, and a BPNN-LR model was established. The optimal BPNN included 5 input nodes, 7 hidden nodes and 1 output node. The classification accuracy of the BPNN-LR model was 86.33% for the training samples and 87.96% for the test samples, higher than that of the LR model. This indicated that the new comprehensive variable generated by BPNN was effective in summarizing the characteristic variables. The combination of BPNN with LR modeling provides a technical reference for distinguishing “good” customers from “bad” customers and a basis for corporate decision making.
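The total accuracies quoted above can be recomputed from the confusion-matrix counts in the two result tables. The following Python sketch does so; the function and variable names are illustrative.

```python
def accuracies(tn, fp, fn, tp):
    """Per-class and total accuracy (%) from confusion-matrix counts.

    tn: actual 0, predicted 0    fp: actual 0, predicted 1
    fn: actual 1, predicted 0    tp: actual 1, predicted 1
    """
    good = 100.0 * tn / (tn + fp)
    bad = 100.0 * tp / (fn + tp)
    total = 100.0 * (tn + tp) / (tn + fp + fn + tp)
    return good, bad, total


# Counts taken from the result tables of the two models.
lr_train = accuracies(1784, 129, 337, 1138)
lr_test = accuracies(572, 89, 82, 387)
bpnn_lr_train = accuracies(1789, 124, 339, 1136)
bpnn_lr_test = accuracies(630, 31, 105, 364)
```

The recomputed totals agree with the reported 86.25%, 84.87%, 86.33% and 87.96%. The test-set accuracy for class 1 of the LR model recomputes to about 82.5% from the listed counts rather than the tabulated 84.00%, so one of those table entries appears to be inconsistent.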

This work was supported by the National Natural Science Foundation of China (61505037) and the Natural Science Foundation of Guangxi (2016GXNSFBA380077, 2015GXNSFBA139259).

