Neural Networks and Deep Learning

Idea

Neural networks are non-linear statistical models for regression or classification. A simple special form is the feed-forward Multi-Layer Perceptron, which is fitted with backpropagation.

A schematic example of a Single-Layer Perceptron is shown in the figure below. It consists of the feature inputs (\(x\)), called the input layer, and the outputs (\(y\)), called the output layer, as in a classical regression model. In addition, it has a so-called hidden layer. The values in the hidden layer, also called derived features, are computed as functions of the inputs/features (\(x\)). The fitted values for the outputs (\(y\)) are in turn computed as functions of the derived features. These functions are called Activation Functions.
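
In formulas (a minimal sketch; the notation is chosen here for illustration and is not taken from the figure), the \(M\) derived features \(z_m\) and the fitted output can be written as

\[\begin{align*} z_m = h\left(w_{0m} + \sum_{j} w_{jm} x_j\right), \quad m = 1, \dots, M, \qquad \hat{y} = g\left(v_{0} + \sum_{m=1}^{M} v_m z_m\right), \end{align*}\]

where \(h\) and \(g\) are the Activation Functions of the hidden and output layer, the \(w\)'s and \(v\)'s are the weights, and \(w_{0m}\) and \(v_0\) are the biases.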

For categorical targets or outcomes (e.g., default vs. non-default), a non-linear sigmoid (logistic) function is typically used, which we know from Logistic Regression:

\[\begin{align*} \sigma (v) = \frac{1}{1+\exp(-v)} \end{align*}\]

A network structure or architecture with more than one hidden layer is often called Deep Learning.

A neural network is fitted using the concept of Backpropagation in the following steps. First, we define a loss function; for default risk we typically use cross-entropy. We also apply regularization and an early-stopping criterion to reduce overfitting. We then apply a gradient-descent approach from the outputs back to the inputs (hence 'backpropagation'): in a forward step, we compute predicted values using fixed weights (coefficients); in a backward step, we compute the 'errors' for each layer; we then adjust the weights and repeat the process. The learning rate can be constant or variable (decreasing). The regularization parameter is typically estimated using cross-validation.
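
To make the forward and backward steps concrete, here is a minimal NumPy sketch of one training loop for a network with a single hidden layer, sigmoid activations, and cross-entropy loss. The toy data, the network size, and all names are made up for this illustration and are not used elsewhere in this section.

import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

# made-up toy data: 100 observations, 2 features, binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=(100, 1)).astype(float)

# initial weights and biases: 2 inputs -> 3 hidden neurons -> 1 output
W1 = rng.normal(scale=0.1, size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(3, 1)); b2 = np.zeros(1)
learning_rate = 0.1

for _ in range(100):
    # forward step: derived features and predicted PDs with the current weights
    z = sigmoid(X @ W1 + b1)
    p = sigmoid(z @ W2 + b2)
    # backward step: propagate the 'errors' from the output layer back to the inputs
    delta_out = (p - y) / len(y)                  # gradient of cross-entropy w.r.t. the output pre-activation
    delta_hid = (delta_out @ W2.T) * z * (1 - z)  # error at the hidden layer
    # adjust the weights and biases (gradient descent) and repeat
    W2 -= learning_rate * (z.T @ delta_out); b2 -= learning_rate * delta_out.sum(axis=0)
    W1 -= learning_rate * (X.T @ delta_hid); b1 -= learning_rate * delta_hid.sum(axis=0)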

Simple Network without Hidden Layer

Let us consider an example that shows the analogy between regressions and neural networks. We look at a network with an input layer consisting of two features \(x\) and an output layer consisting of defaults \(y\).

We first create two standard normally distributed and independent random variables \(x_1\) and \(x_2\) with \(n=2000\) observations as training data (and another \(n=2000\) as test data). This is for convenience, so we do not have to standardize the inputs. Next, we specify the weights \(w_1\) and \(w_2\) and a bias. After defining the logistic function (inv_logit()), we compute a linear predictor and the training and test PDs, and randomly draw Bernoulli defaults from them. The \(x\)-variables are our inputs and the defaults (y_train and y_test, respectively) are the outputs. The figure shows the simple network architecture.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (classification_report, confusion_matrix, log_loss,
                             roc_auc_score, brier_score_loss)

np.random.seed(1234)
n = 2000
mean_X = [0, 0]
cov_X = [[1, 0], [0, 1]]

# independent standard normal features for training and test data
X_train = np.random.multivariate_normal(mean_X, cov_X, n)
X_test = np.random.multivariate_normal(mean_X, cov_X, n)

# weights and bias of the data generating process
bias = -1
w1 = 1
w2 = -0.8

def inv_logit(eta):
    return np.exp(eta) / (1 + np.exp(eta))

# linear predictors and PDs
lin_pred_train = bias + w1 * X_train[:, 0] + w2 * X_train[:, 1]
lin_pred_test = bias + w1 * X_test[:, 0] + w2 * X_test[:, 1]

PD_train = inv_logit(lin_pred_train)
PD_test = inv_logit(lin_pred_test)

# Bernoulli defaults
y_train = np.random.binomial(1, PD_train)
y_test = np.random.binomial(1, PD_test)

We then plot the training features and mark the output category (default/non-default) by color. The defaults tend to lie in the lower right.

plt.figure()
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis', edgecolor='k', s=80)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Subsequently, we fit a Logistic Regression. In our case, this is actually the right specification for the model, because the data were generated this way. We check the performance for the test data. We see that the fitted parameters are close to the parameters of the data generating process.

model_lr = LogisticRegression(penalty='none', fit_intercept=True, solver='saga', n_jobs=2, tol=1e-15, max_iter=2000)
model_lr.fit(X_train, y_train)

print('Coefficients:', model_lr.coef_.round(decimals=4))
print('Intercept:', model_lr.intercept_.round(decimals=4))

predictions = model_lr.predict_proba(X_test)[:,1].T
predictions_cat = model_lr.predict(X_test)

print(classification_report(y_test,predictions_cat))
print(confusion_matrix(y_test,predictions_cat))
print("Log Loss:", log_loss(y_test, predictions))
print("AUC:", roc_auc_score(y_test, predictions))
print("Brier score: (the smaller the better):", brier_score_loss(y_test, predictions))
Coefficients: [[ 1.0895 -0.7983]]
Intercept: [-0.9784]
              precision    recall  f1-score   support

           0       0.78      0.89      0.83      1341
           1       0.68      0.48      0.56       659

    accuracy                           0.75      2000
   macro avg       0.73      0.68      0.69      2000
weighted avg       0.74      0.75      0.74      2000

[[1194  147]
 [ 345  314]]
Log Loss: 0.5001806175080926
AUC: 0.7995844833029504
Brier score (the smaller the better): 0.16547722815294402

We next fit a neural network and compare it with the Logistic Regression. We use MLPClassifier() from scikit-learn for fitting. hidden_layer_sizes=() specifies the architecture of the network: the ith entry in the tuple sets the number of neurons in the ith hidden layer. As we only have an input and an output layer and no hidden layer, we leave the tuple empty. We also specify the:

  • Random starting seed for generating initial weights;
  • Learning rate of the network;
  • Maximum number of iterations;
  • Stochastic gradient descent as solver;
  • Validation fraction, together with shuffle=True so that the samples are shuffled in each iteration (the validation fraction is only used when early stopping is enabled);
  • Early-stopping criterion (disabled here via early_stopping=False);
  • Regularization parameter \(\alpha\).

You can also turn the verbosity for the iteration steps on or off. If you specify verbose=True, the output shows the iteration history of the solver, which stops after a couple of iterations.

model_nn = MLPClassifier(hidden_layer_sizes=(), random_state=2, max_iter=500, 
                         solver='sgd', learning_rate='adaptive', learning_rate_init=0.1, validation_fraction=0.3,
                         verbose=False, n_iter_no_change=10, early_stopping=False, alpha=0.01, shuffle=True)
model_nn.fit(X_train, y_train)
MLPClassifier(alpha=0.01, hidden_layer_sizes=(), learning_rate='adaptive',
              learning_rate_init=0.1, max_iter=500, random_state=2,
              solver='sgd', validation_fraction=0.3)

We print the fitted weights and see that the values are close to those obtained by Logistic Regression.

print('Weights:', np.array(model_nn.coefs_).round(4))
print('Biases:', np.array(model_nn.intercepts_).round(4))
Weights: [[[ 1.0889]
  [-0.7982]]]
Biases: [[-0.9785]]

Below, we can see that the model performance for the test data is quite similar to the performance of Logistic Regression. This is not surprising because the data were generated according to Logistic Regression, i.e., a linear combination of features and a logistic (sigmoid) activation function.

predictions = model_nn.predict_proba(X_test)[:,1].T
predictions_cat = model_nn.predict(X_test)

print(classification_report(y_test,predictions_cat))
print(confusion_matrix(y_test,predictions_cat))
print("Log Loss:", log_loss(y_test, predictions))
print("AUC:", roc_auc_score(y_test, predictions))  
print("Brier score: (the smaller the better):", brier_score_loss(y_test, predictions))
              precision    recall  f1-score   support

           0       0.77      0.89      0.83      1341
           1       0.68      0.47      0.56       659

    accuracy                           0.75      2000
   macro avg       0.73      0.68      0.69      2000
weighted avg       0.74      0.75      0.74      2000

[[1194  147]
 [ 347  312]]
Log Loss: 0.5001799406611452
AUC: 0.7995833517215315
Brier score (the smaller the better): 0.16547766658996946
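
As a further check (a small sketch, assuming model_lr and model_nn from above are still in memory), we can compare the predicted PDs of the two models directly; the difference should be negligible:

pd_lr = model_lr.predict_proba(X_test)[:, 1]
pd_nn = model_nn.predict_proba(X_test)[:, 1]
print("Maximum absolute difference in predicted PDs:", np.abs(pd_lr - pd_nn).max())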

Neural Network with Hidden Layers and Non-Linearity

Next, we create our samples using the generator make_circles from scikit-learn. It creates a two-dimensional binary classification dataset consisting of a large circle containing a smaller circle, with Gaussian noise added to the points. We again create train and test data and plot both. We see that the defaults now lie inside the circle and the non-defaults outside.

from sklearn.datasets import make_circles

np.random.seed(1234)
n = 200

# two noisy concentric circles; class 1 (defaults) is the inner circle
X_train, y_train = make_circles(n_samples=n, noise=0.2, factor=0.5, random_state=1)
X_test, y_test = make_circles(n_samples=n, noise=0.2, factor=0.5, random_state=2)

plt.subplots_adjust(hspace=0.4, wspace=0.4)

plt.subplot(221)
plt.title('Train')
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis', edgecolor='k', s=100)
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.subplot(222)
plt.title('Test')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='viridis', edgecolor='k', s=100)
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.show()

Subsequently, we fit a Logistic Regression as before using \(x_1\) and \(x_2\). Because the true class boundary is now circular, a model that is linear in the features cannot separate the classes well, and the performance is close to random. The validation metrics cannot be interpreted on a standalone basis because they are data-dependent; therefore, we fit a simple neural network for comparison (a small specification check follows the output below).

model_lr = LogisticRegression(penalty='none', fit_intercept=True, solver='saga', n_jobs=2, tol=1e-15, max_iter=2000)
model_lr.fit(X_train, y_train)

print(model_lr.coef_.round(decimals=4))
print(model_lr.intercept_.round(decimals=4))

predictions = model_lr.predict_proba(X_test)[:,1].T
predictions_cat = model_lr.predict(X_test)

print(classification_report(y_test,predictions_cat))
print(confusion_matrix(y_test,predictions_cat))
print("Log Loss:", log_loss(y_test, predictions))
print("AUC:", roc_auc_score(y_test, predictions))  
print("Brier score: (the smaller the better):", brier_score_loss(y_test, predictions))
[[-0.1341 -0.0327]]
[0.0022]
              precision    recall  f1-score   support

           0       0.50      0.51      0.50       100
           1       0.50      0.49      0.49       100

    accuracy                           0.50       200
   macro avg       0.50      0.50      0.50       200
weighted avg       0.50      0.50      0.50       200

[[51 49]
 [51 49]]
Log Loss: 0.6948995583035551
AUC: 0.4894
Brier score (the smaller the better): 0.2508742518229471
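
The near-random performance is a specification issue rather than a data issue. As a quick check (a hedged sketch, not part of the original comparison; the engineered feature and all names are chosen here for illustration), adding the squared radius \(x_1^2 + x_2^2\) as a third feature makes the class boundary linear in the features again, so the Logistic Regression should then perform well:

# hypothetical specification check: add the squared radius as an engineered feature
r2_train = (X_train ** 2).sum(axis=1, keepdims=True)
r2_test = (X_test ** 2).sum(axis=1, keepdims=True)

model_lr_r2 = LogisticRegression(penalty='none', fit_intercept=True, solver='saga', max_iter=2000)
model_lr_r2.fit(np.hstack([X_train, r2_train]), y_train)

predictions_r2 = model_lr_r2.predict_proba(np.hstack([X_test, r2_test]))[:, 1]
print("AUC with squared radius feature:", roc_auc_score(y_test, predictions_r2))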

We fit a neural network with one hidden layer of five neurons using hidden_layer_sizes=(5). We specify the hyperbolic tangent function (activation='tanh') as the activation for the hidden layer, i.e., for the connection between the inputs and the hidden layer. We turn off the verbosity because there are many iterations.

model_nn = MLPClassifier(hidden_layer_sizes=(5), random_state=2, max_iter=1000, warm_start=True, 
                         activation='tanh', 
                         solver='sgd', learning_rate='adaptive', learning_rate_init=0.1, validation_fraction=0.3,
                         verbose=False, n_iter_no_change=10, early_stopping=False, alpha=0.01, shuffle=True)
model_nn.fit(X_train, y_train)
MLPClassifier(activation='tanh', alpha=0.01, hidden_layer_sizes=5,
              learning_rate='adaptive', learning_rate_init=0.1, max_iter=1000,
              random_state=2, solver='sgd', validation_fraction=0.3,
              warm_start=True)

The following code computes the total number of parameters. We estimate 21 coefficients in total as we have:

  • 15 coefficients for the connection between the features and the hidden layer: two features and one hidden layer with five neurons, i.e., two times five plus five biases;
  • Six coefficients for the connection between the hidden layer and the outputs: five neurons times one outcome plus one bias.

total_parameters = 0

for i in range(len(model_nn.coefs_)):
    print('Layer connection:', i, '| weight matrix shape:', model_nn.coefs_[i].shape,
          '| bias matrix shape:', model_nn.intercepts_[i].shape)
    # number of weights plus number of biases for this layer connection
    total_parameters += (model_nn.coefs_[i].shape[0] * model_nn.coefs_[i].shape[1]
                         + model_nn.intercepts_[i].shape[0])

print('Total number of parameters:', total_parameters)
Layer connection: 0 | weight matrix shape: (2, 5) | bias matrix shape: (5,)
Layer connection: 1 | weight matrix shape: (5, 1) | bias matrix shape: (1,)
Total number of parameters: 21

predictions = model_nn.predict_proba(X_test)[:,1].T
predictions_cat = model_nn.predict(X_test)

print(classification_report(y_test,predictions_cat))
print(confusion_matrix(y_test,predictions_cat))
print("Log Loss:", log_loss(y_test, predictions))
print("AUC:", roc_auc_score(y_test, predictions))
print("Brier score (the smaller the better):", brier_score_loss(y_test, predictions))
              precision    recall  f1-score   support

           0       0.85      0.92      0.88       100
           1       0.91      0.84      0.87       100

    accuracy                           0.88       200
   macro avg       0.88      0.88      0.88       200
weighted avg       0.88      0.88      0.88       200

[[92  8]
 [16 84]]
Log Loss: 0.272140813522943
AUC: 0.9551
Brier score (the smaller the better): 0.08273066180945907

All performance measures have greatly improved. You may alter the hyperparameters of the neural network (e.g., the number of layers and neurons) and check the performance.
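
To visualize the non-linearity the hidden layer has learned, the following sketch (assuming model_nn from above is still fitted; the grid range and resolution are chosen arbitrarily) plots the predicted PDs of the network on a grid of feature values together with the test data. The high-PD region should trace the inner circle.

# evaluate the fitted network on a grid of feature values
xx, yy = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])
PD_grid = model_nn.predict_proba(grid)[:, 1].reshape(xx.shape)

plt.figure()
plt.contourf(xx, yy, PD_grid, levels=20, cmap='viridis', alpha=0.5)
plt.colorbar(label='Predicted PD')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='viridis', edgecolor='k', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()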