
dnn() fits a custom deep neural network using the Multilayer Perceptron architecture. dnn() supports the formula syntax and allows the neural network to be customized to a maximal degree.

Usage

dnn(
  formula = NULL,
  data = NULL,
  hidden = c(50L, 50L),
  activation = "selu",
  bias = TRUE,
  dropout = 0,
  loss = c("mse", "mae", "softmax", "cross-entropy", "gaussian", "binomial", "poisson",
    "mvp", "nbinom"),
  validation = 0,
  lambda = 0,
  alpha = 0.5,
  optimizer = c("sgd", "adam", "adadelta", "adagrad", "rmsprop", "rprop"),
  lr = 0.01,
  batchsize = NULL,
  burnin = 30,
  baseloss = NULL,
  shuffle = TRUE,
  epochs = 100,
  bootstrap = NULL,
  bootstrap_parallel = FALSE,
  plot = TRUE,
  verbose = TRUE,
  lr_scheduler = NULL,
  custom_parameters = NULL,
  device = c("cpu", "cuda", "mps"),
  early_stopping = FALSE,
  tuning = config_tuning(),
  X = NULL,
  Y = NULL
)

Arguments

formula

an object of class "formula": a description of the model that should be fitted

data

matrix or data.frame with features/predictors and response variable

hidden

hidden units in layers, length of hidden corresponds to number of layers

activation

activation functions; either a single value used for all layers, or a vector with one activation function per hidden layer

bias

whether to use biases in the layers; either a single logical used for all layers, or a vector of logicals of length number of hidden layers + 1 (for the output layer).

dropout

dropout rate, probability of a node getting left out during training (see nn_dropout)

loss

loss after which the network should be optimized. Can also be a distribution from the stats package or a custom function, see details

validation

percentage of data set that should be taken as validation set (chosen randomly)

lambda

strength of regularization: lambda penalty, \(\lambda * (L1 + L2)\) (see alpha)

alpha

weighting between L1 and L2 regularization: \((1 - \alpha) * |weights| + \alpha * ||weights||^2\) will be added for each layer (scaled by lambda). Must be between 0 and 1

optimizer

which optimizer is used for training the network; for further adjustments to the optimizer see config_optimizer

lr

learning rate given to optimizer

batchsize

number of samples that are used to calculate one learning rate step, default is 10% of the training data

burnin

training is aborted if the training loss is not below the baseline loss after burnin epochs

baseloss

baseline loss; if NULL, the baseline loss corresponds to an intercept-only model

shuffle

if TRUE, data in each batch gets reshuffled every epoch

epochs

epochs the training goes on for

bootstrap

whether to bootstrap the neural network; a numeric value corresponds to the number of bootstrap samples

bootstrap_parallel

parallelize (CPU) bootstrapping

plot

plot training loss

verbose

print training and validation loss of epochs

lr_scheduler

learning rate scheduler created with config_lr_scheduler

custom_parameters

List of parameters/variables to be optimized. Can be used in a custom loss function. See Vignette for example.

device

device on which the network should be trained. "mps" corresponds to Apple M1/M2 GPUs.

early_stopping

if set to an integer, training will stop if the loss has increased for the defined number of epochs in a row; the validation loss is used if available.

tuning

tuning options created with config_tuning

X

Feature matrix or data.frame, alternative data interface

Y

Response vector, factor, matrix or data.frame, alternative data interface

Value

an S3 object of class "cito.dnn" is returned. It is a list containing everything there is to know about the model and its training process. The list consists of the following attributes:

net

An object of class "nn_sequential" "nn_module", which originates from the torch package and represents the core object of this workflow.

call

The original function call

loss

A list which contains relevant information for the target variable and the used loss function

data

Contains data used for training the model

weights

List of weights for each training epoch

use_model_epoch

Integer, which defines which model from which training epoch should be used for prediction. 1 = best model, 2 = last model

loaded_model_epoch

Integer, shows which model from which epoch is loaded currently into model$net.

model_properties

A list of properties of the neural network, contains number of input nodes, number of output nodes, size of hidden layers, activation functions, whether bias is included and if dropout layers are included.

training_properties

A list of all training parameters that were used the last time the model was trained: learning rate, information about the learning rate scheduler, information about the optimizer, number of epochs, whether early stopping was used, whether plotting was active, lambda and alpha for L1/L2 regularization, batchsize, shuffle, whether the data set was split into validation and training data, which formula was used for training, and at which epoch the training stopped.

losses

A data.frame containing training and validation losses of each epoch

Activation functions

Supported activation functions: "relu", "leaky_relu", "tanh", "elu", "rrelu", "prelu", "softplus", "celu", "selu", "gelu", "relu6", "sigmoid", "softsign", "hardtanh", "tanhshrink", "softshrink", "hardshrink", "log_sigmoid"

Loss functions / Likelihoods

We support loss functions and likelihoods for different tasks:

Name           Explanation                     Example / Task
mse            mean squared error              Regression, predicting continuous values
mae            mean absolute error             Regression, predicting continuous values
softmax        categorical cross-entropy       Multi-class, species classification
cross-entropy  categorical cross-entropy       Multi-class, species classification
gaussian       Normal likelihood               Regression, residual error is also estimated (similar to stats::lm())
binomial       Binomial likelihood             Classification/logistic regression, e.g. mortality
poisson        Poisson likelihood              Regression, count data, e.g. species abundances
nbinom         Negative binomial likelihood    Regression, count data with a dispersion parameter
mvp            Multivariate probit model       Joint species distribution model, multi-species (presence/absence)
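
For example, a Poisson likelihood is the natural choice for count data. A minimal sketch with simulated data (the column names are made up for illustration):

# Simulated count data, e.g. species abundances
counts_df <- data.frame(abundance = rpois(200, lambda = 3),
                        temperature = rnorm(200),
                        precipitation = rnorm(200))
fit <- dnn(abundance ~ temperature + precipitation, data = counts_df,
           loss = "poisson", epochs = 50L, plot = FALSE, verbose = FALSE)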

Training and convergence of neural networks

Ensuring convergence can be tricky when training neural networks. Their training is sensitive to a combination of the learning rate (how much the weights are updated in each optimization step), the batch size (the size of the random subset of the data used in each optimization step), and the number of epochs (the number of optimization steps). Typically, the learning rate should be decreased with the size of the neural network (depth of the network and width of the hidden layers). We provide a baseline loss (intercept-only model) that can give hints about an appropriate learning rate.

Learning rates

If the training loss of the model doesn't fall below the baseline loss, the learning rate is either too high or too low. If this happens, try higher and lower learning rates.

A common strategy is to try (manually) a few different learning rates to see if the learning rate is on the right scale.
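
For example, a quick manual scan over a few candidate learning rates (the values are illustrative) could look like this:

# Fit the same model with several learning rates and compare the training
# loss against the reported baseline loss
fits <- lapply(c(0.1, 0.01, 0.001), function(lr) {
  dnn(Species ~ ., data = datasets::iris, loss = "softmax",
      lr = lr, epochs = 50L, plot = FALSE, verbose = FALSE)
})
# Inspect the loss curves, e.g. with analyze_training(fits[[1]])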

See the troubleshooting vignette (vignette("B-Training_neural_networks")) for more help on training and debugging neural networks.

Finding the right architecture

As with the learning rate, there is no definitive guide to choosing the right architecture for the right task. However, there are some general rules/recommendations: In general, wider and deeper neural networks can improve generalization - but this is a double-edged sword because it also increases the risk of overfitting. So, if you increase the width and depth of the network, you should also add regularization (e.g., by increasing the lambda parameter, which corresponds to the regularization strength). Furthermore, in Pichler & Hartig, 2023, we investigated the effects of the hyperparameters on the prediction performance as a function of the data size. For example, we found that the selu activation function outperforms relu for small data sizes (<100 observations).

We recommend starting with moderate sizes (like the defaults), and if the model doesn't generalize/converge, try larger networks along with a regularization that helps minimize the risk of overfitting (see vignette("B-Training_neural_networks") ).
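
As a rough sketch (the layer sizes and regularization strength are illustrative, not recommendations for a specific data set), a larger network with added regularization could be specified as:

nn.wide <- dnn(Species ~ ., data = datasets::iris, loss = "softmax",
               hidden = c(100L, 100L, 100L), # wider and deeper than the default c(50L, 50L)
               lambda = 0.001, alpha = 0.5,  # elastic-net regularization against overfitting
               plot = FALSE, verbose = FALSE)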

Overfitting

Overfitting means that the model fits the training data well but generalizes poorly to new observations. We can use the validation argument to detect overfitting. If the validation loss starts to increase again at a certain point, it often means that the model is starting to overfit your training data:

[Figure: training and validation loss curves; the validation loss increasing again indicates overfitting]

Solutions:

  • Re-train with epochs = point where model started to overfit

  • Early stopping: stop training when the model starts to overfit; can be specified using the early_stopping = … argument (see the sketch after this list)

  • Use regularization (dropout or elastic-net, see next section)
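
A minimal sketch of the early-stopping option above (the validation split and patience values are illustrative):

nn.fit <- dnn(Species ~ ., data = datasets::iris, loss = "softmax",
              validation = 0.2,     # hold out 20% of the data to monitor overfitting
              early_stopping = 10L, # stop once the loss has increased for 10 epochs in a row
              epochs = 300L, plot = FALSE, verbose = FALSE)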

Regularization

Elastic Net regularization combines the strengths of L1 (Lasso) and L2 (Ridge) regularization. It introduces a penalty term that encourages sparse weight values while maintaining overall weight shrinkage. By controlling the sparsity of the learned model, Elastic Net regularization helps avoid overfitting while allowing for meaningful feature selection. We advise using elastic net (e.g. lambda = 0.001 and alpha = 0.2).

Dropout regularization helps prevent overfitting by randomly disabling a portion of neurons during training. This technique encourages the network to learn more robust and generalized representations, as it prevents individual neurons from relying too heavily on specific input patterns. Dropout has been widely adopted as a simple yet effective regularization method in deep learning.
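
A minimal sketch combining both techniques (the values follow the advice above but remain illustrative):

nn.reg <- dnn(Species ~ ., data = datasets::iris, loss = "softmax",
              lambda = 0.001, alpha = 0.2, # elastic-net penalty
              dropout = 0.2,               # randomly drop 20% of the nodes during training
              plot = FALSE, verbose = FALSE)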

By utilizing these regularization methods in your neural network training with the cito package, you can improve generalization performance and enhance the network's ability to handle unseen data. These techniques act as valuable tools in mitigating overfitting and promoting more robust and reliable model performance.

Uncertainty

We can use bootstrapping to generate uncertainties for all outputs. Bootstrapping can be enabled by setting bootstrap = ... to the number of bootstrap samples to be used. Note, however, that the computational cost can be excessive.

In some cases it may be worthwhile to parallelize bootstrapping, for example if you have a GPU and the neural network is small. Parallelization for bootstrapping can be enabled by setting the bootstrap_parallel = ... argument to the desired number of calls to run in parallel.
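
A minimal sketch (the number of bootstrap samples and parallel processes is illustrative):

nn.boot <- dnn(Species ~ ., data = datasets::iris, loss = "softmax",
               bootstrap = 30L,         # 30 bootstrap samples
               bootstrap_parallel = 5L, # run 5 bootstrap fits in parallel on the CPU
               plot = FALSE, verbose = FALSE)
summary(nn.boot) # xAI metrics and predictions now come with uncertainties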

Custom Optimizer and Learning Rate Schedulers

When training a network, you have the flexibility to customize the optimizer settings and learning rate scheduler to optimize the learning process. In the cito package, you can initialize these configurations using the config_lr_scheduler and config_optimizer functions.

config_lr_scheduler allows you to define a specific learning rate scheduler that controls how the learning rate changes over time during training. This is beneficial in scenarios where you want to adaptively adjust the learning rate to improve convergence or avoid getting stuck in local optima.

Similarly, the config_optimizer function enables you to specify the optimizer for your network. Different optimizers, such as stochastic gradient descent (SGD), Adam, or RMSprop, offer various strategies for updating the network's weights and biases during training. Choosing the right optimizer can significantly impact the training process and the final performance of your neural network.
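
A minimal sketch (the scheduler arguments shown here, step_size and gamma, are passed through to the corresponding torch scheduler and should be treated as an assumption; see config_lr_scheduler for the options of your installed version):

sched <- config_lr_scheduler("step", step_size = 30, gamma = 0.1) # decay the lr every 30 epochs
opt   <- config_optimizer("adam")
nn.fit <- dnn(Species ~ ., data = datasets::iris, loss = "softmax",
              lr = 0.05, optimizer = opt, lr_scheduler = sched,
              plot = FALSE, verbose = FALSE)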

Hyperparameter tuning

We have implemented experimental support for hyperparameter tuning. Hyperparameters that should be tuned by cito can be marked by setting their values to tune(), for example dnn(..., lr = tune()). tune() is a function that creates a range of random values for the given hyperparameter. You can change the minimum and maximum of the tuning range or pass custom values to the tune(values = c(....)) function. The following table lists the hyperparameters that can currently be tuned:

Hyperparameter  Example                                           Details
hidden          dnn(..., hidden = tune(10, 20, fixed = 'depth'))  Depth and width can both be tuned, or only one of them; if both should be tuned, vectors for the lower and upper boundaries must be provided (first value = number of nodes)
bias            dnn(..., bias = tune())                           Should the bias be turned on or off for all hidden layers
lambda          dnn(..., lambda = tune(0.0001, 0.1))              lambda will be tuned within the range (0.0001, 0.1)
alpha           dnn(..., alpha = tune(0.2, 0.4))                  alpha will be tuned within the range (0.2, 0.4)
activation      dnn(..., activation = tune())                     Activation functions of the hidden layers will be tuned
dropout         dnn(..., dropout = tune())                        Dropout rate will be tuned (globally for all layers)
lr              dnn(..., lr = tune())                             Learning rate will be tuned
batchsize       dnn(..., batchsize = tune())                      Batch size will be tuned
epochs          dnn(..., epochs = tune())                         Number of epochs will be tuned

The hyperparameters are tuned by random search (i.e., random values for the hyperparameters within a specified range) and by cross-validation. The exact tuning regime can be specified with config_tuning.
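
For example, a smaller tuning regime than the default could look like this (a minimal sketch; the argument names CV and steps are assumed from the config_tuning documentation and should be checked against your installed version):

nn.fit <- dnn(Species ~ ., data = datasets::iris, loss = "softmax",
              lr = tune(0.001, 0.1),                      # tune lr within (0.001, 0.1)
              tuning = config_tuning(CV = 3, steps = 15), # 3-fold CV, 15 random-search steps
              epochs = 50L, verbose = FALSE)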

Note that hyperparameter tuning can be expensive. We have implemented an option to parallelize hyperparameter tuning, including parallelization over one or more GPUs (the hyperparameter evaluation is parallelized, not the CV). This can be especially useful for small models. For example, if you have 4 GPUs, 20 CPU cores, and 20 steps (random samples from the random search), you could run dnn(..., device = "cuda", lr = tune(), batchsize = tune(), tuning = config_tuning(parallel = 20, NGPU = 4)), which will distribute the 20 model fits across the 4 GPUs, so that each GPU processes 5 models (in parallel).

As this is an experimental feature, we welcome feature requests and bug reports on our GitHub site.

For the custom values, all hyperparameters except for the hidden layers require a vector of values. Hidden layers expect a two-column matrix where the first column is the number of hidden nodes and the second column corresponds to the number of hidden layers.

How neural networks work

In Multilayer Perceptron (MLP) networks, each neuron is connected to every neuron in the previous layer and every neuron in the subsequent layer. The value of each neuron is computed using a weighted sum of the outputs from the previous layer, followed by the application of an activation function. Specifically, the value of a neuron is calculated as the weighted sum of the outputs of the neurons in the previous layer, combined with a bias term. This sum is then passed through an activation function, which introduces non-linearity into the network. The calculated value of each neuron becomes the input for the neurons in the next layer, and the process continues until the output layer is reached. The choice of activation function and the specific weight values determine the network's ability to learn and approximate complex relationships between inputs and outputs.

The value of each neuron can therefore be calculated as \(a(\sum_j{w_j * a_j} + b)\), where \(w_j\) is the weight of the connection from neuron \(j\) to the current neuron, \(a_j\) is the value of neuron \(j\), and \(b\) is the bias term. \(a()\) is the activation function, e.g. \(relu(x) = max(0,x)\).
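
Written out for a single neuron in base R (the numbers are arbitrary):

relu   <- function(x) pmax(0, x) # activation function a()
a_prev <- c(0.2, -0.5, 1.0)      # values a_j of the neurons in the previous layer
w      <- c(0.1, 0.4, -0.3)      # weights w_j into the current neuron
b      <- 0.05                   # bias term
a_neuron <- relu(sum(w * a_prev) + b) # value passed on to the next layer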

Training on graphic cards

If you have an NVIDIA CUDA-enabled device and have installed the CUDA toolkit version 11.3 and cuDNN 8.4, you can take advantage of GPU acceleration for training your neural networks. It is crucial to have these specific versions installed, as other versions may not be compatible. For detailed installation instructions and more information on utilizing GPUs for training, please refer to the mlverse: 'torch' documentation.

Note: GPU training is optional, and the package can still be used for training on CPU even without CUDA and cuDNN installations.
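
Selecting the device is a single argument; a minimal sketch:

# Train on an NVIDIA GPU (requires the CUDA/cuDNN setup described above)
nn.gpu <- dnn(Species ~ ., data = datasets::iris, loss = "softmax", device = "cuda")
# Apple silicon (M1/M2): device = "mps"; CPU training (the default): device = "cpu"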

Author

Christian Amesoeder, Maximilian Pichler

Examples

# \donttest{
if(torch::torch_is_installed()){
library(cito)

# Example workflow in cito

## Build and train a network
### softmax is used for multi-class responses (e.g., Species)
nn.fit<- dnn(Species~., data = datasets::iris, loss = "softmax")

## The training loss is below the baseline loss but at the end of the
## training the loss was still decreasing, so continue training for another 50
## epochs
nn.fit <- continue_training(nn.fit, epochs = 50L)

# Structure of the neural network
print(nn.fit)

# Plot Neural Network
plot(nn.fit)
## 4 Input nodes (first layer) because of 4 features
## 3 Output nodes (last layer) because of 3 response species (one node for each
## level in the response variable).
## The layers between the input and output layer are called hidden layers (two
## of them)

## We now want to understand how the predictions are made, what are the
## important features? The summary function automatically calculates feature
## importance (the interpretation is similar to an anova) and calculates
## average conditional effects that are similar to linear effects:
summary(nn.fit)

## To visualize the effect (response-feature effect), we can use the ALE and
## PDP functions

# Partial dependencies
PDP(nn.fit, variable = "Petal.Length")

# Accumulated local effect plots
ALE(nn.fit, variable = "Petal.Length")



# Per se, it is difficult to get confidence intervals for our xAI metrics (or
# for the predictions). But we can use bootstrapping to obtain uncertainties
# for all cito outputs:
## Re-fit the neural network with bootstrapping
nn.fit<- dnn(Species~.,
             data = datasets::iris,
             loss = "softmax",
             epochs = 150L,
             verbose = FALSE,
             bootstrap = 20L)
## convergence can be tested via the analyze_training function
analyze_training(nn.fit)

## Summary for xAI metrics (can take some time):
summary(nn.fit)
## Now with standard errors and p-values
## Note: Take the p-values with a grain of salt! We do not know yet if they are
## correct (e.g. if you use regularization, they are likely conservative == too
## large)

## Predictions with bootstrapping:
dim(predict(nn.fit))
## predictions are by default averaged (over the bootstrap samples)



# Hyperparameter tuning (experimental feature)
hidden_values = matrix(c(5, 2,
                         4, 2,
                         10,2,
                         15,2), 4, 2, byrow = TRUE)
## Potential architectures we want to test, first column == number of nodes
print(hidden_values)

nn.fit = dnn(Species~.,
             data = iris,
             epochs = 30L,
             loss = "softmax",
             hidden = tune(values = hidden_values),
             lr = tune(0.00001, 0.1) # tune lr between range 0.00001 and 0.1
             )
## Tuning results:
print(nn.fit$tuning)

# test = Inf means that tuning was cancelled after only one fit (within the CV)


# Advanced: Custom loss functions and additional parameters
## Normal Likelihood with sd parameter:
## (`scale` is not defined inside this function; cito makes it available from
## the `custom_parameters` list passed to dnn() below)
custom_loss = function(pred, true) {
  logLik = torch::distr_normal(pred,
                               scale = torch::nnf_relu(scale)+
                                 0.001)$log_prob(true)
  return(-logLik$mean())
}

nn.fit<- dnn(Sepal.Length~.,
             data = datasets::iris,
             loss = custom_loss,
             verbose = FALSE,
             custom_parameters = list(scale = 1.0)
)
nn.fit$parameter$scale

## Multivariate normal likelihood with parametrized covariance matrix
## Sigma = L*L^t + D
## Helper function to build covariance matrix
create_cov = function(LU, Diag) {
  return(torch::torch_matmul(LU, LU$t()) + torch::torch_diag(Diag$exp()+0.01))
}

custom_loss_MVN = function(true, pred) {
  Sigma = create_cov(SigmaPar, SigmaDiag)
  logLik = torch::distr_multivariate_normal(pred,
                                            covariance_matrix = Sigma)$
    log_prob(true)
  return(-logLik$mean())
}


nn.fit<- dnn(cbind(Sepal.Length, Sepal.Width, Petal.Length)~.,
             data = datasets::iris,
             lr = 0.01,
             verbose = FALSE,
             loss = custom_loss_MVN,
             custom_parameters =
               list(SigmaDiag =  rep(0, 3),
                    SigmaPar = matrix(rnorm(6, sd = 0.001), 3, 2))
)
as.matrix(create_cov(nn.fit$loss$parameter$SigmaPar,
                     nn.fit$loss$parameter$SigmaDiag))

}
#> Loss at epoch 1: 1.149090, lr: 0.01000

#> Loss at epoch 2: 0.949007, lr: 0.01000
#> Loss at epoch 3: 0.795534, lr: 0.01000
#> Loss at epoch 4: 0.698546, lr: 0.01000
#> Loss at epoch 5: 0.623767, lr: 0.01000
#> Loss at epoch 6: 0.569265, lr: 0.01000
#> Loss at epoch 7: 0.516603, lr: 0.01000
#> Loss at epoch 8: 0.486555, lr: 0.01000
#> Loss at epoch 9: 0.443597, lr: 0.01000
#> Loss at epoch 10: 0.427916, lr: 0.01000
#> Loss at epoch 11: 0.402713, lr: 0.01000
#> Loss at epoch 12: 0.387409, lr: 0.01000
#> Loss at epoch 13: 0.373584, lr: 0.01000
#> Loss at epoch 14: 0.361729, lr: 0.01000
#> Loss at epoch 15: 0.349155, lr: 0.01000
#> Loss at epoch 16: 0.328656, lr: 0.01000
#> Loss at epoch 17: 0.304835, lr: 0.01000
#> Loss at epoch 18: 0.295993, lr: 0.01000
#> Loss at epoch 19: 0.285383, lr: 0.01000
#> Loss at epoch 20: 0.276205, lr: 0.01000
#> Loss at epoch 21: 0.256072, lr: 0.01000
#> Loss at epoch 22: 0.257092, lr: 0.01000
#> Loss at epoch 23: 0.251295, lr: 0.01000
#> Loss at epoch 24: 0.239248, lr: 0.01000
#> Loss at epoch 25: 0.232020, lr: 0.01000
#> Loss at epoch 26: 0.214081, lr: 0.01000
#> Loss at epoch 27: 0.209653, lr: 0.01000
#> Loss at epoch 28: 0.211463, lr: 0.01000
#> Loss at epoch 29: 0.195139, lr: 0.01000
#> Loss at epoch 30: 0.195001, lr: 0.01000
#> Loss at epoch 31: 0.178946, lr: 0.01000
#> Loss at epoch 32: 0.187315, lr: 0.01000
#> Loss at epoch 33: 0.185912, lr: 0.01000
#> Loss at epoch 34: 0.170181, lr: 0.01000
#> Loss at epoch 35: 0.162988, lr: 0.01000
#> Loss at epoch 36: 0.164903, lr: 0.01000
#> Loss at epoch 37: 0.156705, lr: 0.01000
#> Loss at epoch 38: 0.170339, lr: 0.01000
#> Loss at epoch 39: 0.157090, lr: 0.01000
#> Loss at epoch 40: 0.148665, lr: 0.01000
#> Loss at epoch 41: 0.148732, lr: 0.01000
#> Loss at epoch 42: 0.139814, lr: 0.01000
#> Loss at epoch 43: 0.135497, lr: 0.01000
#> Loss at epoch 44: 0.135846, lr: 0.01000
#> Loss at epoch 45: 0.146827, lr: 0.01000
#> Loss at epoch 46: 0.134778, lr: 0.01000
#> Loss at epoch 47: 0.128255, lr: 0.01000
#> Loss at epoch 48: 0.128266, lr: 0.01000
#> Loss at epoch 49: 0.123065, lr: 0.01000
#> Loss at epoch 50: 0.131524, lr: 0.01000
#> Loss at epoch 51: 0.152062, lr: 0.01000
#> Loss at epoch 52: 0.123343, lr: 0.01000
#> Loss at epoch 53: 0.112698, lr: 0.01000
#> Loss at epoch 54: 0.107688, lr: 0.01000
#> Loss at epoch 55: 0.108624, lr: 0.01000
#> Loss at epoch 56: 0.110669, lr: 0.01000
#> Loss at epoch 57: 0.109462, lr: 0.01000
#> Loss at epoch 58: 0.108406, lr: 0.01000
#> Loss at epoch 59: 0.123964, lr: 0.01000
#> Loss at epoch 60: 0.103912, lr: 0.01000
#> Loss at epoch 61: 0.106795, lr: 0.01000
#> Loss at epoch 62: 0.103615, lr: 0.01000
#> Loss at epoch 63: 0.111234, lr: 0.01000
#> Loss at epoch 64: 0.099051, lr: 0.01000
#> Loss at epoch 65: 0.095832, lr: 0.01000
#> Loss at epoch 66: 0.101918, lr: 0.01000
#> Loss at epoch 67: 0.097386, lr: 0.01000
#> Loss at epoch 68: 0.104442, lr: 0.01000
#> Loss at epoch 69: 0.096646, lr: 0.01000
#> Loss at epoch 70: 0.103415, lr: 0.01000
#> Loss at epoch 71: 0.096700, lr: 0.01000
#> Loss at epoch 72: 0.089756, lr: 0.01000
#> Loss at epoch 73: 0.101084, lr: 0.01000
#> Loss at epoch 74: 0.094696, lr: 0.01000
#> Loss at epoch 75: 0.095601, lr: 0.01000
#> Loss at epoch 76: 0.086005, lr: 0.01000
#> Loss at epoch 77: 0.098505, lr: 0.01000
#> Loss at epoch 78: 0.094670, lr: 0.01000
#> Loss at epoch 79: 0.087046, lr: 0.01000
#> Loss at epoch 80: 0.107928, lr: 0.01000
#> Loss at epoch 81: 0.087984, lr: 0.01000
#> Loss at epoch 82: 0.099143, lr: 0.01000
#> Loss at epoch 83: 0.085078, lr: 0.01000
#> Loss at epoch 84: 0.084471, lr: 0.01000
#> Loss at epoch 85: 0.102366, lr: 0.01000
#> Loss at epoch 86: 0.086834, lr: 0.01000
#> Loss at epoch 87: 0.089890, lr: 0.01000
#> Loss at epoch 88: 0.084399, lr: 0.01000
#> Loss at epoch 89: 0.085281, lr: 0.01000
#> Loss at epoch 90: 0.092049, lr: 0.01000
#> Loss at epoch 91: 0.088786, lr: 0.01000
#> Loss at epoch 92: 0.085713, lr: 0.01000
#> Loss at epoch 93: 0.083635, lr: 0.01000
#> Loss at epoch 94: 0.111323, lr: 0.01000
#> Loss at epoch 95: 0.096476, lr: 0.01000
#> Loss at epoch 96: 0.075609, lr: 0.01000
#> Loss at epoch 97: 0.088243, lr: 0.01000
#> Loss at epoch 98: 0.076815, lr: 0.01000
#> Loss at epoch 99: 0.083597, lr: 0.01000
#> Loss at epoch 100: 0.089081, lr: 0.01000
#> Loss at epoch 101: 0.087389, lr: 0.01000

#> Loss at epoch 102: 0.091757, lr: 0.01000
#> Loss at epoch 103: 0.087825, lr: 0.01000
#> Loss at epoch 104: 0.094488, lr: 0.01000
#> Loss at epoch 105: 0.082026, lr: 0.01000
#> Loss at epoch 106: 0.106554, lr: 0.01000
#> Loss at epoch 107: 0.087043, lr: 0.01000
#> Loss at epoch 108: 0.086886, lr: 0.01000
#> Loss at epoch 109: 0.074416, lr: 0.01000
#> Loss at epoch 110: 0.075379, lr: 0.01000
#> Loss at epoch 111: 0.070000, lr: 0.01000
#> Loss at epoch 112: 0.088540, lr: 0.01000
#> Loss at epoch 113: 0.066166, lr: 0.01000
#> Loss at epoch 114: 0.073379, lr: 0.01000
#> Loss at epoch 115: 0.074613, lr: 0.01000
#> Loss at epoch 116: 0.067626, lr: 0.01000
#> Loss at epoch 117: 0.088594, lr: 0.01000
#> Loss at epoch 118: 0.081651, lr: 0.01000
#> Loss at epoch 119: 0.102858, lr: 0.01000
#> Loss at epoch 120: 0.067415, lr: 0.01000
#> Loss at epoch 121: 0.069106, lr: 0.01000
#> Loss at epoch 122: 0.068705, lr: 0.01000
#> Loss at epoch 123: 0.075654, lr: 0.01000
#> Loss at epoch 124: 0.088664, lr: 0.01000
#> Loss at epoch 125: 0.086662, lr: 0.01000
#> Loss at epoch 126: 0.088660, lr: 0.01000
#> Loss at epoch 127: 0.086304, lr: 0.01000
#> Loss at epoch 128: 0.076725, lr: 0.01000
#> Loss at epoch 129: 0.084493, lr: 0.01000
#> Loss at epoch 130: 0.076521, lr: 0.01000
#> Loss at epoch 131: 0.087787, lr: 0.01000
#> Loss at epoch 132: 0.084570, lr: 0.01000
#> Loss at epoch 133: 0.078460, lr: 0.01000
#> Loss at epoch 134: 0.094333, lr: 0.01000
#> Loss at epoch 135: 0.075914, lr: 0.01000
#> Loss at epoch 136: 0.064437, lr: 0.01000
#> Loss at epoch 137: 0.070773, lr: 0.01000
#> Loss at epoch 138: 0.083005, lr: 0.01000
#> Loss at epoch 139: 0.081682, lr: 0.01000
#> Loss at epoch 140: 0.061150, lr: 0.01000
#> Loss at epoch 141: 0.087631, lr: 0.01000
#> Loss at epoch 142: 0.080926, lr: 0.01000
#> Loss at epoch 143: 0.065549, lr: 0.01000
#> Loss at epoch 144: 0.087916, lr: 0.01000
#> Loss at epoch 145: 0.079513, lr: 0.01000
#> Loss at epoch 146: 0.075343, lr: 0.01000
#> Loss at epoch 147: 0.110408, lr: 0.01000
#> Loss at epoch 148: 0.071936, lr: 0.01000
#> Loss at epoch 149: 0.085683, lr: 0.01000
#> Loss at epoch 150: 0.066637, lr: 0.01000
#> dnn(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + 
#>     Petal.Width - 1, data = datasets::iris, loss = "softmax")
#> An `nn_module` containing 2,953 parameters.
#> 
#> ── Modules ─────────────────────────────────────────────────────────────────────
#> • 0: <nn_linear> #250 parameters
#> • 1: <nn_selu> #0 parameters
#> • 2: <nn_linear> #2,550 parameters
#> • 3: <nn_selu> #0 parameters
#> • 4: <nn_linear> #153 parameters

#> Number of Neighborhoods reduced to 8
#> Number of Neighborhoods reduced to 8
#> Number of Neighborhoods reduced to 8


#>      [,1] [,2]
#> [1,]    5    2
#> [2,]    4    2
#> [3,]   10    2
#> [4,]   15    2
#> Starting hyperparameter tuning...
#> Fitting final model...
#> # A tibble: 10 × 6
#>    steps  test train models hidden         lr
#>    <int> <dbl> <dbl> <lgl>  <list>      <dbl>
#>  1     1  23.8     0 NA     <dbl [2]> 0.0543 
#>  2     2  26.4     0 NA     <dbl [2]> 0.0405 
#>  3     3  17.5     0 NA     <dbl [2]> 0.0602 
#>  4     4  63.9     0 NA     <dbl [2]> 0.00915
#>  5     5 108.      0 NA     <dbl [2]> 0.00589
#>  6     6  52.0     0 NA     <dbl [2]> 0.00984
#>  7     7 116.      0 NA     <dbl [2]> 0.00385
#>  8     8  27.1     0 NA     <dbl [2]> 0.0436 
#>  9     9  24.9     0 NA     <dbl [2]> 0.0412 
#> 10    10  18.3     0 NA     <dbl [2]> 0.0575 


#>            [,1]       [,2]       [,3]
#> [1,] 0.30124742 0.03514386 0.07208966
#> [2,] 0.03514386 0.15174784 0.02518149
#> [3,] 0.07208966 0.02518149 0.20623882
# }