Blog Post 3 - Training and Hyperparameter Tuning
Alex Berry, Jason Chan, Hyunjoon Lee
Brown University Data Science Initiative
DATA 2040: Deep Learning
March 3, 2020
Now that preprocessing is done, it is time to build the model. We migrated our project to Google Cloud Platform (GCP), using one NVIDIA P100 GPU, 16 CPUs, and 60 GB of RAM. Training on GCP was much faster than on our local CPUs, and the extra memory let us avoid the kernel crashes we had been hitting due to out-of-memory errors. We started by building the baseline model and then proceeded to hyperparameter tuning.
Baseline Model
We took the baseline model from Code Ninja’s Bengali Graphemes: Starter EDA+ Multi Output CNN.
The baseline model consisted of four convolutional modules. Each module had four convolutional layers, batch normalization, max pooling, a fifth convolutional layer, and dropout. Below is the code of the first convolutional module.
# First convolutional module: four 3x3 convolutions, batch normalization,
# max pooling, a wider 5x5 convolution, and dropout.
model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu',
               input_shape=(IMG_X_SIZE, IMG_Y_SIZE, 1))(inputs)
model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu')(model)
model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu')(model)
model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu')(model)
model = BatchNormalization(momentum=0.15)(model)
model = MaxPool2D(pool_size=(2, 2))(model)  # halves the spatial dimensions
model = Conv2D(filters=32, kernel_size=(5, 5), padding='SAME', activation='relu')(model)
model = Dropout(rate=0.3)(model)
The number of filters started at 32 and doubled for each module (32, 64, 128, 256). The filter count was increased so that each successive convolutional module could capture more detailed and deeper feature patterns in the image data.
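To make the doubling pattern concrete, here is a minimal sketch of how such a module could be wrapped in a helper function and stacked. The helper name conv_module and the loop are our own illustration rather than the notebook's actual code, and modules 2 through 4 of the real model also insert an extra batch normalization before their dropout layer (visible in the final model summary further down).

from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     MaxPool2D, Dropout)

IMG_X_SIZE, IMG_Y_SIZE = 87, 106  # preprocessed image size (see the model summary)

def conv_module(x, filters):
    # Four 3x3 convolutions with the given filter count.
    for _ in range(4):
        x = Conv2D(filters=filters, kernel_size=(3, 3), padding='SAME',
                   activation='relu')(x)
    x = BatchNormalization(momentum=0.15)(x)
    x = MaxPool2D(pool_size=(2, 2))(x)
    # A wider 5x5 convolution followed by dropout.
    x = Conv2D(filters=filters, kernel_size=(5, 5), padding='SAME',
               activation='relu')(x)
    return Dropout(rate=0.3)(x)

inputs = Input(shape=(IMG_X_SIZE, IMG_Y_SIZE, 1))
x = inputs
for filters in (32, 64, 128, 256):  # filter count doubles with each module
    x = conv_module(x, filters)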
Hyperparameters:
- kernel_size = (3, 3)
- The height and width of our convolution kernel were set to (3, 3).
- padding = ‘SAME’
- We apply padding such that the input images are fully covered by our filter and specified stride. Because we use the default stride of (1, 1), the outputs have the same spatial dimensions as the inputs (hence SAME).
- activation = ‘relu’
- We use ReLU activation function.
- rate = 0.3
- Dropout is applied to the fifth convolutional layer of the module, with the dropout rate set to 0.3.
Below is the architecture for the entire baseline model.
The performance of the baseline model was:

| | Weighted Avg Accuracy | Root Accuracy | Vowel Accuracy | Consonant Accuracy |
|---|---|---|---|---|
| Baseline Model | 95.87% | 93.61% | 98.10% | 98.14% |
Hyperparameter Tuning
With a baseline model that already performed fairly well, the next step was to tune hyperparameters to try to increase our score. For our baseline, it took about 400 seconds to train one epoch, and running the model for 20 epochs took around 1 hour and 20 minutes. To keep tuning tractable, we decreased the number of epochs from 20 to 10. Still, we needed to narrow down the range of the hyperparameters. Below is the list of hyperparameters we ultimately tuned.
- Activations (for convolutional layers) = [“tanh”, “relu”]
- Dropout probability (for all layers) = [0.20, 0.40]
- Optimizers (for whole model) = [“nadam”, “adam”]
- Batch Sizes (for whole model) = [128, 256]
Notes
1) We initially planned to tune our model with different learning rate schedulers (power and exponential scheduling). However, early on during hyperparameter tuning, we found that the exponential learning rate scheduler performed very poorly. Therefore, we fixed the power scheduler as our learning rate scheduler and instead tuned a different hyperparameter: batch size. Our intuition was that smaller batch sizes may perform better for convolutional networks (see related article).
2) We initially tried the Keras hyperparameter tuning library (keras-tuner) but were unable to get the code to work. Instead, we coded our own grid search with basic for-loops (a sketch is shown below).
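For reference, our hand-rolled grid search looked roughly like the sketch below. The build_model helper, the training arrays (X_train, y_root, y_vowel, y_consonant), the validation split, and the scheduler constants are stand-ins for the actual notebook code; the loss and metrics shown are the usual choices for one-hot labels; and the power schedule follows the standard lr0 / (1 + epoch / s) form.

from tensorflow.keras.callbacks import LearningRateScheduler

# Power scheduling: the learning rate decays as lr0 / (1 + epoch / s).
# lr0 and s are illustrative values, not necessarily the ones we used.
def power_schedule(lr0=1e-3, s=10):
    return lambda epoch: lr0 / (1 + epoch / s)

histories = {}
for activation in ["tanh", "relu"]:
    for dropout_prob in [0.20, 0.40]:
        for optimizer in ["nadam", "adam"]:
            for batch_size in [128, 256]:
                # build_model is a placeholder for a function that rebuilds the
                # baseline architecture with the given activation and dropout rate.
                model = build_model(activation=activation, dropout_prob=dropout_prob)
                model.compile(optimizer=optimizer,
                              loss='categorical_crossentropy',
                              metrics=['accuracy'])
                history = model.fit(
                    X_train, [y_root, y_vowel, y_consonant],
                    validation_split=0.1, epochs=10, batch_size=batch_size,
                    callbacks=[LearningRateScheduler(power_schedule())])
                histories[(activation, dropout_prob, optimizer, batch_size)] = history.history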
Results
Overall, only one model performed (very slightly) better than the baseline model. The weighted average validation accuracy of the best model was 94.83%, and its validation accuracy on the grapheme root was 92.04%. The hyperparameters of the best model (model_8) were activation = relu, dropout_prob = 0.2, optimizer = adam, and batch_size = 256.
Here are plots of the validation accuracy over the epochs for a selection of the hyperparameter trials:
Model 0
Activation is tanh, dropout probability is 0.2, optimizer is adam, and batch size is 256.
Model 12
Activation is relu, dropout probability is 0.4, optimizer is adam, and batch size is 256.
You might be curious why models 6, 7, 9, 10, and 11 are missing. We decided to terminate the hyperparameter search early mainly because of computation and storage limitations. Even with 60 GB of RAM, 16 CPUs, and one NVIDIA P100 GPU on GCP, each model took about 40 minutes to train (16 models in total, roughly 10 hours), and we kept running out of memory.
After analyzing the first six models, we observed that the Nadam optimizer performed very poorly and that a batch size of 256 and a lower dropout rate performed better, especially on the validation set, which meant those models were not overfitting. Observing these clear patterns early on, we decided to stop the tuning. Instead, because we wanted to compare 'tanh' and 'relu', we trained two additional models with hyperparameters (relu, 0.2, adam, 256) and (relu, 0.4, adam, 256) and compared them to (tanh, 0.2, adam, 256) and (tanh, 0.4, adam, 256), respectively.
Final Model
We were not satisfied with the result of hyperparameter tuning, so we decided to apply more adjustments to our final model. Keeping the tuned hyperparameters activation = relu, dropout_prob = 0.2, optimizer = adam, and batch_size = 256, we added one more convolutional module with filters = 16 and increased the number of epochs to 30 for training the final model.
Below is the architecture for the final model:
Model: "final_model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_2 (InputLayer) (None, 87, 106, 1) 0
__________________________________________________________________________________________________
conv2d_31 (Conv2D) (None, 87, 106, 32) 320 input_2[0][0]
__________________________________________________________________________________________________
conv2d_32 (Conv2D) (None, 87, 106, 32) 9248 conv2d_31[0][0]
__________________________________________________________________________________________________
conv2d_33 (Conv2D) (None, 87, 106, 32) 9248 conv2d_32[0][0]
__________________________________________________________________________________________________
conv2d_34 (Conv2D) (None, 87, 106, 32) 9248 conv2d_33[0][0]
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 87, 106, 32) 128 conv2d_34[0][0]
__________________________________________________________________________________________________
max_pooling2d_7 (MaxPooling2D) (None, 43, 53, 32) 0 batch_normalization_10[0][0]
__________________________________________________________________________________________________
conv2d_35 (Conv2D) (None, 43, 53, 32) 25632 max_pooling2d_7[0][0]
__________________________________________________________________________________________________
dropout_8 (Dropout) (None, 43, 53, 32) 0 conv2d_35[0][0]
__________________________________________________________________________________________________
conv2d_36 (Conv2D) (None, 43, 53, 64) 18496 dropout_8[0][0]
__________________________________________________________________________________________________
conv2d_37 (Conv2D) (None, 43, 53, 64) 36928 conv2d_36[0][0]
__________________________________________________________________________________________________
conv2d_38 (Conv2D) (None, 43, 53, 64) 36928 conv2d_37[0][0]
__________________________________________________________________________________________________
conv2d_39 (Conv2D) (None, 43, 53, 64) 36928 conv2d_38[0][0]
__________________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 43, 53, 64) 256 conv2d_39[0][0]
__________________________________________________________________________________________________
max_pooling2d_8 (MaxPooling2D) (None, 21, 26, 64) 0 batch_normalization_11[0][0]
__________________________________________________________________________________________________
conv2d_40 (Conv2D) (None, 21, 26, 64) 102464 max_pooling2d_8[0][0]
__________________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 21, 26, 64) 256 conv2d_40[0][0]
__________________________________________________________________________________________________
dropout_9 (Dropout) (None, 21, 26, 64) 0 batch_normalization_12[0][0]
__________________________________________________________________________________________________
conv2d_41 (Conv2D) (None, 21, 26, 128) 73856 dropout_9[0][0]
__________________________________________________________________________________________________
conv2d_42 (Conv2D) (None, 21, 26, 128) 147584 conv2d_41[0][0]
__________________________________________________________________________________________________
conv2d_43 (Conv2D) (None, 21, 26, 128) 147584 conv2d_42[0][0]
__________________________________________________________________________________________________
conv2d_44 (Conv2D) (None, 21, 26, 128) 147584 conv2d_43[0][0]
__________________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 21, 26, 128) 512 conv2d_44[0][0]
__________________________________________________________________________________________________
max_pooling2d_9 (MaxPooling2D) (None, 10, 13, 128) 0 batch_normalization_13[0][0]
__________________________________________________________________________________________________
conv2d_45 (Conv2D) (None, 10, 13, 128) 409728 max_pooling2d_9[0][0]
__________________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 10, 13, 128) 512 conv2d_45[0][0]
__________________________________________________________________________________________________
dropout_10 (Dropout) (None, 10, 13, 128) 0 batch_normalization_14[0][0]
__________________________________________________________________________________________________
conv2d_46 (Conv2D) (None, 10, 13, 256) 295168 dropout_10[0][0]
__________________________________________________________________________________________________
conv2d_47 (Conv2D) (None, 10, 13, 256) 590080 conv2d_46[0][0]
__________________________________________________________________________________________________
conv2d_48 (Conv2D) (None, 10, 13, 256) 590080 conv2d_47[0][0]
__________________________________________________________________________________________________
conv2d_49 (Conv2D) (None, 10, 13, 256) 590080 conv2d_48[0][0]
__________________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 10, 13, 256) 1024 conv2d_49[0][0]
__________________________________________________________________________________________________
max_pooling2d_10 (MaxPooling2D) (None, 5, 6, 256) 0 batch_normalization_15[0][0]
__________________________________________________________________________________________________
conv2d_50 (Conv2D) (None, 5, 6, 256) 1638656 max_pooling2d_10[0][0]
__________________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 5, 6, 256) 1024 conv2d_50[0][0]
__________________________________________________________________________________________________
dropout_11 (Dropout) (None, 5, 6, 256) 0 batch_normalization_16[0][0]
__________________________________________________________________________________________________
flatten_2 (Flatten) (None, 7680) 0 dropout_11[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (None, 1024) 7865344 flatten_2[0][0]
__________________________________________________________________________________________________
dropout_12 (Dropout) (None, 1024) 0 dense_3[0][0]
__________________________________________________________________________________________________
dense_4 (Dense) (None, 512) 524800 dropout_12[0][0]
__________________________________________________________________________________________________
dense_root (Dense) (None, 168) 86184 dense_4[0][0]
__________________________________________________________________________________________________
dense_vowel (Dense) (None, 11) 5643 dense_4[0][0]
__________________________________________________________________________________________________
dense_consonant (Dense) (None, 7) 3591 dense_4[0][0]
==================================================================================================
Total params: 13,405,114
Trainable params: 13,403,258
Non-trainable params: 1,856
__________________________________________________________________________________________________
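For context, the classification head at the bottom of this summary corresponds to code along the following lines. The layer sizes and output names come from the summary itself; the ReLU activations on the dense layers, the softmax outputs, and the dropout rate are our assumptions, and the tensors x and inputs are assumed to be the output and input of the convolutional stack.

from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.models import Model

x = Flatten()(x)                        # (None, 7680)
x = Dense(1024, activation='relu')(x)   # dense_3
x = Dropout(rate=0.2)(x)                # dropout_12 (rate assumed)
x = Dense(512, activation='relu')(x)    # dense_4

# One softmax head per target, sized to the number of classes.
head_root = Dense(168, activation='softmax', name='dense_root')(x)
head_vowel = Dense(11, activation='softmax', name='dense_vowel')(x)
head_consonant = Dense(7, activation='softmax', name='dense_consonant')(x)

model = Model(inputs=inputs, outputs=[head_root, head_vowel, head_consonant],
              name='final_model')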
Below is the validation accuracy of the final model over the 30 epochs:
The final takeaways from this tuning were that ReLU is a better activation function than tanh, that a lower dropout probability leads to a slight increase in root accuracy, that the Adam optimizer performs much better than Nadam, and that a batch size of 256 is better than 128.
Therefore, we chose dropout = 0.2, added one more convolutional module with filters = 16, and trained the final model for 30 epochs, which gave the following results:
| | Weighted Avg Accuracy | Root Accuracy | Vowel Accuracy | Consonant Accuracy |
|---|---|---|---|---|
| Final Model (before MC Dropout) | 96.35% | 94.34% | 98.36% | 98.38% |
Next Steps
The immediate next extension we intend to try is a Monte Carlo Dropout approach to see if the performance increases even further.
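As a rough sketch of what that could look like (assuming TensorFlow 2, where calling the model with training=True keeps the Dropout layers active at prediction time), MC Dropout averages the outputs of several stochastic forward passes:

import numpy as np

def mc_dropout_predict(model, X, n_samples=30):
    # Each pass with training=True applies dropout, so repeated passes give
    # different predictions; averaging them approximates the MC Dropout estimate.
    samples = [model(X, training=True) for _ in range(n_samples)]
    # Our model has three outputs (root, vowel, consonant); average each one.
    return [np.mean([np.asarray(s[i]) for s in samples], axis=0)
            for i in range(len(samples[0]))]

The averaged probabilities could then be evaluated exactly like the ordinary predictions.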
In addition, we are interested in restructuring the network’s architecture by emulating some of the innovations of famous architectures like GoogLeNet (i.e. inception modules) or by extending the network to many more layers while using residual blocks.
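As an illustration of the residual idea, a basic residual block in Keras could look like the sketch below; the filter counts, the 1x1 projection shortcut, and the placement of batch normalization are design choices we would still need to experiment with.

from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, Add

def residual_block(x, filters):
    # Project the shortcut with a 1x1 convolution if the channel count changes.
    shortcut = x
    if x.shape[-1] != filters:
        shortcut = Conv2D(filters, kernel_size=(1, 1), padding='SAME')(shortcut)
    # Two 3x3 convolutions on the main path.
    y = Conv2D(filters, kernel_size=(3, 3), padding='SAME')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, kernel_size=(3, 3), padding='SAME')(y)
    y = BatchNormalization()(y)
    # Add the skip connection, then apply the final non-linearity.
    y = Add()([shortcut, y])
    return Activation('relu')(y)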