Blog Post 3 - Training and Hyperparameter Tuning
Alex Berry, Jason Chan, Hyunjoon Lee
Brown University Data Science Initiative
DATA 2040: Deep Learning
March 3, 2020
Now that preprocessing is done, it is time to build the model. We migrated our project to Google Cloud Platform (GCP), using one NVIDIA P100 GPU, 16 CPUs, and 60 GB of RAM. Training on GCP was much faster than on our local CPUs, and the extra memory let us avoid the kernel crashes we had been hitting due to out-of-memory errors. We started by building the baseline model and then proceeded to hyperparameter tuning.
Baseline Model
We took the baseline model from Code Ninja’s Bengali Graphemes: Starter EDA+ Multi Output CNN.
The baseline model consisted of four convolutional modules. Each module had four convolutional layers, batch normalization, max pooling, a fifth convolutional layer, and dropout. Below is the code of the first convolutional module.
# First convolutional module: four 3x3 convolutions, batch normalization,
# max pooling, a wider 5x5 convolution, and dropout.
model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu',
               input_shape=(IMG_X_SIZE, IMG_Y_SIZE, 1))(inputs)
model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu')(model)
model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu')(model)
model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu')(model)
model = BatchNormalization(momentum=0.15)(model)
model = MaxPool2D(pool_size=(2, 2))(model)  # halves the spatial dimensions
model = Conv2D(filters=32, kernel_size=(5, 5), padding='SAME', activation='relu')(model)
model = Dropout(rate=0.3)(model)
The number of filters started at 32 and doubled for each module (32, 64, 128, 256). The filter count was increased so that each successive convolutional module could capture more detailed and deeper feature patterns in the image data.
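To make the doubling pattern concrete, here is a minimal sketch of how such a module could be wrapped in a helper function and stacked. The helper name conv_module and the loop are our own illustration rather than the notebook's actual code, and modules 2 through 4 of the real model also insert an extra batch normalization before their dropout layer (visible in the final model summary further down).

from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     MaxPool2D, Dropout)

IMG_X_SIZE, IMG_Y_SIZE = 87, 106  # preprocessed image size (see the model summary)

def conv_module(x, filters):
    # Four 3x3 convolutions with the given filter count.
    for _ in range(4):
        x = Conv2D(filters=filters, kernel_size=(3, 3), padding='SAME',
                   activation='relu')(x)
    x = BatchNormalization(momentum=0.15)(x)
    x = MaxPool2D(pool_size=(2, 2))(x)
    # A wider 5x5 convolution followed by dropout.
    x = Conv2D(filters=filters, kernel_size=(5, 5), padding='SAME',
               activation='relu')(x)
    return Dropout(rate=0.3)(x)

inputs = Input(shape=(IMG_X_SIZE, IMG_Y_SIZE, 1))
x = inputs
for filters in (32, 64, 128, 256):  # filter count doubles with each module
    x = conv_module(x, filters)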
Hyperparameters:
- kernel_size = (3, 3)
- The height and width of our convolution kernel were set to (3, 3).
- padding = ‘SAME’
- We apply padding such that the input images are fully covered by our filter and specified stride. Because we use the default stride of (1, 1), the outputs have the same spatial dimensions as the inputs (hence SAME).
- activation = ‘relu’
- We use ReLU activation function.
- rate = 0.3
- Dropout is applied to the fifth convolutional layer of the module, with the dropout rate set to 0.3.
Below is the architecture for the entire baseline model.
The performance of the baseline model was:

| | Weighted Avg Accuracy | Root Accuracy | Vowel Accuracy | Consonant Accuracy |
|---|---|---|---|---|
| Baseline Model | 95.87% | 93.61% | 98.10% | 98.14% |
Hyperparameter Tuning
With a baseline model that already performed fairly well, the next step was to tune hyperparameters to try to increase our score. For our baseline, it took about 400 seconds to train one epoch, and running the model for 20 epochs took around 1 hour and 20 minutes. To keep tuning tractable, we decreased the number of epochs from 20 to 10. Still, we needed to narrow down the range of the hyperparameters. Below is the list of hyperparameters we ultimately tuned.
- Activations (for convolutional layers) = [“tanh”, “relu”]
- Dropout probability (for all layers) = [0.20, 0.40]
- Optimizers (for whole model) = [“nadam”, “adam”]
- Batch Sizes (for whole model) = [128, 256]
Notes
1) We initially planned to tune our model with different learning rate schedulers (power and exponential scheduling). However, early on during hyperparameter tuning, we found that the exponential learning rate scheduler performed very poorly. Therefore, we fixed the power scheduler as our learning rate scheduler and instead tuned a different hyperparameter: batch size. Our intuition was that smaller batch sizes may perform better for convolutional networks (see related article).
2) We initially tried the Keras hyperparameter tuning library (keras-tuner) but were unable to get the code to work. Instead, we coded our own grid search with basic for-loops (a sketch is shown below).
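For reference, our hand-rolled grid search looked roughly like the sketch below. The build_model helper, the training arrays (X_train, y_root, y_vowel, y_consonant), the validation split, and the scheduler constants are stand-ins for the actual notebook code; the loss and metrics shown are the usual choices for one-hot labels; and the power schedule follows the standard lr0 / (1 + epoch / s) form.

from tensorflow.keras.callbacks import LearningRateScheduler

# Power scheduling: the learning rate decays as lr0 / (1 + epoch / s).
# lr0 and s are illustrative values, not necessarily the ones we used.
def power_schedule(lr0=1e-3, s=10):
    return lambda epoch: lr0 / (1 + epoch / s)

histories = {}
for activation in ["tanh", "relu"]:
    for dropout_prob in [0.20, 0.40]:
        for optimizer in ["nadam", "adam"]:
            for batch_size in [128, 256]:
                # build_model is a placeholder for a function that rebuilds the
                # baseline architecture with the given activation and dropout rate.
                model = build_model(activation=activation, dropout_prob=dropout_prob)
                model.compile(optimizer=optimizer,
                              loss='categorical_crossentropy',
                              metrics=['accuracy'])
                history = model.fit(
                    X_train, [y_root, y_vowel, y_consonant],
                    validation_split=0.1, epochs=10, batch_size=batch_size,
                    callbacks=[LearningRateScheduler(power_schedule())])
                histories[(activation, dropout_prob, optimizer, batch_size)] = history.history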
Results
Overall, only one model performed (very slightly) better than the baseline model. The weighted average validation accuracy of the best model was 94.83%, and its validation accuracy on the grapheme root was 92.04%. The hyperparameters of the best model (model_8) were activation = relu, dropout_prob = 0.2, optimizer = adam, and batch_size = 256.
Here are plots of the validation accuracy over the epochs for a selection of the hyperparameter trials:
Model 0
Activation is tanh, dropout probability is 0.2, optimizer is adam, and batch size is 256.
Model 12
Activation is relu, dropout probability is 0.4, optimizer is adam, and batch size is 256.
You might be curious why models 6, 7, 9, 10, and 11 are missing. We decided to terminate the hyperparameter search early mainly because of computation and storage limitations. Even with 60 GB of RAM, 16 CPUs, and one NVIDIA P100 GPU on GCP, each model took about 40 minutes to train (16 models in total, roughly 10 hours), and we kept running out of memory.
After analyzing the first six models, we observed that the Nadam optimizer performed very poorly and that a batch size of 256 and a lower dropout rate performed better, especially on the validation set, which meant those models were not overfitting. Observing these clear patterns early on, we decided to stop the tuning. Instead, because we wanted to compare 'tanh' and 'relu', we trained two additional models with hyperparameters (relu, 0.2, adam, 256) and (relu, 0.4, adam, 256) and compared them to (tanh, 0.2, adam, 256) and (tanh, 0.4, adam, 256), respectively.
Final Model
We were not satisfied with the result of hyperparameter tuning, so we decided to apply more adjustments to our final model. Keeping the tuned hyperparameters activation = relu, dropout_prob = 0.2, optimizer = adam, and batch_size = 256, we added one more convolutional module with filters = 16 and increased the number of epochs to 30 for training the final model.
Below is the architecture for the final model:
Model: "final_model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_2 (InputLayer) (None, 87, 106, 1) 0
__________________________________________________________________________________________________
conv2d_31 (Conv2D) (None, 87, 106, 32) 320 input_2[0][0]
__________________________________________________________________________________________________
conv2d_32 (Conv2D) (None, 87, 106, 32) 9248 conv2d_31[0][0]
__________________________________________________________________________________________________
conv2d_33 (Conv2D) (None, 87, 106, 32) 9248 conv2d_32[0][0]
__________________________________________________________________________________________________
conv2d_34 (Conv2D) (None, 87, 106, 32) 9248 conv2d_33[0][0]
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 87, 106, 32) 128 conv2d_34[0][0]
__________________________________________________________________________________________________
max_pooling2d_7 (MaxPooling2D) (None, 43, 53, 32) 0 batch_normalization_10[0][0]
__________________________________________________________________________________________________
conv2d_35 (Conv2D) (None, 43, 53, 32) 25632 max_pooling2d_7[0][0]
__________________________________________________________________________________________________
dropout_8 (Dropout) (None, 43, 53, 32) 0 conv2d_35[0][0]
__________________________________________________________________________________________________
conv2d_36 (Conv2D) (None, 43, 53, 64) 18496 dropout_8[0][0]
__________________________________________________________________________________________________
conv2d_37 (Conv2D) (None, 43, 53, 64) 36928 conv2d_36[0][0]
__________________________________________________________________________________________________
conv2d_38 (Conv2D) (None, 43, 53, 64) 36928 conv2d_37[0][0]
__________________________________________________________________________________________________
conv2d_39 (Conv2D) (None, 43, 53, 64) 36928 conv2d_38[0][0]
__________________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 43, 53, 64) 256 conv2d_39[0][0]
__________________________________________________________________________________________________
max_pooling2d_8 (MaxPooling2D) (None, 21, 26, 64) 0 batch_normalization_11[0][0]
__________________________________________________________________________________________________
conv2d_40 (Conv2D) (None, 21, 26, 64) 102464 max_pooling2d_8[0][0]
__________________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 21, 26, 64) 256 conv2d_40[0][0]
__________________________________________________________________________________________________
dropout_9 (Dropout) (None, 21, 26, 64) 0 batch_normalization_12[0][0]
__________________________________________________________________________________________________
conv2d_41 (Conv2D) (None, 21, 26, 128) 73856 dropout_9[0][0]
__________________________________________________________________________________________________
conv2d_42 (Conv2D) (None, 21, 26, 128) 147584 conv2d_41[0][0]
__________________________________________________________________________________________________
conv2d_43 (Conv2D) (None, 21, 26, 128) 147584 conv2d_42[0][0]
__________________________________________________________________________________________________
conv2d_44 (Conv2D) (None, 21, 26, 128) 147584 conv2d_43[0][0]
__________________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 21, 26, 128) 512 conv2d_44[0][0]
__________________________________________________________________________________________________
max_pooling2d_9 (MaxPooling2D) (None, 10, 13, 128) 0 batch_normalization_13[0][0]
__________________________________________________________________________________________________
conv2d_45 (Conv2D) (None, 10, 13, 128) 409728 max_pooling2d_9[0][0]
__________________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 10, 13, 128) 512 conv2d_45[0][0]
__________________________________________________________________________________________________
dropout_10 (Dropout) (None, 10, 13, 128) 0 batch_normalization_14[0][0]
__________________________________________________________________________________________________
conv2d_46 (Conv2D) (None, 10, 13, 256) 295168 dropout_10[0][0]
__________________________________________________________________________________________________
conv2d_47 (Conv2D) (None, 10, 13, 256) 590080 conv2d_46[0][0]
__________________________________________________________________________________________________
conv2d_48 (Conv2D) (None, 10, 13, 256) 590080 conv2d_47[0][0]
__________________________________________________________________________________________________
conv2d_49 (Conv2D) (None, 10, 13, 256) 590080 conv2d_48[0][0]
__________________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 10, 13, 256) 1024 conv2d_49[0][0]
__________________________________________________________________________________________________
max_pooling2d_10 (MaxPooling2D) (None, 5, 6, 256) 0 batch_normalization_15[0][0]
__________________________________________________________________________________________________
conv2d_50 (Conv2D) (None, 5, 6, 256) 1638656 max_pooling2d_10[0][0]
__________________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 5, 6, 256) 1024 conv2d_50[0][0]
__________________________________________________________________________________________________
dropout_11 (Dropout) (None, 5, 6, 256) 0 batch_normalization_16[0][0]
__________________________________________________________________________________________________
flatten_2 (Flatten) (None, 7680) 0 dropout_11[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (None, 1024) 7865344 flatten_2[0][0]
__________________________________________________________________________________________________
dropout_12 (Dropout) (None, 1024) 0 dense_3[0][0]
__________________________________________________________________________________________________
dense_4 (Dense) (None, 512) 524800 dropout_12[0][0]
__________________________________________________________________________________________________
dense_root (Dense) (None, 168) 86184 dense_4[0][0]
__________________________________________________________________________________________________
dense_vowel (Dense) (None, 11) 5643 dense_4[0][0]
__________________________________________________________________________________________________
dense_consonant (Dense) (None, 7) 3591 dense_4[0][0]
==================================================================================================
Total params: 13,405,114
Trainable params: 13,403,258
Non-trainable params: 1,856
__________________________________________________________________________________________________
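For context, the classification head at the bottom of this summary corresponds to code along the following lines. The layer sizes and output names come from the summary itself; the ReLU activations on the dense layers, the softmax outputs, and the dropout rate are our assumptions, and the tensors x and inputs are assumed to be the output and input of the convolutional stack.

from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.models import Model

x = Flatten()(x)                        # (None, 7680)
x = Dense(1024, activation='relu')(x)   # dense_3
x = Dropout(rate=0.2)(x)                # dropout_12 (rate assumed)
x = Dense(512, activation='relu')(x)    # dense_4

# One softmax head per target, sized to the number of classes.
head_root = Dense(168, activation='softmax', name='dense_root')(x)
head_vowel = Dense(11, activation='softmax', name='dense_vowel')(x)
head_consonant = Dense(7, activation='softmax', name='dense_consonant')(x)

model = Model(inputs=inputs, outputs=[head_root, head_vowel, head_consonant],
              name='final_model')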
Below is the validation accuracy of the final model over the 30 epochs:
The final takeaways from this tuning were that ReLU is a better activation function than tanh, that a lower dropout probability leads to a slight increase in root accuracy, that the Adam optimizer performs much better than Nadam, and that a batch size of 256 is better than 128.
Therefore, we chose dropout = 0.2, added one more convolutional module with filters = 16, and trained the final model for 30 epochs, which gave the following results:
| | Weighted Avg Accuracy | Root Accuracy | Vowel Accuracy | Consonant Accuracy |
|---|---|---|---|---|
| Final Model (before MC Dropout) | 96.35% | 94.34% | 98.36% | 98.38% |
Next Steps
The immediate next extension we intend to try is a Monte Carlo Dropout approach to see if the performance increases even further.
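As a rough sketch of what that could look like (assuming TensorFlow 2, where calling the model with training=True keeps the Dropout layers active at prediction time), MC Dropout averages the outputs of several stochastic forward passes:

import numpy as np

def mc_dropout_predict(model, X, n_samples=30):
    # Each pass with training=True applies dropout, so repeated passes give
    # different predictions; averaging them approximates the MC Dropout estimate.
    samples = [model(X, training=True) for _ in range(n_samples)]
    # Our model has three outputs (root, vowel, consonant); average each one.
    return [np.mean([np.asarray(s[i]) for s in samples], axis=0)
            for i in range(len(samples[0]))]

The averaged probabilities could then be evaluated exactly like the ordinary predictions.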
In addition, we are interested in restructuring the network’s architecture by emulating some of the innovations of famous architectures like GoogLeNet (i.e. inception modules) or by extending the network to many more layers while using residual blocks.
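As an illustration of the residual idea, a basic residual block in Keras could look like the sketch below; the filter counts, the 1x1 projection shortcut, and the placement of batch normalization are design choices we would still need to experiment with.

from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, Add

def residual_block(x, filters):
    # Project the shortcut with a 1x1 convolution if the channel count changes.
    shortcut = x
    if x.shape[-1] != filters:
        shortcut = Conv2D(filters, kernel_size=(1, 1), padding='SAME')(shortcut)
    # Two 3x3 convolutions on the main path.
    y = Conv2D(filters, kernel_size=(3, 3), padding='SAME')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, kernel_size=(3, 3), padding='SAME')(y)
    y = BatchNormalization()(y)
    # Add the skip connection, then apply the final non-linearity.
    y = Add()([shortcut, y])
    return Activation('relu')(y)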