Image Classification With CNNs
Introduction
The objective of this project was to build and test three convolutional neural networks to perform image classification on a dataset of chest X-ray images of patients. The goal was for the model to predict whether a sample comes from someone suffering from COVID-19, pneumonia, or neither. The dataset used was the ‘Chest X-ray dataset’ available at https://www.kaggle.com/datasets/prashant268/chest-xray-covid19-pneumonia. Several Python libraries were needed to complete this analysis. As the project was completed in Google Colab, the os module was used to locate and fetch the directory containing the required X-ray image data. TensorFlow was used extensively, both for pre-processing the images and for creating the pre-built models and adding layers to our final model. Matplotlib was used to visualise evaluation metrics of our model such as model accuracy and the loss function. Scikit-learn was used for pre-processing, such as splitting the data into training, validation, and testing sets, as well as for classification metrics such as the classification report, accuracy score, and the raw confusion matrix. Finally, seaborn was used in the visualisation process, for example to present the confusion matrix in a neater manner.
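A sketch of the imports this setup implies (the exact import style is an assumption based on the libraries listed above, not the original code):

```python
import os

import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
```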
The images were all the same size and in grayscale; however, we resized them to 64x64. The data came already split into a training set and a testing set, so the train/test split was instead performed on the training dataset, with 25% used for the validation set and 75% used for training.
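A minimal sketch of this preprocessing step. The arrays here are illustrative placeholders rather than the project's loading code, and the grayscale-to-RGB conversion is an assumption made so the images fit the pre-trained models used later:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Illustrative placeholder for the loaded training images and labels;
# in the project these came from the Kaggle chest X-ray training directory.
images = np.random.rand(100, 150, 150, 1).astype("float32")   # dummy grayscale batch
labels = np.random.randint(0, 3, size=100)                     # 3 classes

images = tf.image.resize(images, (64, 64))                     # resize to 64x64
images = tf.image.grayscale_to_rgb(images).numpy()             # 3 channels for the pre-trained bases

# 75% training / 25% validation split performed on the training data
X_train, X_val, y_train, y_val = train_test_split(
    images, labels, test_size=0.25, random_state=42
)
```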
TESTING DIFFERENT MODELS
To explain how the two models we selected work, we must first explain how convolutional neural networks (CNNs) work in general. CNNs use special hidden layers called convolution layers that detect patterns within images using filters. Different filters within a layer have different functions (e.g. some may be specialised to detect edges, some to detect circles, etc.), and each filter convolves across the image in a window determined by its kernel size to detect different patterns. Earlier convolutional layers in a model capture more general/basic features such as straight lines, while deeper layers capture progressively more detailed features, with later layers able to detect whole objects (Yamashita, Nishio, Do & Togashi, 2018). So, in brief, the role of convolution layers is to use element-wise multiplication between inputs and kernels to extract different features from the image and produce feature maps. These feature maps are then passed to a pooling layer, which effectively reduces the number of features in a feature map. This has the advantage of reducing the number of parameters to learn but, more importantly, by ‘summarising’ the features present in a region of a feature map it makes the model more generalisable to changes in the position of specific features (Yamashita, Nishio, Do & Togashi, 2018). The feature maps of the final convolution/pooling layer are flattened into a 1D array and passed to the last layer, whose activation function then makes a prediction of the appropriate class.
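As an illustration of these building blocks (not the architecture used in this project), a minimal Keras CNN with one convolution layer, one pooling layer, and a softmax output might look like this:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal illustrative CNN: convolution -> pooling -> flatten -> dense softmax
toy_cnn = models.Sequential([
    layers.Input(shape=(64, 64, 1)),                       # 64x64 grayscale input
    layers.Conv2D(16, kernel_size=3, activation="relu"),   # filters produce feature maps
    layers.MaxPooling2D(pool_size=2),                      # 'summarises' each 2x2 region
    layers.Flatten(),                                      # 1D array fed to the classifier
    layers.Dense(3, activation="softmax"),                 # one probability per class
])
toy_cnn.summary()
```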
VGG-16
The first CNN model we selected was VGG-16, originally developed by Simonyan and Zisserman (2014). It was trained on the ImageNet dataset and submitted to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an international competition between image recognition CNNs, where it achieved 92.7% top-5 accuracy in classifying ImageNet.
As VGG-16 is a CNN, it uses a similar architecture of convolutional layers, pooling layers, and dense layers. As the name suggests, it has 16 learnable weight layers: 13 convolutional layers and 3 fully connected dense layers. All hidden layers use the ReLU activation function except the final layer, which uses softmax activation instead to predict a class with a certain level of confidence.
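This layer structure can be inspected directly from the Keras implementation, shown here only for illustration (downloading the ImageNet weights in the environment is assumed):

```python
from tensorflow.keras.applications import VGG16

# Full VGG-16 including its 3 fully connected layers;
# summary() lists the 13 convolutional layers and 3 dense layers.
vgg = VGG16(weights="imagenet", include_top=True)
vgg.summary()
```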
VGG-16 was chosen because its out-of-the-box architecture can achieve good accuracy. In addition, it is simple to understand and to customise, both by adding additional layers and by adjusting kernel sizes. This had to be balanced against the potential drawbacks of VGG-16. For example, because of its very large number of parameters, its weights can take up over 550MB, making it slow to run. In addition, increasing its number of layers may not be effective, as it suffers from the vanishing gradient problem beyond a certain point.
RESNET 50
The second model chosen was ResNet50, as it emerged after VGG-16 and achieved even better results than VGG-16 at the 2015 ILSVRC competition, placing first with a top-5 error rate of just 3.57% on ImageNet. So, we believed ResNet could improve on the results of VGG-16 for a few key reasons. Intuitively, adding more layers to a CNN should improve performance, as more layers mean more features can be picked out. However, as previously discussed, because VGG-16 suffers from the vanishing gradient problem, adding more layers leads to the model’s accuracy saturating and then decreasing. Previous experiments have shown this, with a 56-layer model obtaining a higher training error and test error than its 20-layer counterpart (fig 1) (He et al., 2015).
However, the ResNet model does not suffer from the vanishing/exploding gradient problem and therefore has scope for adding more layers to its architecture, allowing it to pick out more features from the images and therefore classify them more accurately (fig 2).
ResNet achieves this by using deep residual networks whose architecture allows identity mapping using skip connections. Skip connections directly map the output of an earlier layer onto the input of a layer further ahead (fig 3). This provides a “shortcut” path for the gradient to flow through, making the model more robust against the vanishing gradient problem. This in turn allows more layers to be added to the model, allowing for improved classification; this was exemplified in the ResNet authors’ own research, where they trained networks with over 100 and over 1,000 layers on the CIFAR-10 dataset with good results.
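A minimal sketch of a residual block with an identity skip connection. This illustrates the idea only; ResNet-50’s actual blocks also use bottleneck 1x1 convolutions and batch normalisation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two convolutions plus an identity skip connection."""
    shortcut = x                                                # saved input for the skip path
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)            # no activation yet
    y = layers.Add()([y, shortcut])                             # skip connection: output + input
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(64, 64, 16))
outputs = residual_block(inputs, 16)
block = tf.keras.Model(inputs, outputs)
block.summary()
```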
MODEL EVALUATION
The VGG-16 and ResNet-50 models were first run without any fitting and without any additional hidden layers. The only layer added was a dense output layer with three neurons and softmax activation, allowing the model to classify each image into one of the three categories with a value representative of the model’s confidence. Not fitting the model meant none of the weights had the opportunity to adapt to the inputs, resulting in poor performance (fig 4), with ResNet-50 achieving only 65.8% accuracy while VGG-16 fared much worse with an accuracy of 47.1%.
The code for both of these unfitted models followed the same pattern, sketched below.
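A minimal reconstruction of that pattern, assuming Keras applications with frozen ImageNet weights, 64x64 three-channel inputs, and global average pooling; all of these details are assumptions rather than the original code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16, ResNet50

def build_unfitted(base_cls):
    """Pre-trained base with only a 3-class softmax output layer added."""
    base = base_cls(weights="imagenet", include_top=False,
                    input_shape=(64, 64, 3), pooling="avg")
    base.trainable = False                      # base weights are not trained at this stage
    model = models.Sequential([
        base,
        layers.Dense(3, activation="softmax"),  # COVID-19 / pneumonia / normal
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

vgg_model = build_unfitted(VGG16)
resnet_model = build_unfitted(ResNet50)
# Evaluated on the test set without calling fit(), e.g. vgg_model.evaluate(X_test, y_test)
```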
The accuracies of these two out-of-the-box models are shown in fig 4.
Looking at fig 5, one of the reasons for the poor performance becomes evident. The VGG-16 model contains almost 15 million parameters, of which only 1,539 were actually trainable. ResNet-50 has almost 24 million parameters in total, of which only 513,003 are trainable. Although this is a small proportion, it is still much higher than that of VGG-16, which may be one reason for its better performance.
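These counts can be read directly from the Keras summary; the sketch below shows one way of computing them explicitly (the model variable refers to the reconstruction above and is illustrative):

```python
import tensorflow as tf

def count_params(model):
    """Return (total, trainable) parameter counts for a Keras model."""
    trainable = int(sum(tf.size(w) for w in model.trainable_weights))
    total = trainable + int(sum(tf.size(w) for w in model.non_trainable_weights))
    return total, trainable

total, trainable = count_params(resnet_model)   # resnet_model from the sketch above
print(f"total: {total:,}  trainable: {trainable:,}")
```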
ResNet-50 scoring slightly higher, combined with our intuition about its scope for adding more layers, meant that it was selected for further development. The next step was fitting the ResNet-50 model without any additional layers in order to obtain a baseline. This resulted in a much-improved accuracy of 92.8%, with fig 6 showing the model accuracy and validation loss.
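A sketch of this baseline fitting step, reusing the frozen ResNet-50 model and the 75/25 split from the earlier sketches; the epoch count here is an assumption:

```python
# Baseline: fit the ResNet-50 model (frozen base + softmax output) on the split data
history = resnet_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,          # assumed value for the baseline run
    batch_size=32,      # matches the batch size discussed later in the report
)
```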
IMPROVED MODEL
When first looking to improve the ResNet-50 model, it was noted that there was a sharp decrease in the number of neurons, from the 2,048 features output by the base network to just 3 neurons in the output layer. This meant it was unlikely that the model had enough representational power to preserve all the useful information in the images. Therefore, the team decided to implement a pyramid-style architecture with more neurons in the hidden layers. Initially, this was done through a dense layer of 1,024 neurons, but this worsened performance to 91.2%, and an additional layer with 512 neurons saw similar results. However, examination of the model accuracy showed undulation in the validation loss, suggesting that dropout layers could help prevent overfitting (Hinton et al., 2012). This saw an immediate improvement in the results, with the accuracy rising to 93.7%. The team therefore continued inductively, adding the next layer with 256 neurons and an associated dropout layer, again improving the performance. However, when the pyramid structure was continued by adding a 64-neuron layer with an associated dropout rate of 0.5, the accuracy was reduced to 93.5%. Therefore, the architecture of our model was as shown in fig 7.
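A sketch of the final architecture described above and shown in fig 7. The hidden-layer activations, the dropout rates of the retained layers, and the frozen base are assumptions not stated in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(64, 64, 3), pooling="avg")   # 2048-dimensional feature vector
base.trainable = False                                     # assumed frozen base

improved = models.Sequential([
    base,
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),          # assumed rate
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),          # assumed rate
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),          # assumed rate
    layers.Dense(3, activation="softmax"),
])
improved.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
```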
Following this, we looked to adapt the learning rate of the model. As the graph showed no signs of overfitting, the team increased the learning rate to assess whether this would improve the overall accuracy of the model. The learning rate was increased from 0.001 to 0.01, resulting in the accuracy dropping to 66.4% (fig 8).
So instead, the batch size was increased from 32 to 64; although this did not yield significant changes, the accuracy did drop slightly to 93.6%. The team therefore concluded that the model in fig 7 was the best it could achieve.
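These two adjustments amount to changing the optimiser’s learning rate and the batch size passed to fit(), roughly as follows (a sketch using the model and data from the previous blocks; the epoch count is an assumption):

```python
import tensorflow as tf

# Higher learning rate (0.001 -> 0.01): accuracy collapsed to 66.4%
improved.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])

# Larger batch size (32 -> 64): accuracy dipped slightly to 93.6%
history = improved.fit(X_train, y_train,
                       validation_data=(X_val, y_val),
                       epochs=20,          # assumed value
                       batch_size=64)
```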
When considering the number of epochs, a practical solution was to pick an arbitrarily high number of epochs in conjunction with early stopping on the validation loss to prevent overfitting. The graphs show no overfitting was present; however, both curves flattened towards the end, showing that further epochs were unlikely to improve the overall performance of the model, and early stopping was triggered as a result. This means the final accuracy of our model was ≈ 94%.
Fitting the model and adding our own layers are the reasons for the improved performance, with the number of trainable parameters rising from 513,003 in the default architecture to 2,787,578 in this one. The additional layers allowed these weights to be trained, enabling a smooth transfer of knowledge from the pre-trained network to our specific dataset. The smoothness of the curves is probably due to the precautions taken against overfitting through the use of dropout layers, as well as the learning rate, which is at the lower end of the usual range of values. A potential future improvement would be to add further layers, such as convolution and pooling layers, inside the network. Additionally, instead of resizing the images, the larger original images could be used to train the model; however, this would require more computing power.
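A sketch of the early-stopping setup described above; the patience value and the epoch ceiling are assumptions:

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # stop when the validation loss stops improving
    patience=5,                  # assumed patience
    restore_best_weights=True,   # keep the weights from the best epoch
)

history = improved.fit(X_train, y_train,
                       validation_data=(X_val, y_val),
                       epochs=100,            # arbitrarily high ceiling
                       batch_size=32,
                       callbacks=[early_stop])
```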