Convolutional Neural Networks: Why CNN?

Navya Paithara
Jun 30, 2021

Convolution: the operation of multiplying pixel values by kernel weights and summing the results is called “convolution”.

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm that can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and differentiate one from the other.

Let us understand the relationship between convolution and neural networks using an image classification example.

Consider a 20x20 image that you want to classify into one of 4 classes (say human, car, aeroplane, flower).

Let’s first look at the techniques used before CNNs.

  1. The first approach could be to flatten the 20x20 image into a 400-dimensional input feature vector and treat the problem as a 4-class classification task.
  2. The second approach is feature engineering. For classification, shapes matter to us, not colour, so an edge detector can help us identify shapes (see the sketch after this list). This is a refined feature, and the filter we choose here is domain-specific.
  3. SIFT/HOG: these descriptors capture how the gradients of pixels vary. Here the feature extraction is static, meaning there is no learning in the features; learning happens only in the weights of the classifier, such as the softmax layer. It is a deterministic algorithm. SIFT is scale-invariant: SIFT features are local, based on the appearance of the object at particular interest points, and are invariant to image scale and rotation.
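To make the edge-detector idea from point 2 concrete, here is a minimal sketch of a handcrafted convolution, assuming NumPy and SciPy are available. The Sobel kernels below are one classic fixed choice; nothing in them is learned.

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(20, 20)  # stand-in for a 20x20 greyscale image

# Handcrafted Sobel kernels: fixed and domain-specific, nothing is learned
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

gx = convolve2d(image, sobel_x, mode='same')  # horizontal gradients
gy = convolve2d(image, sobel_y, mode='same')  # vertical gradients
edges = np.hypot(gx, gy)  # gradient magnitude highlights shapes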

Here comes the concept of CNN

Instead of handcrafted kernels like edge detectors, can we learn sensible kernels in addition to the weights of the classifier? A much more powerful approach could be to learn multiple sensible kernels in the same layer; this helps avoid underfitting.

Multiple kernels at multiple layers would be even more efficient.

A network with multiple learnt convolution operations at every layer is called a Convolutional Neural network.

How is this different from a regular neural network, and what are the benefits?

The main advantages of CNNs over regular neural networks are:

1. Sparse Connectivity

2. Weight Sharing

Sparse Connectivity

Consider a 4x4 image containing the digit 2; we take the 16 pixels as a 16-dimensional input feature vector, with 10 output classes (digits).

We can have any number of hidden layers and activation units per layer.

In a traditional fully connected feed-forward neural network, all 16 input dimensions are connected to the hidden layer; subsequently, every output of the hidden layer is connected to the next layer.

There are many dense connections.

Regular Feedforward Network

In CNN, the connections are sparse. Let us understand this clearly.

Let h11 be the first neuron in hidden layer 1. In a regular feed-forward neural network, all 16 inputs contribute to the computation of h11, whereas in a CNN only a few inputs participate, depending on the kernel size.

In our example, with a 2x2 filter and a stride of 2, only a few pixels participate in the computation of h11.

Only pixels 1, 2, 5, 6 contribute to h11

Pixels 3, 4, 7, 8 contribute to h12

Pixels 9, 10, 13, 14 contribute to h13

Pixels 11, 12, 15, 16 contribute to h14

Convolution operation

The connections are sparser; here our major focus is on the interactions between neighbouring pixels. Evidently, this sparse connectivity reduces the number of parameters.
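A minimal NumPy sketch of this, assuming the 4x4 pixels are numbered 1 to 16 row by row as above:

import numpy as np

image = np.arange(1, 17).reshape(4, 4)  # pixels numbered 1..16, row-major
kernel = np.ones((2, 2))                # any 2x2 kernel; ones for illustration

# 2x2 kernel with stride 2: each hidden unit sees only 4 of the 16 pixels
h = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        patch = image[2*i:2*i+2, 2*j:2*j+2]  # e.g. pixels 1, 2, 5, 6 for h11
        h[i, j] = np.sum(patch * kernel)

print(h)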

But a doubt might arise: aren’t we losing interactions between some input pixels? Not really.

Let’s see this clearly with the below example.

Here the pixels x1 and x5 do not interact in the first hidden layer: h2 depends on x1, x2, x3, and similarly h4 depends on x3, x4, x5, so there is no unit that depends on both x1 and x5.

In the later layer, g3 depends on h2 and h4, so the interaction between all the pixels x1, x2, x3, x4, x5 happens there.

Even though the interaction does not happen in the initial layer, it inevitably happens as we go deeper into the network.

Pixel interaction in the later layers
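We can verify this with a small sketch that tracks which input pixels each unit depends on, assuming a size-3, stride-1 kernel as in the example (the indexing is illustrative):

# Each input pixel xi starts as its own dependency set
inputs = [{i} for i in range(1, 6)]  # x1..x5

def conv_deps(prev, k=3):
    # Dependency sets after one size-k, stride-1 convolution (no padding)
    return [set().union(*prev[i:i + k]) for i in range(len(prev) - k + 1)]

h = conv_deps(inputs)  # [{1,2,3}, {2,3,4}, {3,4,5}] -> h2, h3, h4
g = conv_deps(h)       # [{1,2,3,4,5}] -> g3 sees every input pixel
print(g)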

Weight Sharing

Another important characteristic of CNNs is weight sharing.

1. The same kernel or filter is convolved across the entire image, i.e., we want to learn the effect of the same kernel at every location of the image.

2. We can have many such kernels, but each kernel is shared by all locations of the image.
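To see how much weight sharing saves, compare parameter counts for our 4x4 example (a back-of-the-envelope sketch; biases are ignored):

# Fully connected: every one of the 16 inputs connects to each of 4 hidden units
dense_weights = 16 * 4   # 64 weights

# CNN: one 2x2 kernel is shared across all 4 locations of the image
shared_weights = 2 * 2   # 4 weights

print(dense_weights, shared_weights)  # 64 vs 4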

Let us look at a few basic CNN architectures: LeNet and AlexNet.

LeNet:

It is a 5-layered network, as it has learnable parameters at 5 layers.

The input image we consider here is a 32x32x1 greyscale image. Since it is greyscale, the number of channels is one.

A convolution operation is performed on the input image with a 5x5 kernel/filter:

Input = 32 x 32 x 1

Kernel = 5 x 5

Stride = 1; Padding = 0

Number of kernels = 6

Output = (32 - 5 + 1) x (32 - 5 + 1) x 6 = 28 x 28 x 6
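These numbers follow the standard output-size formula (W - K + 2P)/S + 1. A quick helper to verify them, as a minimal sketch:

def out_size(w, k, s=1, p=0):
    # Spatial output size of a convolution or pooling layer
    return (w - k + 2 * p) // s + 1

print(out_size(32, 5))       # 28 -> first convolution: 28 x 28 x 6
print(out_size(28, 2, s=2))  # 14 -> 2x2 pooling with stride 2: 14 x 14 x 6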

In the pooling or subsampling layer, the feature map is reduced by half and the number of channels remains the same. The result is 14 x 14 x 6.

Subsequently, in the next layer, we apply convolution with a 5x5 kernel; 16 such kernels are used. The result after convolution is 10 x 10 x 16.

Again, after the pooling/subsampling layer the feature map is reduced by half. Hence the outcome here is 5 x 5 x 16.

A further convolution is performed on the above result with a 5x5 filter; 120 such kernels are used, and the output is flattened. This is followed by 2 fully connected layers.

LeNet-5 Architecture

The architecture details are explained further in the tabular format below.

The number of parameters learned is calculated in the following fashion.
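One common way to count them (ignoring the trainable coefficients the original subsampling layers had) is (K x K x input channels + 1 bias) per output channel for convolutions, and (inputs + 1) x outputs for dense layers. A quick sketch:

def conv_params(k, c_in, c_out):
    # (k*k*c_in weights + 1 bias) for each of the c_out filters
    return (k * k * c_in + 1) * c_out

def dense_params(n_in, n_out):
    return (n_in + 1) * n_out

print(conv_params(5, 1, 6))     # C1:     156
print(conv_params(5, 6, 16))    # C3:   2,416
print(conv_params(5, 16, 120))  # C5:  48,120
print(dense_params(120, 84))    # F6:  10,164
print(dense_params(84, 10))     # Out:    850
# Total: 61,706 learnable parameters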

Let us write Python code for the LeNet architecture.

import tensorflow as tf
from tensorflow import keras
import numpy as np

Loading data

(train_x, train_y), (test_x, test_y) = keras.datasets.mnist.load_data()
# Scale pixel values to [0, 1]
train_x = train_x / 255.0
test_x = test_x / 255.0
# Add a channel dimension: (N, 28, 28) -> (N, 28, 28, 1)
train_x = tf.expand_dims(train_x, 3)
test_x = tf.expand_dims(test_x, 3)
# Use the first 5000 training images for validation
val_x = train_x[:5000]
val_y = train_y[:5000]

Creating LeNet Model Architecture

lenet_5_model = keras.models.Sequential([
    # C1: 6 filters of size 5x5; 'same' padding keeps the 28x28 MNIST input size
    keras.layers.Conv2D(6, kernel_size=5, strides=1, activation='tanh',
                        input_shape=train_x[0].shape, padding='same'),
    # S2: 2x2 average pooling halves the feature map
    keras.layers.AveragePooling2D(),
    # C3: 16 filters of size 5x5, no padding
    keras.layers.Conv2D(16, kernel_size=5, strides=1, activation='tanh', padding='valid'),
    # S4: average pooling halves the feature map again
    keras.layers.AveragePooling2D(),
    keras.layers.Flatten(),
    # C5 and F6 as dense layers, followed by the 10-class softmax output
    keras.layers.Dense(120, activation='tanh'),
    keras.layers.Dense(84, activation='tanh'),
    keras.layers.Dense(10, activation='softmax')
])

Compiling the Model

lenet_5_model.compile(optimizer='adam',
                      loss=keras.losses.sparse_categorical_crossentropy,
                      metrics=['accuracy'])

Fitting the Model

lenet_5_model.fit(train_x, train_y, epochs=5, validation_data=(val_x, val_y))
lenet_5_model.evaluate(test_x, test_y)
lenet_5_model.summary()

AlexNet:

AlexNet is an 8-layered network because it has learnable parameters at 8 layers: 5 convolutional layers and 3 fully connected layers.

The total number of learnable parameters is around 62 million.

Below is the architecture of AlexNet

AlexNet Architecture

When an input image of 227 x 227 x 3 is convolved with 96 kernels of size 11 x 11 each, using a stride of 4, the outcome is ((227 - 11)/4 + 1) x ((227 - 11)/4 + 1) x 96 = 55 x 55 x 96.

The above outcome is passed to the pooling layer, which reduces the size of its input; here we consider a 3 x 3 filter with a stride of 2. The outcome after max pooling is ((55 - 3)/2 + 1) x ((55 - 3)/2 + 1) x 96 = 27 x 27 x 96.
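Reusing the out_size helper from the LeNet section, the same arithmetic in code:

print(out_size(227, 11, s=4))  # 55 -> conv1 output: 55 x 55 x 96
print(out_size(55, 3, s=2))    # 27 -> after 3x3 max pooling: 27 x 27 x 96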

Just like the LeNet code, we can build the AlexNet architecture and fit it on the input images after resizing them.

Below is the step-by-step procedure for building the AlexNet architecture.

Alexnet = keras.models.Sequential([
    # Conv1: 96 filters of 11x11 with stride 4
    keras.layers.Conv2D(filters=96, kernel_size=(11,11), strides=(4,4),
                        activation='relu', input_shape=(227,227,3)),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(pool_size=(3,3), strides=(2,2)),
    # Conv2: 256 filters of 5x5
    keras.layers.Conv2D(filters=256, kernel_size=(5,5), strides=(1,1),
                        activation='relu', padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(pool_size=(3,3), strides=(2,2)),
    # Conv3-Conv5: three 3x3 convolutions
    keras.layers.Conv2D(filters=384, kernel_size=(3,3), strides=(1,1),
                        activation='relu', padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2D(filters=384, kernel_size=(3,3), strides=(1,1),
                        activation='relu', padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2D(filters=256, kernel_size=(3,3), strides=(1,1),
                        activation='relu', padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(pool_size=(3,3), strides=(2,2)),
    keras.layers.Flatten(),
    # Two fully connected layers with dropout, then the softmax output
    keras.layers.Dense(4096, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(4096, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation='softmax')
])

Model Summary

Alexnet.summary()
AlexNet Model Summary
