Getting Started With Convolutional Neural Networks

Edna Figueira Fernandes
May 20, 2020

A convolutional neural network (CNN) is a deep learning algorithm most commonly used to analyze images. The network takes images as input and learns to identify spatial patterns such as colors, shapes, and edges in order to distinguish between classes. The convolutional layers are the key to preserving this spatial information: in these layers, filters are applied to the images, producing feature maps that capture the characteristics that help tell the classes apart.

To understand how CNNs extract information from images, we first need to understand how a computer interprets an image. Figure 1 shows a 15x15-pixel representation of a watermelon image.

Figure 1: watermelon image showing all the image pixels.

The image is 15 pixels high and 15 pixels wide. It also has a depth of 3 because it is a color image: the three channels hold the intensities of red, green, and blue (RGB). When the computer sees this image, it sees a 15x15x3 array (height x width x depth). When passing images through a CNN, the network expects the input to be a 4D tensor (number of images x height x width x depth).
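A quick NumPy sketch makes these shapes concrete (the pixel values here are random stand-ins for a real image):

```python
import numpy as np

# A 15x15 RGB image as the computer sees it: a 3D array of
# intensity values (height x width x depth), each in [0, 255].
image = np.random.randint(0, 256, size=(15, 15, 3), dtype=np.uint8)
print(image.shape)  # (15, 15, 3)

# A CNN expects a batch of images: a 4D tensor
# (number of images x height x width x depth).
batch = image[np.newaxis, ...]
print(batch.shape)  # (1, 15, 15, 3)
```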

The input tensor is then convolved before being passed to the next convolutional layer. In the convolution step, different filters are applied to the tensor. Applying these filters discards less relevant detail and helps the network focus on the more important aspects of the image. Each filter is convolved across the height and width of the image, and the outputs are called feature maps. Figure 2 illustrates this step.

Figure 2: Basic representation of the convolution step.

To perform convolution on this image, a filter of shape 3x3 is moved horizontally and vertically across the input image. At each step, the value of each pixel covered by the filter is multiplied by the corresponding number in the filter. The results of all these multiplications are summed and placed in a single pixel of the feature map. For visualization purposes, Figure 3 shows an input image of a dog and the two feature maps that result from applying two different filters to it. In the first output, the vertical lines stand out; in the second, the horizontal lines do. This is the spatial information captured by the convolution mentioned earlier.
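The multiply-and-sum mechanics above can be sketched in a few lines of NumPy. This is a minimal, unoptimized version (no padding, stride 1); the vertical-edge filter shown is a classic hand-picked example, whereas in a real CNN the filter values are learned during training:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a filter across a 2D image (no padding, stride 1).
    Each output pixel is the sum of the element-wise products
    between the filter and the image patch it covers."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 vertical-edge filter: it responds strongly wherever
# pixel intensity changes from left to right.
vertical_filter = np.array([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]])

image = np.random.rand(15, 15)
feature_map = convolve2d(image, vertical_filter)
print(feature_map.shape)  # (13, 13) -- a 3x3 filter shrinks each side by 2
```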

Figure 3: Result of applying two filters in a dog image.

After each convolutional layer, it is common practice to apply an activation function. The convolution step itself is purely linear, so applying an activation function adds non-linearity to the network and helps it learn more complex relationships from the data.
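The most common choice in CNNs is the ReLU activation, which can be written in one line:

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and zeroes out negatives,
    # which is what introduces the non-linearity.
    return np.maximum(0, x)

feature_map = np.array([[-1.5, 2.0],
                        [0.3, -0.7]])
activated = relu(feature_map)
# The negative entries become 0; the positive ones pass through unchanged.
```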

Following the activation function, one may choose to apply a pooling layer, a downsampling technique. It is based on the notion that strong responses in a feature map matter more than their exact location. Pooling is important because it reduces the size of the feature maps, cuts computation time, and helps prevent overfitting. The most commonly used technique is max pooling, in which only the maximum value in each window is kept; this concept is illustrated in Figure 4. When applying max pooling, a kernel size is chosen, usually 2x2, with a stride of the same size, in this case 2. The kernel moves horizontally and vertically across the feature map, selecting the highest number in each window and generating the pooled output.
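The 2x2, stride-2 max pooling described above can be sketched directly; each non-overlapping window contributes its single largest value, halving each spatial dimension:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling: keep only the largest value in each window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 0],
               [7, 2, 9, 8],
               [1, 0, 3, 4]])
pooled = max_pool(fm)  # [[6. 5.]
                       #  [7. 9.]]
```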

Figure 4: Basic illustration of the max pooling layer.

The final steps are flattening the feature maps into a feature vector and passing it through fully connected linear layers. The output is a probability distribution over the classes, from which the predicted class can be extracted.
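These last steps can be sketched as follows. The feature-map size (4x4 with 8 channels), the number of classes (3), and the random weights are all illustrative stand-ins; in a trained network the weights would be learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the last pooling layer produced a 4x4 feature map
# with 8 channels; flatten it into a single feature vector.
feature_maps = rng.standard_normal((4, 4, 8))
vector = feature_maps.flatten()  # shape (128,)

# One fully connected layer mapping the vector to 3 class scores.
weights = rng.standard_normal((3, vector.size))
bias = rng.standard_normal(3)
scores = weights @ vector + bias

# Softmax turns the raw scores into a probability distribution;
# subtracting the max first keeps the exponentials numerically stable.
probs = np.exp(scores - scores.max())
probs /= probs.sum()

predicted_class = probs.argmax()  # index of the most probable class
```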

