CNN
The most common applications of computer vision include image classification, object detection, and image segmentation. Convolutional neural networks (CNNs) are a type of neural network tailored to these kinds of vision problems. Training a CNN is a supervised machine learning technique, because you need to provide labelled ground-truth samples to train the network. In fact, drawing bounding boxes around objects of interest and labelling the image or parts of the image is often the most laborious part of solving a vision problem.
A CNN has two key building blocks: the convolutional layer and the pooling layer. Its architecture is hierarchical, loosely like that of the human visual system: the deeper layers capture higher-level features built from the simpler features produced by the earlier layers. In general, a deeper network can represent more complex features and so do more to recognize objects. However, as the network gets deeper you run into the vanishing gradient problem, where gradients shrink as they are propagated back through many layers and the early layers learn very slowly. ResNet helps address this by introducing skip connections.
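To make the idea of a skip connection concrete, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not ResNet itself: the learned branch uses two small dense layers instead of convolutions, and the names `relu`, `residual_block`, `w1`, and `w2` are hypothetical. The point is only that the input `x` is added back to the output of the learned branch, so the signal has a direct path through the block.

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is a small two-layer branch.

    Assumption: F uses dense (matrix-multiply) layers for brevity;
    a real ResNet block would use convolutions here.
    """
    fx = relu(x @ w1) @ w2   # learned branch F(x)
    return relu(fx + x)      # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # batch of 4 samples, 8 features each

# With all-zero weights the learned branch contributes nothing,
# yet the input still flows through via the skip connection.
w_zero = np.zeros((8, 8))
y = residual_block(x, w_zero, w_zero)
```

Because the identity path bypasses the learned branch, the block can still pass information (and gradients) along even when that branch contributes little, which is the intuition behind why skip connections help when training deep networks.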
A convolutional layer acts as a filter that extracts features from its input, while the pooling layer performs downsampling, keeping the most important features to pass downstream.
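The two building blocks can be sketched in a few lines of NumPy. This is a simplified illustration, assuming a single-channel image, one hand-picked filter, no padding, and a stride of 1; the function names `conv2d` and `max_pool2d` are hypothetical helpers, not a library API. In a real CNN the filter values are learned during training.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like most deep learning libraries, this computes a
    cross-correlation: the kernel is not flipped.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Downsample by keeping the maximum of each size x size block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fill a block
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# A 6x6 image: dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A 1x2 filter that responds where brightness increases left to right.
kernel = np.array([[-1.0, 1.0]])

features = conv2d(image, kernel)  # 6x5 map, non-zero only at the edge
pooled = max_pool2d(features)     # 3x2 map that still contains the edge response
```

Here the filter highlights the vertical edge in the image, and max pooling shrinks the feature map while preserving that strong response.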
Stride and Padding
Padding adds extra pixels (typically zeros) around the border of a convolutional layer's input so that information at the borders is not lost. Stride determines how far the filter moves at each step as it slides across the input image; a larger stride samples the input less densely and produces a smaller output.
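The effect of both settings on the output size follows the standard formula floor((n + 2p - k) / s) + 1 along each dimension, where n is the input size, k the filter size, p the padding added on each side, and s the stride. The short sketch below applies it; the function name `conv_output_size` and the 32-pixel input are assumptions chosen for illustration.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension of a convolution.

    n: input size, k: filter size,
    padding: pixels added on each side, stride: step of the filter.
    """
    return (n + 2 * padding - k) // stride + 1

# 32-pixel input with a 3-pixel filter:
no_padding = conv_output_size(32, 3)                     # 30: the border shrinks the output
same_size = conv_output_size(32, 3, padding=1)           # 32: padding preserves the size
strided = conv_output_size(32, 3, stride=2, padding=1)   # 16: stride 2 halves the size
```

With no padding the output loses a pixel on each side, one pixel of padding restores the original size for a 3-pixel filter, and a stride of 2 roughly halves the output.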