Convolutional neural networks: 2. zero padding
Convolutional neural networks (CNN)
In the previous posting, we discussed how a (two-dimensional) convolutional layer works in comparison to a fully connected layer, which is a basic building block for a feedforward neural network (FNN). A convolutional layer is likewise a basic building block for a convolutional neural network (CNN), but a CNN has other components as well, one of which we discuss in this posting.
Reference
Our main reference is the lecture notes by Smets, an excellent reference for mathematicians who pursue deep learning.
Review: convolutional layers
Let's first recall the general setting of a convolutional layer from the previous posting. The input consists of $c$ matrices $X = (X[0], X[1], \dots, X[c-1])$, each of fixed size $h \times w$. For each such matrix $X[k]$, we have $c'$ trainable $m \times m$ matrices
$$K[0,k], K[1,k], \dots, K[c'-1,k]$$
with which we filter $X[k]$. The resulting weight map $W : \mathbb{R}^{chw} \rightarrow \mathbb{R}^{c'(h-m+1)(w-m+1)}$ is given by
$$W_X = (W_X[0], W_X[1], \dots, W_X[c'-1]),$$
where
$$W_X[l] = K[l, 0] \star X[0] + K[l, 1] \star X[1] + \cdots + K[l, c-1] \star X[c-1],$$
where each summand is given by the (2-dimensional discrete) cross-correlation: that is, the $(i,j)$-entry of $K[l, k] \star X[k]$ is defined as
$$(K[l,k] \star X[k])_{i,j} := \sum_{i',j'=0}^{m-1}K[l, k]_{i',j'} X[k]_{i+i', j+j'}$$
for $0 \leq i \leq h-m$ and $0 \leq j \leq w-m$.
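The cross-correlation formula above can be checked directly with a short NumPy sketch (a naive loop-based implementation for illustration; the function name `cross_correlate` is ours, and real libraries use much more efficient routines):

```python
import numpy as np

def cross_correlate(K, X):
    """2D discrete cross-correlation of an m x m kernel K with an h x w matrix X.

    The (i, j) entry of the output is sum_{i', j'} K[i', j'] * X[i + i', j + j'],
    so the output has shape (h - m + 1, w - m + 1).
    """
    m = K.shape[0]
    h, w = X.shape
    out = np.empty((h - m + 1, w - m + 1))
    for i in range(h - m + 1):
        for j in range(w - m + 1):
            # elementwise product of the kernel with the m x m window at (i, j)
            out[i, j] = np.sum(K * X[i:i + m, j:j + m])
    return out

# Example: a 2 x 2 all-ones kernel sums each 2 x 2 window of X.
X = np.arange(16.0).reshape(4, 4)
K = np.ones((2, 2))
Y = cross_correlate(K, X)  # shape (3, 3); Y[0, 0] = 0 + 1 + 4 + 5 = 10
```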
Here is some terminology often used:
- $c$ is called the number of input channels, and
- $c'$ is called the number of output channels.
We note that each output channel corresponds to one filter. Such a filter is applied, via the cross-correlation operation, to the $h \times w$ matrices over all input channels, and we sum all the results to get the corresponding component of the output. (Then a bias vector component is added, which provides additional trainable parameters; after that we may apply an activation map such as ReLU or tanh.) We again emphasize the great visualization due to Visin that explains this. Note that if we fix an input channel $k \in [0, c-1]$ and an output channel $l \in [0, c'-1]$, what we do above is transform the $h \times w$ matrix $X[k]$ into the $(h - m + 1) \times (w - m + 1)$ matrix $K[l, k] \star X[k]$. Since we assume $1 < m \leq h, w$, this guarantees that the spatial size of the images shrinks. (If $c' = c$, the total dimension of the tensors shrinks as well.)
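Putting the pieces together, the full multi-channel layer described above can be sketched in NumPy as follows (the function name `conv_layer` and the loop-based style are ours, for illustration only; we omit the activation map):

```python
import numpy as np

def conv_layer(X, K, b):
    """Multi-channel convolutional layer via cross-correlation.

    X: input of shape (c, h, w), one h x w matrix per input channel.
    K: kernels of shape (c_out, c, m, m), one m x m kernel per (output, input) pair.
    b: bias of shape (c_out,), one scalar per output channel.
    Returns an array of shape (c_out, h - m + 1, w - m + 1), where output channel l is
    sum_k K[l, k] star X[k] plus the bias b[l].
    """
    c_out, c, m, _ = K.shape
    _, h, w = X.shape
    out = np.empty((c_out, h - m + 1, w - m + 1))
    for l in range(c_out):
        acc = np.zeros((h - m + 1, w - m + 1))
        for k in range(c):  # sum the cross-correlations over all input channels
            for i in range(h - m + 1):
                for j in range(w - m + 1):
                    acc[i, j] += np.sum(K[l, k] * X[k, i:i + m, j:j + m])
        out[l] = acc + b[l]  # add the bias for output channel l
    return out

# Example with c = 3 input channels, c' = 4 output channels, 2 x 2 kernels.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5, 5))
K = rng.standard_normal((4, 3, 2, 2))
b = np.zeros(4)
out = conv_layer(X, K, b)  # shape (4, 4, 4)
```

Note that the number of trainable parameters, $c' c m^2 + c'$, is independent of the image size $h \times w$, in contrast to a fully connected layer.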