Convolutional neural networks: 2. zero padding
Convolutional neural networks (CNN)
In the previous posting, we discussed how a (two-dimensional) convolutional layer works in comparison to a fully connected layer, which is a basic building block for a feedforward neural network (FNN). A convolutional layer is likewise a basic building block for a convolutional neural network (CNN), but a CNN has other components as well, one of which we discuss in this posting.
Reference
Our main reference is the lecture notes by Smets, an excellent reference for mathematicians who pursue deep learning.
Review: convolutional layers
Let's first recall the general setting of a convolutional layer from the previous posting. The input consists of $c$ matrices $X = (X[0], X[1], \dots, X[c-1])$, each of fixed size $h \times w$. For each such matrix $X[k]$, we have $c'$ trainable $m \times m$ matrices
$$K[0,k], K[1,k], \dots, K[c'-1,k]$$
with which we filter $X[k]$. The resulting weight map $W : \mathbb{R}^{chw} \rightarrow \mathbb{R}^{c'(h-m+1)(w-m+1)}$ is given by
$$W_X = (W_X[0], W_X[1], \dots, W_X[c'-1]),$$
where
$$W_X[l] = K[l, 0] \star X[0] + K[l, 1] \star X[1] + \cdots + K[l, c-1] \star X[c-1],$$
where each summand is given by the (2-dimensional discrete) cross-correlation: that is, the $(i,j)$-entry of $K[l, k] \star X[k]$ is defined as
$$(K[l,k] \star X[k])_{i,j} := \sum_{i',j'=0}^{m-1}K[l, k]_{i',j'} X[k]_{i+i', j+j'}$$
for $0 \leq i \leq h-m$ and $0 \leq j \leq w-m$.
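The cross-correlation formula above can be checked directly with a short NumPy sketch (a naive loop-based implementation for illustration; the function name `cross_correlate` is ours, and real libraries use much more efficient routines):

```python
import numpy as np

def cross_correlate(K, X):
    """2D discrete cross-correlation of an m x m kernel K with an h x w matrix X.

    The (i, j) entry of the output is sum_{i', j'} K[i', j'] * X[i + i', j + j'],
    so the output has shape (h - m + 1, w - m + 1).
    """
    m = K.shape[0]
    h, w = X.shape
    out = np.empty((h - m + 1, w - m + 1))
    for i in range(h - m + 1):
        for j in range(w - m + 1):
            # elementwise product of the kernel with the m x m window at (i, j)
            out[i, j] = np.sum(K * X[i:i + m, j:j + m])
    return out

# Example: a 2 x 2 all-ones kernel sums each 2 x 2 window of X.
X = np.arange(16.0).reshape(4, 4)
K = np.ones((2, 2))
Y = cross_correlate(K, X)  # shape (3, 3); Y[0, 0] = 0 + 1 + 4 + 5 = 10
```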
Here is some terminology often used:
- $c$ is called the number of input channels, and
- $c'$ is called the number of output channels.
We note that each output channel corresponds to one filter. Such a filter is applied, via the cross-correlation operation, to the $h \times w$ matrices over all input channels, and we sum all the results to get the corresponding component of the output. (Then a bias vector component is added, which provides additional trainable parameters; after that we may apply an activation map such as ReLU or tanh.) We again emphasize the great visualization due to Visin that explains this. Note that if we fix an input channel $k \in [0, c-1]$ and an output channel $l \in [0, c'-1]$, what we do above is transform the $h \times w$ matrix $X[k]$ into the $(h - m + 1) \times (w - m + 1)$ matrix $K[l, k] \star X[k]$. Since we assume $1 < m \leq h, w$, this guarantees that the spatial size of the images shrinks. (If $c' = c$, the total dimension of the tensors shrinks as well.)
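Putting the pieces together, the full multi-channel layer described above can be sketched in NumPy as follows (the function name `conv_layer` and the loop-based style are ours, for illustration only; we omit the activation map):

```python
import numpy as np

def conv_layer(X, K, b):
    """Multi-channel convolutional layer via cross-correlation.

    X: input of shape (c, h, w), one h x w matrix per input channel.
    K: kernels of shape (c_out, c, m, m), one m x m kernel per (output, input) pair.
    b: bias of shape (c_out,), one scalar per output channel.
    Returns an array of shape (c_out, h - m + 1, w - m + 1), where output channel l is
    sum_k K[l, k] star X[k] plus the bias b[l].
    """
    c_out, c, m, _ = K.shape
    _, h, w = X.shape
    out = np.empty((c_out, h - m + 1, w - m + 1))
    for l in range(c_out):
        acc = np.zeros((h - m + 1, w - m + 1))
        for k in range(c):  # sum the cross-correlations over all input channels
            for i in range(h - m + 1):
                for j in range(w - m + 1):
                    acc[i, j] += np.sum(K[l, k] * X[k, i:i + m, j:j + m])
        out[l] = acc + b[l]  # add the bias for output channel l
    return out

# Example with c = 3 input channels, c' = 4 output channels, 2 x 2 kernels.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5, 5))
K = rng.standard_normal((4, 3, 2, 2))
b = np.zeros(4)
out = conv_layer(X, K, b)  # shape (4, 4, 4)
```

Note that the number of trainable parameters, $c' c m^2 + c'$, is independent of the image size $h \times w$, in contrast to a fully connected layer.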