A First Look at Convolutional Neural Networks: VGG-19 Study Notes

Deep learning first took off in computer vision, and the most widely used model in CV is the CNN (convolutional neural network; for a Theano implementation of a CNN model, see CNNModel.py). CNNs are of course no longer limited to CV: they are now widely applied in NLP as well, as discussed in my other post "Convolutional Neural Networks (CNN) in NLP" (GitHub: GRU-or-CNN). CNNs first made their mark on the ILSVRC 2012 classification task, where the model from Alex Krizhevsky et al. achieved a 16.4% top-5 error rate, far ahead of the runner-up at 26.2%. The later VGGNet brought the top-5 error on the same classification task down to 7.3%. VGGNet comes in two widely used depths, VGG-16 and VGG-19, and recently I have seen quite a few methods and projects built on these two models, including:


  1. Image captioning: given an image, the model generates a sentence describing it. GitHub: neuraltalk

  2. Neural storytelling: a bit like a grade-school essay, this is an upgraded version of image captioning that generates a long, romantic passage (depending on the training corpus) describing the image. GitHub: neural-storyteller

  3. Visual question answering: this paper is also quite interesting. Given an image and a question about it, the model produces an answer, e.g. how many people are in the picture, or what is next to the table. The architecture is much like the image-captioning model above: the image vector is treated as the RNN's first input, followed by the question. The real highlight of the paper is constructing an image QA dataset from existing image-caption datasets, a creative step worth praising. GitHub: neural-vqa. Paper: "Exploring Models and Data for Image Question Answering" (NIPS 2015)

  4. Teaching computers to paint: the 2015 paper "A Neural Algorithm of Artistic Style". Given two images, one of which serves as the style reference, the model renders the other image so that its look matches the style image. GitHub: Art Style Transfer

  5. Painting from doodles (the computer paints from a reference), an upgraded version of the previous model. Given an image, its semantic annotations (a doodle), and a second doodle sketching the rough layout of the desired output, the computer generates a picture matching your intent. Paper: "Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artwork". GitHub: neural-doodle

  6. One more very entertaining model: it does not use VGG-16 or VGG-19, but it does use a CNN, and the underlying ideas are similar. It adopts adversarial learning: the model has two parts, one that generates images and one that judges whether a generated image is plausible (e.g. a human face). Adversarial training improves both parts simultaneously, eventually yielding a model that generates images of a particular kind (e.g. faces). GitHub: torch-gan


A quick aside: of the six projects and papers above, the first three combine CV with NLP. Both CV and NLP are branches of AI, and marrying the two is surely one of AI's grand ambitions, so CV+NLP can fairly be called an emerging direction for AI research, especially with deep learning as hot as it is now.

Since VGG-16 and VGG-19 are applied so widely, let's take a quick look at the two models. Using VGG-19 as the example, here is the Lasagne version (Lasagne is a deep learning library built on top of Theano):

from lasagne.layers import InputLayer
from lasagne.layers.dnn import Conv2DDNNLayer as ConvLayer
from lasagne.layers import Pool2DLayer as PoolLayer
def build_model():  # build the VGG-19 convolutional network
    net = {}
    # IMAGE_W: input image width/height, assumed defined elsewhere (e.g. 224)
    net['input'] = InputLayer((1, 3, IMAGE_W, IMAGE_W))
    net['conv1_1'] = ConvLayer(net['input'], 64, 3, pad=1)
    net['conv1_2'] = ConvLayer(net['conv1_1'], 64, 3, pad=1)
    net['pool1'] = PoolLayer(net['conv1_2'], 2, mode='average_exc_pad')
    net['conv2_1'] = ConvLayer(net['pool1'], 128, 3, pad=1)
    net['conv2_2'] = ConvLayer(net['conv2_1'], 128, 3, pad=1)
    net['pool2'] = PoolLayer(net['conv2_2'], 2, mode='average_exc_pad')
    net['conv3_1'] = ConvLayer(net['pool2'], 256, 3, pad=1)
    net['conv3_2'] = ConvLayer(net['conv3_1'], 256, 3, pad=1)
    net['conv3_3'] = ConvLayer(net['conv3_2'], 256, 3, pad=1)
    net['conv3_4'] = ConvLayer(net['conv3_3'], 256, 3, pad=1)
    net['pool3'] = PoolLayer(net['conv3_4'], 2, mode='average_exc_pad')
    net['conv4_1'] = ConvLayer(net['pool3'], 512, 3, pad=1)
    net['conv4_2'] = ConvLayer(net['conv4_1'], 512, 3, pad=1)
    net['conv4_3'] = ConvLayer(net['conv4_2'], 512, 3, pad=1)
    net['conv4_4'] = ConvLayer(net['conv4_3'], 512, 3, pad=1)
    net['pool4'] = PoolLayer(net['conv4_4'], 2, mode='average_exc_pad')
    net['conv5_1'] = ConvLayer(net['pool4'], 512, 3, pad=1)
    net['conv5_2'] = ConvLayer(net['conv5_1'], 512, 3, pad=1)
    net['conv5_3'] = ConvLayer(net['conv5_2'], 512, 3, pad=1)
    net['conv5_4'] = ConvLayer(net['conv5_3'], 512, 3, pad=1)
    net['pool5'] = PoolLayer(net['conv5_4'], 2, mode='average_exc_pad')
    return net
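
Before looking at the parameters in detail, here is a minimal usage sketch of my own (not from the original project): it compiles a Theano function that maps an input image to the pool5 activations. It assumes IMAGE_W = 224 (the canonical VGG input size) and a cuDNN-capable GPU, since Conv2DDNNLayer is the cuDNN-backed layer. Note that the weights here are randomly initialized; in practice one would load the pretrained VGG-19 parameters.

import numpy as np
import theano
import lasagne

IMAGE_W = 224  # assumed input width/height; the VGG paper uses 224x224 crops
net = build_model()
pool5 = lasagne.layers.get_output(net['pool5'], deterministic=True)  # symbolic output
features = theano.function([net['input'].input_var], pool5)  # compile: image in, features out
x = np.random.rand(1, 3, IMAGE_W, IMAGE_W).astype(theano.config.floatX)
print(features(x).shape)  # (1, 512, 7, 7): 224 halved by five pooling layers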

The Lasagne implementation is very simple, involving just two layer classes, Conv2DDNNLayer and Pool2DLayer. For Conv2DDNNLayer the first four parameters are the important ones; the official documentation reads:


2D convolutional layer
Performs a 2D convolution on its input and optionally adds a bias and applies an elementwise nonlinearity. This is an alternative implementation which uses theano.sandbox.cuda.dnn.dnn_conv directly.

Parameters:

  • incoming : a Layer instance or a tuple
    The layer feeding into this layer, or the expected input shape. The output of this layer should be a 4D tensor, with shape (batch_size, num_input_channels, input_rows, input_columns).
  • num_filters : int
    The number of learnable convolutional filters this layer has.
  • filter_size : int or iterable of int
    An integer or a 2-element tuple specifying the size of the filters.
  • stride : int or iterable of int
    An integer or a 2-element tuple specifying the stride of the convolution operation.
  • pad : int, iterable of int, ‘full’, ‘same’ or ‘valid’ (default: 0)
    By default, the convolution is only computed where the input and the filter fully overlap (a valid convolution). When stride=1, this yields an output that is smaller than the input by filter_size - 1. The pad argument allows you to implicitly pad the input with zeros, extending the output size.
    A single integer results in symmetric zero-padding of the given size on all borders, a tuple of two integers allows different symmetric padding per dimension.
    ‘full’ pads with one less than the filter size on both sides. This is equivalent to computing the convolution wherever the input and the filter overlap by at least one position.
    ‘same’ pads with half the filter size (rounded down) on both sides. When stride=1 this results in an output size equal to the input size. Even filter size is not supported.
    ‘valid’ is an alias for 0 (no padding / a valid convolution).
    Note that ‘full’ and ‘same’ can be faster than equivalent integer values due to optimizations by Theano.
  • untie_biases : bool (default: False)
    If False, the layer will have a bias parameter for each channel, which is shared across all positions in this channel. As a result, the b attribute will be a vector (1D).
    If True, the layer will have separate bias parameters for each position in each channel. As a result, the b attribute will be a 3D tensor.
  • W : Theano shared variable, expression, numpy array or callable
    Initial value, expression or initializer for the weights. These should be a 4D tensor with shape (num_filters, num_input_channels, filter_rows, filter_columns). See lasagne.utils.create_param() for more information.
  • b : Theano shared variable, expression, numpy array, callable or None
    Initial value, expression or initializer for the biases. If set to None, the layer will have no biases. Otherwise, biases should be a 1D array with shape (num_filters,) if untie_biases is set to False. If it is set to True, its shape should be (num_filters, output_rows, output_columns) instead. See lasagne.utils.create_param() for more information.
  • nonlinearity : callable or None
    The nonlinearity that is applied to the layer activations. If None is provided, the layer will be linear.
  • flip_filters : bool (default: False)
    Whether to flip the filters and perform a convolution, or not to flip them and perform a correlation. Flipping adds a bit of overhead, so it is disabled by default. In most cases this does not make a difference anyway because the filters are learnt. However, flip_filters should be set to True if weights are loaded into it that were learnt using a regular lasagne.layers.Conv2DLayer, for example.
  • **kwargs
    Any additional keyword arguments are passed to the Layer superclass.

A quick note on Conv2DDNNLayer: the first argument is the incoming layer, the second is the number of convolution filters, and the third is the filter size (a tuple or a single value; a single value i means size = (i, i)). pad is the number of zero-valued pixels added along each border. For example:
Input: $n \times c_i \times h_i \times w_i$
Output: $n \times c_o \times h_o \times w_o$, where $h_o = (h_i + 2 \times pad_h - kernel_h) / stride_h + 1$ and $w_o$ is computed the same way.
Here $n$ is the batch size, $c_i$ the number of input channels (num_input_channels), and $c_o$ the number of output channels (i.e. num_filters). The layer's filters, i.e. the weight tensor W, have shape $(c_o, c_i, kernel_h, kernel_w)$. Concretely, each filter of shape $(c_i, kernel_h, kernel_w)$ is convolved with the $c_i$ input channels (one $(kernel_h, kernel_w)$ kernel per channel), producing $c_i$ convolution results; summing these $c_i$ results yields a single output of shape $(h_o, w_o)$. Since there are $c_o$ different filters, the layer produces $c_o$ outputs of shape $(h_o, w_o)$.
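
As a quick sanity check on this formula (a tiny illustrative helper of my own, not part of the original code): every 3x3 convolution in VGG-19 uses pad=1 and stride=1, so the spatial size is preserved.

def conv_out_size(h_in, kernel, pad, stride=1):
    # h_o = (h_i + 2*pad_h - kernel_h) / stride_h + 1, with integer division
    return (h_in + 2 * pad - kernel) // stride + 1

print(conv_out_size(224, kernel=3, pad=1))  # 224: a 3x3 conv with pad=1 keeps the size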
The activation function defaults to rectify (ReLU).
Pool2DLayer is analogous: the second argument is the pooling kernel size (again, a single value i means size = (i, i)), and mode='average_exc_pad' selects average pooling (excluding padding). Pooling operates on each input channel independently, has no learnable parameters, and does not change the number of channels.
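
Applying the same size formula to the 2x2, stride-2 pooling layers (again a sketch of my own) traces how the spatial resolution shrinks across VGG-19's five blocks:

h = 224  # assumed input size
for i in range(5):
    # the pad=1, stride=1 convs leave h unchanged; each 2x2/stride-2 pool halves it
    h = (h - 2) // 2 + 1
    print('after pool%d: %d' % (i + 1, h))  # 112, 56, 28, 14, 7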

Next, let's look at the mxnet implementation (mxnet is another deep learning library, with bindings for Python and several other languages; GitHub: mxnet). It is slightly longer than the Lasagne version but even clearer: the VGG-19 layer hierarchy is plain to see, essentially convolution -> nonlinearity -> convolution -> nonlinearity -> ... -> pooling -> next convolution block, and so on. A quick shape check follows the listing.

data = mx.sym.Variable("data")
conv1_1 = mx.symbol.Convolution(name='conv1_1', data=data , num_filter=64, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu1_1 = mx.symbol.Activation(name='relu1_1', data=conv1_1 , act_type='relu')
conv1_2 = mx.symbol.Convolution(name='conv1_2', data=relu1_1 , num_filter=64, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu1_2 = mx.symbol.Activation(name='relu1_2', data=conv1_2 , act_type='relu')
pool1 = mx.symbol.Pooling(name='pool1', data=relu1_2 , pad=(0,0), kernel=(2,2), stride=(2,2), pool_type='avg')
conv2_1 = mx.symbol.Convolution(name='conv2_1', data=pool1 , num_filter=128, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu2_1 = mx.symbol.Activation(name='relu2_1', data=conv2_1 , act_type='relu')
conv2_2 = mx.symbol.Convolution(name='conv2_2', data=relu2_1 , num_filter=128, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu2_2 = mx.symbol.Activation(name='relu2_2', data=conv2_2 , act_type='relu')
pool2 = mx.symbol.Pooling(name='pool2', data=relu2_2 , pad=(0,0), kernel=(2,2), stride=(2,2), pool_type='avg')
conv3_1 = mx.symbol.Convolution(name='conv3_1', data=pool2 , num_filter=256, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu3_1 = mx.symbol.Activation(name='relu3_1', data=conv3_1 , act_type='relu')
conv3_2 = mx.symbol.Convolution(name='conv3_2', data=relu3_1 , num_filter=256, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu3_2 = mx.symbol.Activation(name='relu3_2', data=conv3_2 , act_type='relu')
conv3_3 = mx.symbol.Convolution(name='conv3_3', data=relu3_2 , num_filter=256, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu3_3 = mx.symbol.Activation(name='relu3_3', data=conv3_3 , act_type='relu')
conv3_4 = mx.symbol.Convolution(name='conv3_4', data=relu3_3 , num_filter=256, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu3_4 = mx.symbol.Activation(name='relu3_4', data=conv3_4 , act_type='relu')
pool3 = mx.symbol.Pooling(name='pool3', data=relu3_4 , pad=(0,0), kernel=(2,2), stride=(2,2), pool_type='avg')
conv4_1 = mx.symbol.Convolution(name='conv4_1', data=pool3 , num_filter=512, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu4_1 = mx.symbol.Activation(name='relu4_1', data=conv4_1 , act_type='relu')
conv4_2 = mx.symbol.Convolution(name='conv4_2', data=relu4_1 , num_filter=512, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu4_2 = mx.symbol.Activation(name='relu4_2', data=conv4_2 , act_type='relu')
conv4_3 = mx.symbol.Convolution(name='conv4_3', data=relu4_2 , num_filter=512, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu4_3 = mx.symbol.Activation(name='relu4_3', data=conv4_3 , act_type='relu')
conv4_4 = mx.symbol.Convolution(name='conv4_4', data=relu4_3 , num_filter=512, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu4_4 = mx.symbol.Activation(name='relu4_4', data=conv4_4 , act_type='relu')
pool4 = mx.symbol.Pooling(name='pool4', data=relu4_4 , pad=(0,0), kernel=(2,2), stride=(2,2), pool_type='avg')
conv5_1 = mx.symbol.Convolution(name='conv5_1', data=pool4 , num_filter=512, pad=(1,1), kernel=(3,3), stride=(1,1), no_bias=False, workspace=1024)
relu5_1 = mx.symbol.Activation(name='relu5_1', data=conv5_1 , act_type='relu')
# layers whose activations serve as style and content representations (neural style transfer)
style = mx.sym.Group([relu1_1, relu2_1, relu3_1, relu4_1, relu5_1])
content = mx.sym.Group([relu4_2])
out = mx.sym.Group([style, content])
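
As with the Lasagne version, here is a quick check of my own (assuming the classic mxnet Symbol API; not part of the original listing) that uses infer_shape to report the shapes of the grouped outputs for a 1x3x224x224 input:

import mxnet as mx  # assumes the symbol definitions above have been executed

arg_shapes, out_shapes, aux_shapes = out.infer_shape(data=(1, 3, 224, 224))
for name, shape in zip(out.list_outputs(), out_shapes):
    print(name, shape)
# the style outputs keep their pre-pooling resolutions: relu1_1 (1, 64, 224, 224),
# relu2_1 (1, 128, 112, 112), ..., relu5_1 (1, 512, 14, 14);
# the content output relu4_2 is (1, 512, 28, 28)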