Modern NST – Neural Style Transfer (Part I)

This week I completed the 4th course of deeplearning.ai, hosted on Coursera and taught by Andrew Ng. As part of the course I got to learn about NST, or Neural Style Transfer.

Now you are probably wondering what the hell NST is. We can find a brief description in the introduction of the paper written by Yongcheng Jing et al.:

The recent work of Gatys et al. demonstrated the power of Convolutional Neural Networks (CNN) in creating artistic fantastic imagery by separating and recombining the image content and style. This process of using CNN to migrate the semantic content of one image to different styles is referred to as Neural Style Transfer. Since then, Neural Style Transfer has become a trending topic both in academic literature and industrial applications. It is receiving increasing attention from computer vision researchers and several methods are proposed to either improve or extend the original neural algorithm proposed by Gatys et al. However, there is no comprehensive survey presenting and summarizing recent Neural Style Transfer literature. This review aims to provide an overview of the current progress towards Neural Style Transfer, as well as discussing its various applications and open problems for future research.

I will go into the details in a further post. There are a few concepts behind this definition that are worth understanding, like CNNs: how they work and why they have become so important for anything image related.

As a short, non-mathematical introduction: NST needs a model pre-trained on a huge amount of data. For the example shown below I used a VGG-19 architecture (we will get there when talking about deep neural network architectures). I loaded the model with weights pre-trained on the ImageNet dataset, since training NST from scratch doesn't make sense: the network first needs to learn how to detect the patterns in an image that highly "activate" a neuron.
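I won't paste my whole notebook here, but to give a rough idea, this is more or less how such a pre-trained model can be loaded with TensorFlow/Keras. The layer names below are only an illustrative choice, not necessarily the exact ones I picked:

```python
import tensorflow as tf

# Load VGG-19 with weights pre-trained on ImageNet; drop the classifier
# head because NST only needs the convolutional activations.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg.trainable = False  # we optimize the generated image, not the network

# Layers whose activations will be used for content and style.
# These names are just an illustration; other conv layers work too.
content_layer = "block4_conv2"
style_layers = ["block1_conv1", "block2_conv1", "block3_conv1",
                "block4_conv1", "block5_conv1"]

outputs = [vgg.get_layer(name).output for name in [content_layer] + style_layers]
feature_extractor = tf.keras.Model(inputs=vgg.input, outputs=outputs)
```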

So, from a high-level point of view, NST is easy to understand. We will work with 3 images (sketched in code right after the list):

  • content image
  • style image
  • generated image
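
To make that a bit more concrete, here is a minimal sketch of how those three images can be prepared. The file names and sizes are placeholders, and starting the generated image as a noisy copy of the content image is just one common choice:

```python
import tensorflow as tf

def load_image(path, size=(300, 400)):
    # Load a JPEG/PNG, resize it and add a batch dimension: (1, H, W, 3).
    img = tf.keras.utils.load_img(path, target_size=size)
    img = tf.keras.utils.img_to_array(img)
    return tf.expand_dims(img, axis=0)

content_image = load_image("content.jpg")   # placeholder file names
style_image = load_image("style.jpg")

# Start the generated image as a noisy copy of the content image,
# so the optimizer converges faster than from pure noise.
noise = tf.random.uniform(tf.shape(content_image), -20.0, 20.0)
generated_image = tf.Variable(content_image + noise)
```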

The cost functions for the content and the style image are a bit different: the content cost compares layer activations directly, while the style cost is built on a Gram matrix of those activations. Computing this matrix at every epoch and driving both costs down eventually produces the output (in my case I used the 4th of the 19 convolutional layers as the output layer for performance reasons; an experiment using CUDA on an Nvidia GTX 1060 took nearly 7 hours), and you end up seeing it as the generated image.

J(G) = \alpha\, J_{content}(C, G) + \beta\, J_{style}(S, G)    with    G^{[l]}_{kk'} = \sum_{i=1}^{n_H} \sum_{j=1}^{n_W} a^{[l]}_{ijk}\, a^{[l]}_{ijk'}
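
Sketched in code, those pieces could look roughly like this; the normalization constants follow the formulation used in the course, and other implementations scale them slightly differently:

```python
import tensorflow as tf

def gram_matrix(activations):
    # activations: (1, n_H, n_W, n_C) -> channel-correlation matrix (n_C, n_C)
    a = tf.reshape(activations, (-1, activations.shape[-1]))
    return tf.matmul(a, a, transpose_a=True)

def layer_style_cost(style_act, generated_act):
    # Squared difference between the Gram matrices of style and generated activations.
    n_H, n_W, n_C = style_act.shape[1:]
    gs = gram_matrix(style_act)
    gg = gram_matrix(generated_act)
    norm = 4.0 * (n_H * n_W) ** 2 * n_C ** 2
    return tf.reduce_sum(tf.square(gs - gg)) / norm

def content_cost(content_act, generated_act):
    # Squared difference between content and generated activations at one layer.
    n_H, n_W, n_C = content_act.shape[1:]
    return tf.reduce_sum(tf.square(content_act - generated_act)) / (4.0 * n_H * n_W * n_C)
```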

The optimization will run over 220 epochs, printing an update roughly every 1.5 seconds with the result of the last 20 epochs. Ideally it should run up to 400, but after not seeing many changes in the last iterations I stopped it and skipped the remaining 180.
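
For completeness, this is roughly what such an optimization loop looks like; total_cost is a placeholder for a function that combines the content and style costs above with the alpha/beta weights, and the learning rate is just an example value:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)

@tf.function
def train_step(generated_image):
    with tf.GradientTape() as tape:
        # total_cost would combine the content and style costs above,
        # weighted by alpha and beta, using the feature extractor.
        J = total_cost(generated_image)
    grad = tape.gradient(J, generated_image)
    optimizer.apply_gradients([(grad, generated_image)])
    return J

for epoch in range(220):
    J = train_step(generated_image)
    if epoch % 20 == 0:
        print(f"epoch {epoch}: cost = {float(J):.2f}")
```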

Be ready for the next post :P. Comments and feedback are always welcome.