[YixuanBan Notes] Image-to-Image Translation with Conditional Adversarial Networks

2017-10-10

Posted by 班怡璇

Abstract:

  1. Conditional adversarial networks offer a general-purpose solution to image-to-image translation problems: they not only learn the mapping from input image to output image, but also learn a loss function to train this mapping, which avoids hand-engineering the loss.
  2. This approach can effectively synthesize photos from label maps, reconstruct objects from edge maps, and colorize images, among other tasks.

Introduction:

  1. Traditional image-to-image translation problems are each solved with separate, special-purpose machinery, so in this paper the writer proposes a common framework that covers all of these problems.
  2. CNNs can be applied to many prediction problems; however, we still have to tell the CNN what we wish it to minimize, i.e. hand-design the loss function, and a naive loss (such as per-pixel Euclidean distance) tends to produce blurry results.
  3. If the loss function is learned automatically instead, we get a GAN: the network learns a loss that tries to classify whether an output image is real or fake, while the generator is trained to minimize this loss. Under such a learned loss, blurry images are not tolerated because they obviously look fake. (This amounts to learning a generative model of the data.)
  4. The writer's contributions: 1> demonstrate that on a wide variety of problems, cGANs produce reasonable results; 2> present a framework sufficient to achieve good results and analyze the effects of several architectural choices.

Related works:

  1. Traditional per-pixel losses treat the output pixels as conditionally independent given the input. The cGAN instead learns a structured loss over the whole output.
  2. The proposed cGAN framework is not tailored to a specific application; it can be used as a general-purpose solution.

Method:

  1. Traditional GANs learn a mapping from random noise z to an output image y. The cGAN instead learns a mapping from an observed image x and noise z to the output image y.
  2. The generator G and the discriminator D are trained adversarially.
  3. The cGAN objective is L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 - D(x, G(x, z)))], which G tries to minimize while D tries to maximize it. The final objective adds an L1 term L_L1(G) = E_{x,y,z}[||y - G(x, z)||_1], giving G* = arg min_G max_D L_cGAN(G, D) + lambda * L_L1(G) (see the loss sketch after this list).
  4. Without z, the net could still learn a mapping from x to y, but would produce deterministic outputs. In initial experiments the generator simply learned to ignore the noise, so in the final models the writer provides noise only in the form of dropout, which introduces some stochasticity in the outputs of the net.
  5. Previous work uses an encoder-decoder network, in which the input passes through a series of layers that progressively downsample until a bottleneck layer, at which point the process is reversed. However, a great deal of low-level information (such as edge locations) is shared between input and output and gets lost through the bottleneck, so the writer adds skip connections between each layer i and layer n-i, giving a U-Net-shaped generator (see the U-Net sketch after this list).
  6. An L1 or L2 loss alone produces blurry results on image generation problems. Despite this, it predicts the low frequencies accurately, so we do not need an entirely new framework to enforce correctness at the low frequencies: L1 is enough for that part.
  7. The GAN discriminator therefore only needs to model high-frequency structure (while L1 handles the low frequencies), so it is sufficient to restrict its attention to local image patches. Such a PatchGAN, which classifies each N x N patch as real or fake, has fewer parameters, runs faster, and can be applied to arbitrarily large images (see the PatchGAN sketch after this list).
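
  A minimal sketch of the objective in code, to make the alternating min/max concrete. This is my own PyTorch illustration, not the writer's implementation; the helper names d_step_loss and g_step_loss are assumptions, and lambda_l1 = 100 is the weight reported in the paper.

    import torch
    import torch.nn as nn

    bce = nn.BCEWithLogitsLoss()   # adversarial term: per-patch real/fake classification
    l1 = nn.L1Loss()               # reconstruction term: keeps the low frequencies correct
    lambda_l1 = 100.0              # L1 weight from the paper's final objective

    def d_step_loss(D, x, y, y_fake):
        # Discriminator step: push D(x, y) toward "real" and D(x, G(x, z)) toward "fake".
        real_logits = D(x, y)
        fake_logits = D(x, y_fake.detach())    # detach so this step does not update the generator
        return (bce(real_logits, torch.ones_like(real_logits)) +
                bce(fake_logits, torch.zeros_like(fake_logits)))

    def g_step_loss(D, x, y, y_fake):
        # Generator step: fool the discriminator on every patch and stay close to y in L1.
        fake_logits = D(x, y_fake)
        adv = bce(fake_logits, torch.ones_like(fake_logits))
        return adv + lambda_l1 * l1(y_fake, y)

  Alternating gradient steps on these two losses is one standard way to realize arg min_G max_D L_cGAN(G, D) + lambda * L_L1(G).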
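
  To make the skip-connection idea in item 5 concrete, here is a deliberately tiny U-Net-style generator in PyTorch. It is a sketch with made-up depth and channel counts (the paper's generator is much deeper); the torch.cat skip connections between mirrored layers are the point, and dropout in the decoder plays the role of the noise z from item 4.

    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        def __init__(self, in_ch=3, out_ch=3, base=64):
            super().__init__()
            # Encoder: progressively downsample toward a bottleneck.
            self.down1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2))
            self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1),
                                       nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2))
            self.down3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 4, 2, 1),
                                       nn.BatchNorm2d(base * 4), nn.LeakyReLU(0.2))
            # Decoder: upsample back; dropout is the only source of stochasticity.
            self.up3 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1),
                                     nn.BatchNorm2d(base * 2), nn.Dropout(0.5), nn.ReLU())
            self.up2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 4, 2, 1),
                                     nn.BatchNorm2d(base), nn.ReLU())
            self.up1 = nn.Sequential(nn.ConvTranspose2d(base * 2, out_ch, 4, 2, 1), nn.Tanh())

        def forward(self, x):
            d1 = self.down1(x)                           # layer 1
            d2 = self.down2(d1)                          # layer 2
            d3 = self.down3(d2)                          # bottleneck
            u3 = self.up3(d3)
            u2 = self.up2(torch.cat([u3, d2], dim=1))    # skip: layer 2 to its mirror
            return self.up1(torch.cat([u2, d1], dim=1))  # skip: layer 1 to its mirror

  TinyUNet()(torch.randn(1, 3, 256, 256)) returns a tensor of the same spatial size; the concatenations are what let low-level information (e.g. edge locations) bypass the bottleneck.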
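
  Finally, a minimal PatchGAN-style discriminator sketch in PyTorch for item 7. It outputs a grid of real/fake logits, one per overlapping local patch, instead of a single scalar for the whole image; the layer and channel choices here are illustrative, not the paper's exact 70x70 configuration.

    import torch
    import torch.nn as nn

    class PatchDiscriminator(nn.Module):
        def __init__(self, in_ch=6, base=64):   # in_ch = channels of input image + output image
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.Conv2d(base, base * 2, 4, 2, 1), nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2),
                nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.BatchNorm2d(base * 4), nn.LeakyReLU(0.2),
                nn.Conv2d(base * 4, 1, 4, 1, 1),  # one logit per patch; no global pooling
            )

        def forward(self, x, y):
            # Condition on the input x by concatenating it with the real or generated output y.
            return self.net(torch.cat([x, y], dim=1))

  For a 256x256 input pair this produces a 31x31 grid of patch logits; the BCEWithLogitsLoss in the loss sketch above averages over the grid, so every local patch is pushed toward real or fake.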

Experiment:

  1. The writer applies the method to a variety of tasks and datasets: 1> Semantic labels → photo, trained on the Cityscapes dataset. 2> Architectural labels → photo, trained on the CMP Facades dataset. 3> Map ↔ aerial photo, trained on data scraped from Google Maps. 4> BW → color photos. 5> Edges → photo. 6> Sketch → photo. 7> Day → night.
  2. The writer compares the objective functions L1, GAN, cGAN, GAN+L1 and cGAN+L1, giving quantitative and qualitative results in terms of per-pixel accuracy and per-class accuracy. Besides, the writer compares the colorfulness of each method's outputs.
  3. The writer compares generator architectures (U-Net vs. plain encoder-decoder). It turns out that the U-Net produces results with clearer edges.
  4. The writer compares discriminator patch sizes, both qualitatively and quantitatively, ranging from a PixelGAN through PatchGANs up to a full ImageGAN.
