Author: Yufei Wang (@ywang2395)

Links

For more information about the links provided here, see the Work Distribution and Code Source Declaration sections at the end of the report.

Introduction

Applying filters to a certain kind of recurring object is a common need in visual content creation. For instance, the image on the left in the figure below applies a color splash filter to the balloons, turning everything except the balloons into grayscale; the image on the right is clipped from a TV show, in which a passerby is blurred to protect his privacy. In both situations, visual content creators need to identify recurring objects across a large number of images and apply filters to those specific objects.

However, identifying recurring objects in a large number of images is tedious. If an algorithm could automatically identify the objects and apply image processing methods to them, visual content creators would be freed from this repetitive work and could focus on the creative process. The problem can be roughly divided into three stages: detect and classify the object, find its segmentation mask, and apply the image processing to the specified parts. The figure below illustrates the process.

Object Detection

The object detection problem is to determine where an object is located in a given image and which category it belongs to. In other words, it consists of two parts: object localization and object classification. The location of an object is usually represented by its bounding box, a rectangular boundary around the object in the image. Many deep learning frameworks have been developed for object detection. For instance, YOLO is a real-time object detection algorithm that can locate and classify all objects in an image in a single pass. Other frameworks such as Mask R-CNN can not only detect objects but also segment them by producing a pixel-level mask for each instance. For our problem we need both the object class and its segmentation mask, so we choose Mask R-CNN as the backbone for object detection and mask generation. A detailed explanation of Mask R-CNN can be found in the next section.

Image Processing

After the objects in an image are found, we can apply image processing to the desired areas by referring to the objects' segmentation masks. Some ideas come to mind immediately, such as blurring the detected parts - a common need when processing videos, for instance when all human faces or license plates in a video should be blurred. Other image processing methods can also be applied, such as the color splash filter mentioned above, pixelation, or even the style transfer proposed by Gatys et al. Applying these methods may seem easy, but it can also be challenging: we only want to affect the desired parts of the image, and the processed parts should remain consistent with the rest of the image. I will describe this part in more detail in a later section.

Method

Mask R-CNN

Mask R-CNN is a framework for object instance segmentation. It extends Faster R-CNN by adding a branch for predicting an object mask, in parallel with the existing branch for bounding box prediction. The mask branch can be trained easily for instance segmentation, and the framework can also be used for other tasks such as human pose estimation. The structure of Mask R-CNN is illustrated below.

Mask R-CNN roughly consists of three parts: the Region Proposal Network (RPN), which proposes candidate object regions; the bounding box regression and classification branch, which refines each proposal's box and predicts its class; and the mask generator branch, which predicts a pixel-level binary mask for each detected instance.
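
To make the inputs and outputs concrete, below is a minimal inference sketch using torchvision's pre-trained Mask R-CNN. This is only an illustration of the kind of output the model produces, not the project's own code; the file name and the score threshold are placeholders.

```python
# Minimal Mask R-CNN inference sketch (torchvision); illustrative only.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("input.jpg").convert("RGB"))  # placeholder file name
with torch.no_grad():
    output = model([image])[0]

# Each detection has a bounding box, a class label, a confidence score, and a
# soft per-pixel mask that can be thresholded into a binary segmentation mask.
keep = output["scores"] > 0.7           # placeholder confidence threshold
boxes = output["boxes"][keep]           # [N, 4] boxes
labels = output["labels"][keep]         # [N] class indices
masks = output["masks"][keep, 0] > 0.5  # [N, H, W] boolean masks
```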

Fine Tuning

Pre-trained Mask R-CNN models can easily be found on the Internet, and they are usually trained on a relatively large dataset such as COCO. However, directly applying such a pre-trained model to user-specified images can lead to problems. For instance, the Mask R-CNN model pre-trained on the COCO dataset regards the UW-Madison logo as a stop sign, as illustrated below.

In general, directly using a pre-trained model can cause several problems: it detects categories the user does not care about, it may misclassify user-specific objects (as in the logo example above), and its masks may be less accurate on the classes the user actually needs.

To solve these problems, the pre-trained model needs to be fine-tuned on the user-specified dataset. When fine-tuning, the whole model is initialized with the weights of the pre-trained model, but the final classification layer is removed and a new classification layer with the specified number of neurons (corresponding to the number of classes in the user-specified dataset) is attached. The new model is then trained further on the new dataset. With the knowledge already learned by the pre-trained model, the new model converges faster on the new dataset.
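
As a sketch of the head replacement described above (assuming a torchvision Mask R-CNN; the actual project may use a different framework), the classification and mask heads can be swapped for new ones sized for the user-specified classes:

```python
# Replace the classification and mask heads for fine-tuning; a sketch only.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + 1 user-specified class (e.g. "person")

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# New box classification / regression head sized for the new dataset.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# New mask prediction head sized for the new dataset.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
```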

To fine-tune Mask R-CNN, I followed a cascaded process: first, only the RPN, the bounding box regression and classification branch, and the mask generator branch are tuned, while all variables in the backbone network are frozen. Next, some layers in the backbone network are unfrozen and tuned as well. This staged process provides better performance in practice.
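
A rough sketch of this staged schedule, continuing from the snippet above, might look like the following; `train_one_epoch` and `data_loader` are hypothetical stand-ins for an ordinary training loop and the user-specified dataset.

```python
import torch

# Stage 1: freeze the backbone, train only the RPN and the two heads.
for p in model.backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)
for epoch in range(10):
    train_one_epoch(model, optimizer, data_loader)  # hypothetical helper

# Stage 2: unfreeze the backbone (or part of it) and continue at a lower rate.
for p in model.backbone.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for epoch in range(10):
    train_one_epoch(model, optimizer, data_loader)
```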

Image Processing Methods

I apply four different kinds of image processing methods: color splash, blurring, pixelation and style transfer. Color splash, blurring and pixelation can be regarded as traditional methods, while style transfer is a neural-network-based technique. I will introduce them in the following sections.

Color Splash, Pixelation and Blurring

Examples of these three effects are illustrated below. Color splash is an image filter that keeps only the specified object in color and turns the rest of the image to grayscale; pixelation pixelates the object; and blurring blurs it.

Implementing the color splash filter is straightforward: the whole image is first converted to grayscale, and then merged with the original image according to the object segmentation mask. Pixelation and blurring follow the same idea - apply the image processing method to the whole image, then merge the processed image with the original image based on the object segmentation mask. For pixelation, I simply slide a rectangular window over the image and replace all pixels in the window with their mean value. For blurring, I apply a median filter to the original image to obtain the blurred image.
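
A minimal sketch of these three filters is given below, assuming an 8-bit BGR image `img` (as loaded by OpenCV) and a boolean object mask `mask` of the same height and width; the block and kernel sizes are placeholders.

```python
import cv2
import numpy as np

def color_splash(img, mask):
    # Grayscale everywhere, original colors kept only inside the mask.
    gray = cv2.cvtColor(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), cv2.COLOR_GRAY2BGR)
    return np.where(mask[..., None], img, gray)

def pixelate(img, mask, block=16):
    # Downscale then upscale with nearest-neighbor interpolation, which
    # replaces each block with (roughly) its mean value.
    h, w = img.shape[:2]
    small = cv2.resize(img, (max(1, w // block), max(1, h // block)),
                       interpolation=cv2.INTER_AREA)
    pix = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
    return np.where(mask[..., None], pix, img)

def blur(img, mask, ksize=21):
    # Median-filter the whole image, then keep it only inside the mask.
    blurred = cv2.medianBlur(img, ksize)
    return np.where(mask[..., None], blurred, img)
```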

Style Transfer

Style transfer is a technique proposed by Gatys et al. in 2015. In the paper A Neural Algorithm of Artistic Style, they proposed new metrics for the “style distance” and the “content distance” between the original image and the target image. An example of image style transfer is illustrated below.

On the left is the original image, and in the middle are two style images. If we regard the style distance and the content distance as losses, and optimize for a new image that minimizes the style distance to the style image and the content distance to the content image, we obtain the images on the right. The new image keeps the content of the original image, while its style resembles the style image.

How are the content loss and the style loss defined? The content loss is straightforward: it is the squared L2 distance between two feature maps, which can be obtained from a pre-trained deep neural network such as VGG. The style loss is a little more involved: they first compute the correlations between the different feature maps within the same layer and collect them into a Gram matrix G; they then take the squared L2 distance between the Gram matrix of the style image and the Gram matrix of the generated image, normalized so that it stays on the same scale as the content loss.
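
For reference, the definitions from the Gatys et al. paper can be written as

$$\mathcal{L}_{content} = \frac{1}{2}\sum_{i,j}\left(F^{l}_{ij} - P^{l}_{ij}\right)^{2}, \qquad G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}, \qquad E_{l} = \frac{1}{4 N_l^{2} M_l^{2}} \sum_{i,j}\left(G^{l}_{ij} - A^{l}_{ij}\right)^{2},$$

where $F^{l}$ and $P^{l}$ are the layer-$l$ feature maps of the generated image and the content image, $A^{l}$ is the Gram matrix of the style image, and $N_l$ and $M_l$ are the number of feature maps in layer $l$ and their spatial size.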

To generate the final image, the original paper starts from a random noise image and optimizes the content loss and the style loss to obtain the result. However, this process is slow and may take several seconds per image, which is not acceptable when processing a large number of images or a video.

To solve this problem, I follow Perceptual Losses for Real-Time Style Transfer and Super-Resolution to build a fast neural style transfer network. The network consists of several convolution and transposed convolution layers; it is trained on the COCO dataset with the objective of minimizing the style loss between every training image and the target style image while preserving the content of the image. After training, the network can restyle any input image to the target style in under a second.
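
A condensed sketch of this kind of training setup is shown below, using VGG-16 features and Gram matrices for the perceptual losses. `TransformNet`, `style_image`, `data_loader` and the loss weight are placeholders, and details such as input normalization are omitted; this is not the project's exact training code.

```python
import torch
import torch.nn.functional as F
import torchvision

vgg = torchvision.models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

def features(x, layers=(3, 8, 15, 22)):
    # Collect relu1_2, relu2_2, relu3_3 and relu4_3 activations.
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            feats.append(x)
    return feats

def gram(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

transform_net = TransformNet()   # hypothetical: the conv / transpose-conv network
optimizer = torch.optim.Adam(transform_net.parameters(), lr=1e-3)
style_grams = [gram(f) for f in features(style_image)]  # target style image

for content in data_loader:      # batches of COCO images
    styled = transform_net(content)
    f_styled, f_content = features(styled), features(content)
    content_loss = F.mse_loss(f_styled[1], f_content[1])   # relu2_2 content match
    style_loss = 0.0
    for fs, sg in zip(f_styled, style_grams):
        g = gram(fs)
        style_loss = style_loss + F.mse_loss(g, sg.expand_as(g))
    loss = content_loss + 1e5 * style_loss                  # placeholder weight
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```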

Image Mixing

As described above, an easy way to apply an image processing method to specific objects in an image is to first apply the method to the whole image and then use the object segmentation mask to mix the two images. However, directly using the binary mask produces hard boundaries along the edges of the mask, especially when the processed image looks very different from the original. An example is illustrated below.

We encountered a similar situation in hw3. However, since the masks of the two images do not overlap (they are in fact complementary), the method we used in hw3 cannot be applied here. Instead, I use a simple method to smooth the edge: the mask is first converted to a binary image, in which 1 is mapped to 255 and 0 is mapped to 0; a Gaussian blur filter is then applied to the mask, and the blurred mask is used to mix the two images. An example is illustrated below, in which the final mixed image no longer shows hard edges.
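
A sketch of this soft mixing is shown below: the binary mask is blurred with a Gaussian filter and then used as a per-pixel alpha when combining the processed and original images. The kernel size is a placeholder.

```python
import cv2
import numpy as np

def smooth_mix(original, processed, mask, ksize=21):
    """original, processed: uint8 BGR images; mask: boolean object mask."""
    alpha = mask.astype(np.uint8) * 255                       # 0 / 255 binary image
    alpha = cv2.GaussianBlur(alpha, (ksize, ksize), 0) / 255.0
    alpha = alpha[..., None]                                  # broadcast over channels
    mixed = alpha * processed + (1.0 - alpha) * original
    return mixed.astype(np.uint8)
```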

From Images to Video

Visual content creators sometimes need to process videos. For instance, a TV show producer may need to blur all passersby in a video. Since we already have methods to apply filters to images, extending them to videos seems straightforward: a video is just a sequence of images, so we simply apply the object detection and image processing methods to each frame.
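
A minimal frame-by-frame loop with OpenCV might look like the following; `detect_masks` and `apply_filter` are hypothetical stand-ins for the Mask R-CNN inference and the image processing steps above, and the file names are placeholders.

```python
import cv2

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("output.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = detect_masks(frame)             # hypothetical: Mask R-CNN inference
    out.write(apply_filter(frame, mask))   # hypothetical: one of the filters above

cap.release()
out.release()
```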

However, the results produced by Mask R-CNN are not robust, and sometimes the mask generated for one frame is completely different from the one for the previous frame. Here is an example, in which you can see the objects' masks flickering in the video.

To solve this problem, I created a pixel-level stabilizer for stabilizing the masks in a video. An example of the working process of the stabilizer is illustrated below.

The stabilizer works by majority voting. In the example at the top, even though the first pixel is true in frame 0, its adjacent frames do not support it; therefore, in the stabilized result, that true pixel is discarded. In the example at the bottom, the first pixel is true in frame 0 and its adjacent frames also have the corresponding pixel set; therefore, in the final result, the first pixel remains true.
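
A sketch of this pixel-level majority vote is given below; the window radius is a placeholder, and the masks are assumed to be collected for the whole clip before voting.

```python
import numpy as np

def stabilize_masks(masks, radius=1):
    """masks: boolean array of shape [T, H, W]; radius: frames on each side."""
    masks = np.asarray(masks, dtype=bool)
    T = masks.shape[0]
    stabilized = np.empty_like(masks)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        window = masks[lo:hi]
        # A pixel is kept only if it is set in more than half of the window.
        stabilized[t] = window.sum(axis=0) * 2 > window.shape[0]
    return stabilized
```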

The stabilizer is not suitable for fast-changing videos, so it can be disabled in the program to handle that case. In most cases it improves the video results, especially in situations like the one illustrated above.

Results

Model Fine Tuning

For fine-tuning the model, I set up the environment on a Google Cloud machine with an NVIDIA K80 GPU. I still use the COCO dataset for training, but this time I only use images that contain the person class and only train the model on that class. Since I chose to fine-tune some layers of the ResNet-101 backbone as well, training takes around 50 hours and cost me about $30. The fine-tuned model drops the unnecessary categories and does improve mask accuracy on the person class. An example is illustrated below.

On the left is the result from the pre-trained model, in which the girl's hand is not included in the mask. On the right is the result from the fine-tuned model, in which the mask is much more accurate. The fine-tuned model is provided in the links section.

Final Result

I applied the model trained only on the person class, with the video stabilizer and the smooth image mixer enabled, to generate sample results for each of the image processing methods mentioned above. The original video is shown below; it is clipped from the video on grad.wisc.edu/apply.

The result of applying the color splash filter is illustrated below.

The result of applying the pixelation filter is illustrated below.

The result of applying the blurring filter is illustrated below.

The result of applying the style transfer filter is illustrated below. The style image used for training the network can be obtained at wiki.com, and the filter is applied only to the background.

Problems, Analysis and Future Work

The results above look good. However, there are still several problems that could be improved:

And there is some future work that could be done:

I will likely dig into some of these ideas when I have free time, as this is a really interesting project and I plan to keep working on it.

Thanks again for reading my report!

Code Source Declaration

The remaining parts of the code are implemented by me, including the fast style transfer code. I also referred to some Stack Overflow posts while working on my implementation.

Work Distribution

At the beginning of the semester, I was working with Emma Liu (@liu763). The Problem Statement and Why is the Problem Important sections in the proposal were written by her, and she helped me fix some grammar mistakes in the mid-term report. The rest of the project, including the code, all the results, the final presentation slides and video, the mid-term report and the webpage, was done by me independently.

We split our work at the beginning of the semester: I would work on the Mask R-CNN part and she would work on YOLO. With the pandemic moving everything online, it was not easy to sync up with each other. She will likely include more information about YOLO, which addresses the first problem mentioned above. Please refer to her page for more information.

References