In this part, we reimplement the neural style transfer method described in this paper: given a content image and a style image, we output a blended image painted in the style of the style image while preserving the content of the content image.


We use the same architecture and method as proposed in the paper. The VGG19 network is used to extract feature information from the images. The style representations come from layers Conv1_1, Conv2_1, Conv3_1, Conv4_1, and Conv5_1, while the content representation comes from layer Conv4_2 of the original VGG network.

Model Architecture (VGG19)

  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace=True)
  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): ReLU(inplace=True)
  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): ReLU(inplace=True)
  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU(inplace=True)
  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): ReLU(inplace=True)
  (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (17): ReLU(inplace=True)
  (18): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (20): ReLU(inplace=True)
  (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (22): ReLU(inplace=True)
  (23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (24): ReLU(inplace=True)
  (25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (26): ReLU(inplace=True)
  (27): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (29): ReLU(inplace=True)
  (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (31): ReLU(inplace=True)
  (32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (33): ReLU(inplace=True)
  (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (35): ReLU(inplace=True)
  (36): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)


Training iterations: 5000

Optimizer: Adam with a learning rate of 0.003

Content loss weight (alpha) = 1

Style loss weight (beta) = 1e6
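As a sketch of how these weights enter the optimization, below is a minimal NumPy version of the Gram-matrix style loss and the weighted total loss. The layer names and dictionary layout are illustrative assumptions; the actual implementation computes the same quantities on PyTorch tensors so gradients can flow back to the generated image.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: correlations between channels."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (h * w)

def style_content_loss(gen_feats, content_feat, style_grams, alpha=1.0, beta=1e6):
    """Weighted sum of content loss (conv4_2) and style loss (conv*_1 layers).
    gen_feats: dict layer -> (C, H, W) features of the generated image.
    style_grams: dict layer -> target Gram matrix from the style image."""
    content_loss = np.mean((gen_feats['conv4_2'] - content_feat) ** 2)
    style_loss = 0.0
    for layer, target_gram in style_grams.items():
        style_loss += np.mean((gram_matrix(gen_feats[layer]) - target_gram) ** 2)
    return alpha * content_loss + beta * style_loss
```

In the real PyTorch version, this scalar is what Adam minimizes with respect to the pixels of the generated image.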

Show Result

In this section, we choose several different content and style images and present their style transfer results based on our trained model. The first two are fantastic style transfer examples, but the last pair doesn't seem to work.

We suspect there are two reasons why it fails. First, the content image and style image share a highly correlated style. For some background, both images in the last pair belong to the abstract art genre; their styles are rather complicated and abstract even for a human to describe, not to mention a neural algorithm. Second, their content is both meaningless and confusing. Humans recognize an image by identifying its content, so if the content is unrecognizable in the first place, it will certainly be even harder to preserve and measure in the transferred result.

Successful style transfer png


The failed style transfer


Transfer Neckarfront to different styles

Here, we transfer the Neckarfront to 6 styles as illustrated in the paper. The Neckarfront content is still clearly visible in each final result, with some style carried over from the style image. However, the style transformation is not as dramatic as shown in the paper. One reason could be the short training schedule in our implementation; if we trained the model longer, the style of the final result would be closer to the style image.








Seam Carving for Content-Aware Image Resizing discusses shrinking an image vertically or horizontally to a given dimension automatically while keeping the "important" content of the image.

In general, the algorithm works as follows:

  1. Compute the energy value for every pixel
  2. Find a vertical or horizontal path of pixels with the least energy
  3. Delete all pixels in that path and reshape the image
  4. Repeat 1-3 until the desired number of rows and columns is reached

Energy Function

The most important part of the algorithm is the energy function. The original paper proposed several energy functions; we used the most basic one: the sum of partial derivatives. Specifically, for each pixel in each channel, we compute the partial derivative along the x-axis and the partial derivative along the y-axis, then sum their absolute values. That's it! Mathematically, it can be described as

\(E(i) = |\frac{d}{dx}i| + |\frac{d}{dy}i|\)

where i is the pixel and E(i) is the energy value for that pixel.
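A minimal NumPy sketch of this energy function, using forward differences (the helper name is ours, not from the project code):

```python
import numpy as np

def energy_map(img):
    """E = sum over channels of |dI/dx| + |dI/dy|, via forward differences."""
    img = np.asarray(img, dtype=float)
    dx = np.abs(np.diff(img, axis=1, append=img[:, -1:]))   # horizontal derivative
    dy = np.abs(np.diff(img, axis=0, append=img[-1:, :]))   # vertical derivative
    return (dx + dy).sum(axis=2)                            # sum over color channels
```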

Let's examine the energy map for the white bear image.

Finding Path

We use dynamic programming to repeatedly remove the least important seam until the image reaches the desired dimensions. We store the current minimum cumulative energy in a matrix M, which has the same shape as the image energy map. Then, finding the minimum value in the last row of M locates the end of the path that needs to be deleted, and backtracking recovers the full seam. We repeat this step until the desired size is reached.
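The dynamic program above can be sketched as follows (a toy implementation over a precomputed 2D energy map; not the exact project code):

```python
import numpy as np

def find_vertical_seam(energy):
    """M[i, j] = energy[i, j] + min of the three neighbors in the row above."""
    h, w = energy.shape
    M = energy.astype(float).copy()
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 2, w)
            M[i, j] += M[i - 1, lo:hi].min()
    # Backtrack from the minimum entry in the last row.
    seam = [int(np.argmin(M[-1]))]
    for i in range(h - 2, -1, -1):
        j = seam[-1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam.append(lo + int(np.argmin(M[i, lo:hi])))
    return seam[::-1]   # seam[i] = column to delete in row i
```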


Note: pictures enlarged to show details.

Here are successful results for both horizontal and vertical carving:

Failure case:

Sad to see the Campanile distorted ;(((

Bells & Whistles: Stretch images using seam insertion

Seam insertion is the inverse of seam carving, so the idea is very similar. We first make a copy of the original image and perform seam carving down to the desired size, recording the coordinates of every removed seam. Then, we insert new seams into the target image in the same order. Each inserted artificial seam is computed as the average of its left and right neighbors.

What we learned

We learned that seam carving does not work well with images that have a strong pattern, such as the distorted Campanile. Determining the importance of pixels with an energy function is also fascinating to us because of its simplicity and intuitiveness.



As this paper by Ng et al. demonstrated, capturing multiple images over a plane orthogonal to the optical axis enables achieving complex effects using very simple operations like shifting and averaging. The goal of this project is to reproduce some of these effects using real lightfield data.

In this project, we took some sample datasets from the Stanford Light Field Archive, where each dataset comprises 289 views on a 17x17 grid.

1) Depth Refocusing

Objects that are far away from the camera do not change position significantly when the camera moves around while keeping the optical axis direction unchanged. Nearby objects, on the other hand, change position significantly across images. Averaging all the images in the grid without any shifting produces an image that is sharp around the far-away objects but blurry around the nearby ones. Similarly, shifting the images appropriately before averaging allows one to focus on objects at different depths.

In this part of the project, we implement this idea to generate multiple images which focus at different depths. To get the best effects, we use all the grid images for averaging.
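A sketch of the shift-and-average operation, assuming each view comes with its (u, v) grid coordinates; integer np.roll shifts stand in for proper sub-pixel shifts:

```python
import numpy as np

def refocus(images, coords, depth):
    """Shift each sub-aperture view toward the grid center by an amount
    proportional to `depth`, then average. depth=0 reproduces plain averaging,
    which focuses on far-away objects."""
    center = np.mean(coords, axis=0)
    acc = np.zeros_like(np.asarray(images[0], dtype=float))
    for img, (u, v) in zip(images, coords):
        du = int(round(depth * (center[0] - u)))
        dv = int(round(depth * (center[1] - v)))
        acc += np.roll(np.asarray(img, dtype=float), (du, dv), axis=(0, 1))
    return acc / len(images)
```

Sweeping `depth` over a range of values produces the focal stack shown below.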

Below are averages of chess shifted to five different depths:

Or in gif format:

One more example:

2) Aperture Adjustment

Averaging a large number of images sampled over the grid perpendicular to the optical axis mimics a camera with a much larger aperture. Using fewer images results in an image that mimics a smaller aperture. In part 2, we generate averages of images filtered by different radii while focusing on the same point, which corresponds to different apertures.
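A sketch of this aperture adjustment, again assuming views are tagged with grid coordinates; the center and radius values here are illustrative:

```python
import numpy as np

def aperture_average(images, coords, center=(8, 8), radius=3):
    """Average only the sub-aperture views within `radius` of the grid center;
    a larger radius mimics a larger aperture (shallower depth of field)."""
    chosen = [img for img, (u, v) in zip(images, coords)
              if (u - center[0]) ** 2 + (v - center[1]) ** 2 <= radius ** 2]
    return sum(np.asarray(img, float) for img in chosen) / len(chosen)
```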

Below are averages of chess filtered by five different radii:

Here are the results:



Recover Homographies

Before we can warp images into alignment, we need to recover the parameters of the transformation between each pair of images. In our case, the transformation is a homography: p' = Hp, where H is a 3x3 matrix with 8 degrees of freedom (the lower right entry is a scaling factor and can be set to 1). To recover the homography, I collected a set of (p', p) pairs of corresponding points from the two images using ginput.

\[H = computeH(im1\_pts,im2\_pts)\]

In order to compute the entries of the matrix H, we need to set up a linear system of equations (i.e. a matrix equation of the form Ah=b, where h is a vector holding the 8 unknown entries of H). Each point correspondence contributes two equations, so the system can be solved with four or more corresponding pairs.
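A least-squares sketch of computeH under these conventions (the exact row layout of A in our code may differ; points are assumed to be (x, y) rows):

```python
import numpy as np

def computeH(im1_pts, im2_pts):
    """Least-squares homography H with p1 ~ H p2, from n >= 4 point pairs.
    Each correspondence contributes two rows to the system A h = b."""
    A, b = [], []
    for (x1, y1), (x2, y2) in zip(im1_pts, im2_pts):
        A.append([x2, y2, 1, 0, 0, 0, -x1 * x2, -x1 * y2])
        A.append([0, 0, 0, x2, y2, 1, -y1 * x2, -y1 * y2])
        b += [x1, y1]
    h = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)[0]
    return np.append(h, 1.0).reshape(3, 3)   # fix the lower-right entry to 1
```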

Warp the Images

Now that we know the parameters of the homography, we can go ahead and warp images using it.

\[imwarped = warpImage(im,H)\]

where im is the input image to be warped and H is the homography.

Image Rectification

Below are the examples for image rectification! The LHS images are original photos of planar surfaces and the RHS images are warped to be frontal-parallel.

Example1: Granola Box

Example2: Rousseau's Reveries

Blend the images into a mosaic

In this part, I warp two images taken at the Grand Canyon so they're registered and create an image mosaic. Instead of having one picture overwrite the other, which would lead to strong edge artifacts, I used linear blending to make it more seamless.


Harris Interest Point Detector

We use the Harris interest point detector, which is based on the change of intensity when shifting a window in all directions, to automatically detect corners in the image. For the left & right images below, I plot the top 2000 detected corners for each.

Adaptive Non-Maximal Suppression

The Harris interest point detector returns a large number of points, many of them clustered together. To improve on this, we use Adaptive Non-Maximal Suppression (ANMS) to reduce the number of points while keeping the remaining points spread out. I keep the top 500 points with the largest suppression radius.
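A sketch of the ANMS radius computation (the robustness constant c_robust = 0.9 is a commonly used value, assumed here for illustration):

```python
import numpy as np

def anms(points, strengths, n_keep=500, c_robust=0.9):
    """For each point, the suppression radius is the squared distance to the
    nearest point that is sufficiently stronger; keep the n_keep largest radii."""
    radii = np.full(len(points), np.inf)   # the global maximum keeps radius = inf
    for i in range(len(points)):
        stronger = strengths * c_robust > strengths[i]
        if stronger.any():
            d = ((points[stronger] - points[i]) ** 2).sum(axis=1)
            radii[i] = d.min()
    return np.argsort(-radii)[:n_keep]     # indices of the best-spread points
```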

Feature Descriptor extraction & Feature Matching

For each point, we extract an 8x8 feature patch from a 40x40 window. We then normalize each patch to zero mean and a standard deviation of 1.

Then, we match points between the two images by computing the squared Euclidean distance between pairs of features (using dist2 from the starter code). For each feature, we compute the ratio of the best to the second-best distance and reject matches above a threshold of 0.2.
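A sketch of this ratio-test matching, with dist2 replaced by an inline squared-distance computation:

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.2):
    """Lowe-style ratio test on squared Euclidean distances: accept a match
    only when the best distance is far below the second best."""
    matches = []
    for i, d in enumerate(desc1):
        dists = ((desc2 - d) ** 2).sum(axis=1)
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        if best / second < ratio:
            matches.append((i, int(order[0])))
    return matches
```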


In this step, we select the points that will later be used to compute the final estimated homography matrix. Intuitively, good points produce similar homographies. So we iterate 10000 times: in each iteration, we choose 4 random feature pairs, compute a homography from them, use it to warp the img1 points to img2, and measure how close they land to the true img2 points under an SSD threshold of 0.5. We keep the set of feature pairs with the largest number of inliers.
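A self-contained sketch of this RANSAC loop, with a minimal 4-point homography fit standing in for our actual computeH (iteration count reduced for illustration):

```python
import numpy as np

def fit_homography(src, dst):
    """Homography from >= 4 correspondences (maps src points onto dst points)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)[0]
    return np.append(h, 1.0).reshape(3, 3)

def ransac_inliers(pts1, pts2, n_iter=10000, thresh=0.5, seed=0):
    """4-point RANSAC: keep the model with the most inliers under SSD thresh."""
    rng = np.random.default_rng(seed)
    best = np.array([], dtype=int)
    for _ in range(n_iter):
        idx = rng.choice(len(pts1), 4, replace=False)
        H = fit_homography(pts1[idx], pts2[idx])
        proj = (H @ np.column_stack([pts1, np.ones(len(pts1))]).T).T
        proj = proj[:, :2] / proj[:, 2:3]          # back from homogeneous coords
        err = ((proj - pts2) ** 2).sum(axis=1)     # SSD against true img2 points
        inliers = np.flatnonzero(err < thresh)
        if len(inliers) > len(best):
            best = inliers
    return best
```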


Result from linear blending below:

Manual result from part A:

Auto result from part B:

Additional: (It does not perform well since the two images only overlap by about 20%)

Tell us what you've learned

I'm amazed at how similar the auto-selected points are to my manually selected ones in part A & how well they both perform.

Part 1: Image Classification

For part 1, we will use the Fashion MNIST dataset available in torchvision.datasets.FashionMNIST to train our model. Fashion MNIST has 10 classes, 60000 train + validation images, and 10000 test images.

Train & Validation Accuracy

My model reaches 91.59% training accuracy and 90.99% validation accuracy over 30 epochs using a 0.001 learning rate and the Adam optimizer with 0.0005 weight decay.

| Final Validation Accuracy | Final Test Accuracy |
| ------------- |:-------------:|
| 91.59 | 90.99 |

Per Class Accuracy

| Classes | Accuracy |
| ------------- |:-------------:|
| T-shirt | 0.836 |
| Trouser | 0.979 |
| Pullover | 0.871 |
| Dress | 0.896 |
| Coat | 0.853 |
| Sandal | 0.975 |
| Shirt | 0.771 |
| Sneaker | 0.968 |
| Bag | 0.980 |
| Ankle Boot | 0.970 |

Which classes were hardest to classify?

The model does not perform well on T-shirt (0.836) and Shirt (0.771). I took a look at the incorrectly predicted images, and there are many cases where a T-shirt is predicted to be a Shirt and a Shirt is predicted to be a T-shirt. This makes sense because they are both tops with sleeves and are visually difficult to tell apart.

Correctly and Incorrectly Classified Images for Each Class

Below I will show 2 images from each class which the network classifies correctly, and 2 more images where it classifies incorrectly.

Ankle Boot
Correct 1 Correct 2 Incorrect 1 Incorrect 2
![](images/correct_img/Ankle BootTrue1actualAnkle Boot.jpg) ![](images/correct_img/Ankle BootTrue2actualAnkle Boot.jpg) ![](images/incorrect_img/Ankle BootFalse1actualSneaker.jpg) ![](images/incorrect_img/Ankle BootFalse2actualSandal.jpg)

Learned filter visualization

Here is the visualization of the first conv layer!

Part 2: Semantic Segmentation

Semantic segmentation refers to labeling each pixel in an image with its correct object class. For this part, we will use the Mini Facade dataset, which consists of images of buildings from different cities around the world in diverse architectural styles (in .jpg format), shown on the left. It also contains semantic segmentation labels (in .png format) in 5 classes: balcony, window, pillar, facade, and others. I will train a network to convert the image on the left into the labels on the right.

Train & Validation Accuracy

My model's test loss decreases to 0.8545 over 30 epochs using a 0.001 learning rate and the Adam optimizer with 0.0005 weight decay.

Model Architecture

It turns out that a simple architecture of stacked conv, ReLU, and max-pool layers with a bit of dropout works best. Please find the detailed structure below.

Average Precision on Test Set

The average precision of all classes is 0.53.

Model’s Performance for My Own Collection

Overall the model does a good job on my collection. It is able to catch most of the pillars and windows. However, it predicts quite a few facade regions as balcony (in the upper part of the original image) and some background as facade.

Original image Segmented by my model


In this assignment, I will produce a “morph” animation of my face into someone else’s face (from FEI Database), compute the mean of Dane faces (using IMM Face Database), and extrapolate from a population mean to create a caricature of myself.

Defining Correspondences

I implemented a GUI to let users label points on face images using ginput according to this order, and I saved the points to disk to save time. Then I created the triangulation of these points using Delaunay from the scipy library.

The images below show the Delaunay triangulation of me and the target face.

Me Target

Computing the “Mid-way Face”

First, I computed the average of the labeled points from the two original faces to get the average shape. Then I warped both faces into the average shape by applying an affine warp to each triangle from the original to the average. Lastly, I averaged the colors of the two warped images simply as $\frac{img1 + img2}{2}$.
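The affine warp for one triangle can be recovered from its three vertex correspondences; a minimal sketch (hypothetical helper, not the exact project code):

```python
import numpy as np

def triangle_affine(tri_src, tri_dst):
    """3x3 affine matrix A with A @ [x, y, 1]^T mapping tri_src vertices onto
    tri_dst vertices. Stack vertices as homogeneous columns: A = dst @ src^-1."""
    src = np.vstack([np.asarray(tri_src, float).T, np.ones(3)])
    dst = np.vstack([np.asarray(tri_dst, float).T, np.ones(3)])
    return dst @ np.linalg.inv(src)
```

In practice the warp is applied inversely: for each pixel inside a destination triangle, map it back with the inverse matrix and sample the source image.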

Below are the original images and the mid-way face I got.

Me Mid-way Target

The Morph Sequence

For this part, I implemented a function `morphed_im = morph(im1, im2, im1_pts, im2_pts, tri, warp_frac, dissolve_frac)` which produces a warp between im1 and im2 using the corresponding points defined in `im1_pts` and `im2_pts` and the triangulation structure `tri`. By iterating `warp_frac` and `dissolve_frac` from 0 to 1, I created a sequence of 50 intermediate faces.

Below is an animated gif of the 50 images of me morphing into the target.

The “Mean face” of a population

For this part, I computed the average male Dane face using the initially released subset of the IMM Face Database (37 images) by morphing each of the faces in the database to the average shape. Below are some examples.

Raw 0 Raw 1 Raw 2 Raw 3 Raw 4
Morphed 0 Morphed 1 Morphed 2 Morphed 3 Morphed 4

Average Face of the Dane Population

The left is my face warped into the average geometry while the right is the average face warped into my geometry.

Me on Dane Avg Dane Avg on Me
Caricatures: Extrapolating from the mean

Below, I produced a caricature of my face by extrapolating from the population mean computed in the last step. We can observe that, from left to right, as `warp_frac` increases, my smile becomes increasingly exaggerated. ;D

Raw Caricature 1 Caricature 2 Caricature 3 Caricature 4

Bells and Whistles 1: Change Gender

I changed my gender by adding the difference between the average Han Chinese man and the average Han Chinese woman (from this post). The resulting image is a bit quirky.

Original Avg Shape
Avg Color Shape + Color

Bells and Whistles 2: Produce a face-morphing music video of the students in the class!

Shout out to Zixian Zang for organizing this!

Part 1: Fun with Filters

Part 1.1: Finite Difference Operator

How gradient magnitude computation works

First, we compute the partial derivatives $dx$ and $dy$ in the x and y directions of the cameraman image by convolving the image with the finite difference operators $D_x$ and $D_y$. Then we combine the two derivatives into the gradient magnitude image using the formula $\sqrt{dx^2 + dy^2}$.
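A minimal sketch of this computation, with forward differences standing in for the convolution:

```python
import numpy as np

def gradient_magnitude(img):
    """Apply D_x = [1, -1] and D_y = [1, -1]^T as finite differences,
    then combine into sqrt(dx^2 + dy^2)."""
    img = np.asarray(img, dtype=float)
    dx = np.zeros_like(img); dy = np.zeros_like(img)
    dx[:, 1:] = img[:, 1:] - img[:, :-1]   # horizontal derivative
    dy[1:, :] = img[1:, :] - img[:-1, :]   # vertical derivative
    return np.sqrt(dx ** 2 + dy ** 2)
```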


Part 1.2: Derivative of Gaussian (DoG) Filter

We noted that the results with just the difference operator were rather noisy. Thus, by convolving with a 2D Gaussian, I created a blurred version of the original image and repeated the same steps as in the previous part.

What differences do you see?

Comparing the gradient magnitude images, I noticed that after the Gaussian there is less noise (for example, fewer dots in the lower half of the image), but the edges are thicker/more blurred, because we are using a low-pass filter and each pixel contains information from its neighbors.

Also, instead of convolving the image with a 2D Gaussian and then taking the partial derivatives (method 1), we can first take the x and y derivatives of the Gaussian kernel and then apply those two derivative-of-Gaussian kernels to the image (method 2). The two methods give the same result.

Verify same result

method 1 method 2
alt alt

Part 1.3: Image Straightening

Human beings have a preference for vertical and horizontal edges in most images, so I coded up an algorithm that rotates the image to maximize the number of vertical and horizontal edges.

First, I proposed a range of angles. For each proposed angle, I rotated the image, used $\arctan(dy/dx)$ to get the gradient orientations of the rotated image, graphed a histogram, and counted the number of horizontal and vertical edges. After iterating over the proposed set of angles, I picked the angle that produces the highest count to straighten the image.


The first four look quite good. However, my algorithm failed on the fifth one. Looking into the reasons, I think humans (at least I) care more about the straightness of the water pipe, since the wall is mainly background, whereas the algorithm accounts for both the water pipe and the lines on the wall.

Part 2: Fun with Frequencies!

Part 2.1: Image “Sharpening”

We can make an image look sharper by adding more high frequencies to it. Specifically, we subtract a blurred version from the original image to get the high frequencies, then add that difference back to the original image (scaled by alpha).
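A sketch of this unsharp-mask operation using a separable Gaussian blur; the sigma and alpha values here are illustrative:

```python
import numpy as np

def unsharp_mask(img, sigma=2.0, alpha=1.0):
    """Sharpened = img + alpha * (img - gaussian_blur(img)).
    The blur applies a 1D Gaussian kernel along rows, then along columns."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    img = np.asarray(img, dtype=float)
    blurred = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='same'), 1, img)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, blurred)
    return img + alpha * (img - blurred)
```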


For evaluation, I also picked a sharp image (taken at Amazon), blurred it, and then tried to sharpen it again. Observing the images below, we may notice that the sharpened image is sharper than the blurred one, but still not as good as the original. After all, blurring takes away information from the original image.

original vs blurred vs sharpened

Extra images



Part 2.2: Hybrid Images

The low-pass filtered part (implemented with a Gaussian) is what we see from far away, and the high-pass part (implemented by subtracting the low-pass from the original) is what we see up close. The hybrid image is created by summing the low-pass and high-pass images.
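A sketch of this construction (the single cutoff sigma is illustrative; in practice the low- and high-pass cutoffs are tuned separately per image pair):

```python
import numpy as np

def hybrid(im_far, im_near, sigma=5.0):
    """Hybrid image: low-pass of one image plus high-pass of the other."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2)); k /= k.sum()
    blur = lambda im: np.apply_along_axis(
        lambda c: np.convolve(c, k, mode='same'), 0,
        np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1,
                            np.asarray(im, float)))
    low = blur(im_far)                                    # visible from far away
    high = np.asarray(im_near, float) - blur(im_near)     # visible up close
    return low + high
```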

Below are a few hybrid images.

nutmeg Derek hybrid 1 (nutmeg of derek)
alt alt alt
big cat little cat hybrid 2 (big cat or little cat)
alt alt alt
eat not eat hybrid 3 (Eating or not)
alt alt alt

failure example


Fourier analysis

The Fourier analysis shows how hybrid images work in the frequency domain.

The big cat passes through a low-pass filter, which removes most of the high frequencies. The little cat passes through a high-pass filter and keeps only high frequencies. The frequency distribution of the hybrid image looks similar to both original images, since low and high frequencies both exist in the hybrid image.

Show the log magnitude of the Fourier transform of the two input images, the filtered images, and the hybrid image.



Part 2.3: Gaussian and Laplacian Stacks

A Gaussian stack is generated by repeatedly convolving with a Gaussian (without downsampling), while a Laplacian stack is generated by subtracting adjacent layers of the Gaussian stack.
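A minimal sketch of building both stacks (kernel size and sigma are illustrative). Keeping the last Gaussian level as a residual means the Laplacian stack sums back to the original image, which is what multiresolution blending relies on:

```python
import numpy as np

def build_stacks(img, levels=5, sigma=2.0):
    """Gaussian stack: repeated blur, no downsampling.
    Laplacian stack: differences of consecutive Gaussian levels + the residual."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    blur = lambda im: np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode='same'), 0,
        np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='same'), 1, im))
    g_stack = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        g_stack.append(blur(g_stack[-1]))
    l_stack = [g_stack[i] - g_stack[i + 1] for i in range(levels - 1)]
    l_stack.append(g_stack[-1])   # residual, so the stack sums to the image
    return g_stack, l_stack
```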

I applied my Gaussian and Laplacian stacks to the Mona Lisa. Please find the results beneath.


Part 2.4: Multiresolution Blending (a.k.a. the oraple!)

For this part, I blended two images seamlessly using multiresolution blending as described in the 1983 paper by Burt and Adelson. An image spline is a smooth seam joining two images together by gently distorting them. Multiresolution blending computes a gentle seam between the two images separately at each band of image frequencies, resulting in a much smoother seam.

Here are a few examples.

image1 image2 oraple
alt alt alt
image1 image2 blended
alt alt alt
image1 image2 blended
alt alt alt

Irregular mask


This assignment was super interesting, especially since I have never been good at Photoshop but realized I can do the same with code. ;D


Before the 20th century, color photography had not yet become widespread - developments in the field were still rudimentary, at best. Sergei Mikhailovich Prokudin-Gorskii (1863-1944), a Russian photographer, was the one responsible for spearheading work in color photography with his implementation of the Three-color principle.

Starting in 1907, he traveled through the Russian Empire, documenting the vast landscapes in color for the first time. Over 10,000 pictures were taken in the form of RGB glass plate negatives, which have luckily survived the years.

An example of an RGB glass plate image can be found here.



In this project, I used image processing techniques to automatically colorize the glass plate images taken by Prokudin-Gorskii. To do this, I extracted the three color channel images, placed them on top of each other, and aligned them so that they form a single RGB color image.


Basic Stacking

A naive implementation is to stack the R, G, and B plates on top of each other without aligning them. Please find below an example output of this naive implementation.


It looks blurry since the R, G, and B channels do not align well with each other.

The easiest way to align the plates is to exhaustively search over a window of possible displacements. I use np.roll to test how different displacements of one channel align with the base channel. Specifically, I used two for loops iterating over displacements in [-20, 20] for x and y. Note that the window should not be too large, for computational efficiency. I used SSD (Sum of Squared Differences) as the metric for how well two images align: for each possible (x, y) pair, I computed the SSD alignment score, and after all iterations I chose the displacement with the lowest SSD score.

I used one trick for better alignment: each channel has black borders, and these borders do not contribute to how well two images align. Thus, I implemented a function that crops the borders off the images, which is called when calculating SSD scores.
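Putting the pieces together, a sketch of the exhaustive search with border cropping (the crop fraction is illustrative):

```python
import numpy as np

def align_exhaustive(channel, base, window=20, border=0.1):
    """Try every displacement in [-window, window]^2 with np.roll, score the
    border-cropped overlap with SSD, and return the best (dy, dx)."""
    h, w = base.shape
    ch, cw = int(h * border), int(w * border)
    crop = lambda im: im[ch:h - ch, cw:w - cw]   # ignore the black borders
    best, best_score = (0, 0), np.inf
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            shifted = np.roll(channel, (dy, dx), axis=(0, 1))
            score = ((crop(shifted) - crop(base)) ** 2).sum()
            if score < best_score:
                best, best_score = (dy, dx), score
    return best
```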

Outputs of exhaustive search can be found below:

Green(2, 5) Red(3, 12) Green(2, -3), Red(2, 3)

Green(3, 3) Red(3, 6)

Image Pyramid Aligment

For a (340, 390) image, it makes sense to just search a small [-20, 20] window to find the optimal displacement vector. However, for larger images, say (3400, 3900), possible displacements are very likely to exceed the [-20, 20] window. It is thus reasonable to use a larger window, but that becomes computationally expensive.

Thus, an improvement is needed. I utilized a coarse-to-fine 7-layer pyramid search algorithm. The rescale factor starts at 1/2^7 and doubles at each level of the recursion on the way back up. The pyramid search ended up taking only ~30 seconds to process a large high-resolution .tif image.
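A sketch of the recursion (this toy version bottoms out at a 32-pixel base case rather than a fixed 7 levels, and refines each doubled estimate inside a small local window):

```python
import numpy as np

def align_pyramid(channel, base, window=4):
    """Coarse-to-fine search: recursively align 2x-downsampled copies, double
    the coarse estimate, then refine it within a small local window using SSD."""
    def ssd(dy, dx):
        return ((np.roll(channel, (dy, dx), axis=(0, 1)) - base) ** 2).sum()
    if min(base.shape) <= 32:                 # coarsest level: start from (0, 0)
        gy, gx = 0, 0
    else:
        cy, cx = align_pyramid(channel[::2, ::2], base[::2, ::2], window)
        gy, gx = 2 * cy, 2 * cx               # upscale the coarse estimate
    candidates = [(gy + dy, gx + dx)
                  for dy in range(-window, window + 1)
                  for dx in range(-window, window + 1)]
    return min(candidates, key=lambda d: ssd(*d))
```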

Outputs of pyramid search can be found below:

Green(16, 59) Red(13, 124) Green(17, 40) Red(23, 89)
Green(24, 49) Red(-669, 163) Green(9, 48) Red(11, 112)
Green(10, 82) Red(13, 179) Green(-5, 40) Red(-32, 108)
Green(26, 51) Red(36, 108) Green(28, 78) Red(36, 176)
Green(14, 53) Red(11, 112) Green(5, 42) Red(31, 87)

I found that emir.tif is not aligned well using pyramid search. Looking into it, I realized the R channel has a different brightness from G and B, which causes the wrong alignment.

Extra Images

Green(-5, 40) Red(-32, 108) Green(8, 35) Red(17, 124)