Spatial Pyramid Matching for Scene Classification

The spatial pyramid matching algorithm for scene classification is an improvement over the conventional Bag of Words (BoW) approach. In this method, a filter bank of standard filters at differing length scales was used to extract pixel-level responses. For example, an m×n 3-channel image run through a filter bank of 20 filters results in 60 filter response values for each of the m×n pixels in that image.

Feature Extraction and Building Bag-of-Words Dictionary
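As a rough illustration of this step (not the exact filter bank used here), a small Gaussian-based filter bank applied per channel with scipy might look like the following sketch.

import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

def extract_filter_responses(img, scales=(1, 2, 4, 8)):
    """Stack per-channel filter responses for a float image of shape (m, n, 3).

    Illustrative sketch: Gaussian, Laplacian-of-Gaussian, and x/y Gaussian
    derivatives at several scales, giving 3 * 4 * len(scales) responses.
    """
    responses = []
    for s in scales:
        for c in range(img.shape[2]):
            channel = img[:, :, c]
            responses.append(gaussian_filter(channel, s))
            responses.append(gaussian_laplace(channel, s))
            responses.append(gaussian_filter(channel, s, order=(0, 1)))  # d/dx
            responses.append(gaussian_filter(channel, s, order=(1, 0)))  # d/dy
    return np.dstack(responses)  # shape (m, n, 3 * 4 * len(scales))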

This process was performed on a large training dataset containing examples of the scenes to be classified. Once all the responses had been collected, a random sub-sampling was performed to reduce the size of the training set. To create the reference bag-of-visual-words dictionary, these subsampled responses from every image were fed into a k-means algorithm, which groups similar filter responses into k clusters. The centers of these clusters form the final “words” of the dictionary.
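A minimal sketch of this dictionary-building step with scikit-learn is shown below; the number of sampled pixels per image (alpha) and the dictionary size (k) are illustrative placeholders, not the values actually used.

import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(response_maps, alpha=200, k=150, seed=0):
    """Cluster randomly sampled filter responses into k visual words.

    response_maps: list of (m, n, d) filter-response stacks, one per training image.
    alpha: number of pixels sampled per image; k: dictionary size (both illustrative).
    """
    rng = np.random.default_rng(seed)
    samples = []
    for resp in response_maps:
        flat = resp.reshape(-1, resp.shape[-1])          # (m*n, d)
        idx = rng.choice(flat.shape[0], size=alpha, replace=False)
        samples.append(flat[idx])
    samples = np.vstack(samples)                          # (alpha * n_images, d)
    kmeans = KMeans(n_clusters=k, n_init=10).fit(samples)
    return kmeans.cluster_centers_                        # (k, d) visual words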

To perform the classification, a word map was created by applying the same filter bank to the test image and replacing each pixel with the nearest visual word from the dictionary. At this point, a histogram of word frequencies for the test image could be generated and compared with the corresponding histograms from the training set, and the class of the nearest training image could be declared the class of the scene. This, however, has the disadvantage that it overlooks the spatial distribution of features in the image.

Word Maps on Test Images

This was countered by using a multi-level pyramid scheme in which the image is divided into cells (sub-images), a histogram is computed for each cell, and the cell histograms from every level of discretization are concatenated into a single feature vector. This has the advantage of also capturing the spatial distribution of features, which the plain BoW histogram discards.

Histogram Computation and Comparison

Finally, this pyramid histogram was compared with the corresponding histograms from the training set, and a similarity score was used to find the closest class to which the test image belonged.
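A compact sketch of the pyramid histogram and of histogram intersection, one common choice for the similarity score, is given below; the layer weighting used here (doubling with each finer level) is an assumption.

import numpy as np

def spm_histogram(wordmap, k, num_layers=3):
    """Concatenate weighted word histograms over a spatial pyramid.

    wordmap: (m, n) array of visual-word indices; k: dictionary size.
    """
    hists = []
    for layer in range(num_layers):                      # 1x1, 2x2, 4x4, ...
        cells = 2 ** layer
        weight = 2.0 ** (layer - num_layers + 1)         # assumed weighting scheme
        for rows in np.array_split(wordmap, cells, axis=0):
            for cell in np.array_split(rows, cells, axis=1):
                h, _ = np.histogram(cell, bins=np.arange(k + 1))
                hists.append(weight * h)
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()                             # L1-normalize

def intersection_similarity(h1, h2):
    """Histogram intersection: higher means more similar."""
    return np.minimum(h1, h2).sum()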


Augmented Reality with Planar Homographies

In this implementation, the objective was to solve for the homography matrix that maps image coordinates in one frame to another, under the assumption that the corresponding world points lie on a common plane. In this particular example, feature points between two images were first identified using the FAST feature detector and the BRIEF descriptor.

Detected Feature Points between Book Covers

Once reliable feature matches had been detected between the images, the corresponding image coordinates were used to set up the system of equations given by:

Homography Constraint Equations (A·h = 0)

where the Ai matrices are assembled from the corresponding image coordinates and the h vector contains the elements of the homography matrix for which we solve. The stacked A matrix built from all correspondences is used to determine the homography via the Singular Value Decomposition (or, equivalently, an eigenvalue decomposition of AᵀA). However, fitting the homography to all correspondences at once produces a poor fit because of outlier matches. Instead, the RANSAC approach was used: four correspondences were picked at random, a homography was fit to them, and after many iterations the homography with the maximum number of inliers among the correspondences was selected.

Homography Matrix Used to Fit A Harry Potter Cover onto Textbook
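A minimal sketch of the DLT solve and the RANSAC loop is given below; the inlier threshold and iteration count are placeholders.

import numpy as np

def compute_homography(x1, x2):
    """DLT: solve A h = 0 for the homography mapping x2 -> x1 (Nx2 point arrays)."""
    rows = []
    for (u1, v1), (u2, v2) in zip(x1, x2):
        rows.append([-u2, -v2, -1, 0, 0, 0, u1 * u2, u1 * v2, u1])
        rows.append([0, 0, 0, -u2, -v2, -1, v1 * u2, v1 * v2, v1])
    A = np.array(rows)
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)          # right singular vector of smallest sigma

def ransac_homography(x1, x2, iters=500, tol=2.0):
    """Pick 4 random correspondences per iteration; keep the H with most inliers."""
    best_h, best_inliers = None, 0
    rng = np.random.default_rng(0)
    x2h = np.hstack([x2, np.ones((len(x2), 1))])         # homogeneous coordinates
    for _ in range(iters):
        idx = rng.choice(len(x1), size=4, replace=False)
        H = compute_homography(x1[idx], x2[idx])
        proj = x2h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = np.sum(np.linalg.norm(proj - x1, axis=1) < tol)
        if inliers > best_inliers:
            best_h, best_inliers = H, inliers
    return best_h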

The next part of the project involved creating an augmented reality video in which a clip from the movie Kung Fu Panda was overlaid onto a moving video of the textbook cover. This was done by computing correspondences between the textbook cover image and each frame of the video, fitting a homography with RANSAC, warping the corresponding Kung Fu Panda frame with that homography, and overlaying it onto the matching frame of the textbook video.

AR Application – Movie Clip Overlaid on Textbook Video
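The per-frame compositing step can be sketched with OpenCV as follows; resizing and cropping the movie frame to the cover's aspect ratio is omitted, and the warp direction of H (overlay to frame) is an assumption of this sketch.

import numpy as np
import cv2

def composite_frame(video_frame, overlay, H):
    """Warp `overlay` into `video_frame` using homography H (overlay -> frame)."""
    h, w = video_frame.shape[:2]
    warped = cv2.warpPerspective(overlay, H, (w, h))
    mask = cv2.warpPerspective(np.ones(overlay.shape[:2], dtype=np.uint8), H, (w, h))
    out = video_frame.copy()
    out[mask > 0] = warped[mask > 0]     # paste the warped clip over the cover
    return out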

Finally, the concept of homographies was used to create a panorama from two sample images using point correspondences and then warping one image onto the other. Here are two images from the 4th floor of the Robotics Institute fused to form a panorama.

A Beautiful Day outside the Robotics Institute

Lucas-Kanade Tracking

The first part of this project involved building a template tracker using the Lucas-Kanade algorithm. The problem can be formulated as follows: given a rectangular template and a sequence of frames, determine the translational shift of the template from the current frame to the next such that the squared difference of pixel intensities is minimized. The following equation shows the squared-difference objective function that is to be minimized.

Squared Difference Objective Function

Here, the term p denotes the translational shift that takes the template from the first frame to the second while minimizing the objective function. The optimal translation is computed iteratively by linearizing the image intensity of the next frame about the current estimate and solving for the desired update in p.

Linearizing Image Intensity about Template Location
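One way to sketch this update step in Python is with spline interpolation of the warped patch, as below; RectBivariateSpline is one reasonable choice here, and the sub-pixel grid spacing and convergence threshold are illustrative.

import numpy as np
from scipy.interpolate import RectBivariateSpline

def lucas_kanade_translation(It, It1, rect, tol=1e-2, max_iters=100):
    """Estimate translation p = (dx, dy) of template `rect` from frame It to It1."""
    x1, y1, x2, y2 = rect
    spline_t = RectBivariateSpline(np.arange(It.shape[0]), np.arange(It.shape[1]), It)
    spline_t1 = RectBivariateSpline(np.arange(It1.shape[0]), np.arange(It1.shape[1]), It1)
    ys, xs = np.mgrid[y1:y2:0.5, x1:x2:0.5]              # sub-pixel template grid
    template = spline_t.ev(ys, xs)
    p = np.zeros(2)
    for _ in range(max_iters):
        warped = spline_t1.ev(ys + p[1], xs + p[0])
        error = (template - warped).ravel()
        gx = spline_t1.ev(ys + p[1], xs + p[0], dy=1).ravel()   # d/dx (column direction)
        gy = spline_t1.ev(ys + p[1], xs + p[0], dx=1).ravel()   # d/dy (row direction)
        A = np.stack([gx, gy], axis=1)                   # warp Jacobian is identity here
        dp = np.linalg.lstsq(A, error, rcond=None)[0]
        p += dp
        if np.linalg.norm(dp) < tol:
            break
    return p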

The update to p is computed repeatedly until its L2 norm falls below a minimum threshold. The following gallery shows the naive Lucas-Kanade tracking on two frame sequences.

In this naive implementation, the reference template is updated at every frame, which can lead to drift over long sequences. An alternative approach is to use a template update strategy wherein the template is updated only when a certain condition is met.

Under this condition, the template is replaced with the one from the new frame only if the estimated translational shift between frames falls within a specified threshold; otherwise, the reference template from the previous pair of frames is retained for the next pair. The performance of the updated algorithm is depicted in the following gallery. The blue window represents the result of the naive algorithm, and the red window shows the result after template correction.

In the second part of this project, an affine motion subtraction algorithm was used to determine the dominant motion between frames. Unlike the first implementation, where the motion between frames was assumed to be purely translational, here it was assumed that the entire motion from one frame to the next could be captured by an affine warp of one frame onto the next. This can be expressed mathematically as:

Affine Warp applied to Image Coordinates (x,y) where the Affine Warp is Parametrized by p1 to p6

The above affine transformation can be substituted into the original squared-difference objective function, linearized, and assembled in vectorized form to iteratively solve for the incremental change in the parameter vector p (p1 to p6 parametrizing the affine warp). Regions of dominant motion were determined by subtracting the warped frame from the subsequent frame and identifying pixel regions whose difference magnitude exceeds a certain threshold. The results of this motion detection algorithm on two sequences are shown in the following gallery. In the first, aerially shot sequence, the overhead camera is itself in motion, which causes some artifacts: regions of the scene other than the cars also register significant motion.
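Once the affine parameters have been estimated, the subtraction step itself is simple. A sketch using scipy is shown below; the threshold and the morphological clean-up are illustrative, and the 2x3 warp matrix M is assumed to be in the output-to-input convention expected by affine_transform (its inverse may be needed depending on how the warp was estimated).

import numpy as np
from scipy.ndimage import affine_transform, binary_dilation, binary_erosion

def dominant_motion_mask(frame1, frame2, M, threshold=0.1):
    """Warp frame1 by the 2x3 affine warp M and threshold the frame difference.

    Assumes grayscale float frames and M mapping frame2 coordinates to frame1
    coordinates (the convention used by scipy.ndimage.affine_transform).
    """
    warped = affine_transform(frame1, M[:, :2], offset=M[:, 2])
    diff = np.abs(frame2 - warped)
    mask = diff > threshold
    mask &= warped > 0                                   # ignore out-of-frame pixels
    mask = binary_dilation(binary_erosion(mask))         # suppress speckle noise
    return mask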

The final part of this project involved implementing the affine motion subtraction problem using the inverse compositional technique. In this method, the Jacobian and Hessian of the template are pre-computed by linearizing the template intensities. The incrementally computed warp is then inverted and composed with the current warp to determine the forward warp between the frames. The advantage is that the Jacobian and Hessian are computed only once and reused in every iteration, saving computation time. In this particular implementation, an average time savings of 55% was observed between the two algorithms. The final results are identical to the animations in the previous gallery.


3D Reconstruction

In this implementation, the goal was to estimate the fundamental and essential matrices for a camera given a pair of stereo images, and to perform 3D reconstruction of the matching points. The following image shows the sample stereo pair from the Middlebury multi-view dataset.

Temple Image Pair from the Middlebury Multi-view Dataset

In the first part of this assignment, a set of pre-defined feature correspondences was provided between the two images. The 8-point algorithm was used to estimate the fundamental matrix from an over-determined system of equations. The computed fundamental matrix was then used to visualize the epipolar lines corresponding to the matching points in each image, as shown below.

Selected Points in the Left Image and Corresponding Epipolar Lines in the Right Image
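A minimal sketch of the normalized 8-point algorithm is given below; normalizing the coordinates by the image size M and enforcing the rank-2 constraint are standard steps, while any further refinement of F is omitted.

import numpy as np

def eight_point(pts1, pts2, M):
    """Estimate F from N >= 8 correspondences (Nx2 arrays), scaled by image size M."""
    T = np.diag([1.0 / M, 1.0 / M, 1.0])                 # normalization transform
    p1, p2 = pts1 / M, pts2 / M
    A = np.column_stack([
        p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
        p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
        p1[:, 0], p1[:, 1], np.ones(len(p1)),
    ])
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)
    u, s, vt = np.linalg.svd(F)                          # enforce rank 2
    F = u @ np.diag([s[0], s[1], 0.0]) @ vt
    return T.T @ F @ T                                   # un-normalize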

Using the estimated fundamental matrix and the known camera intrinsics of both images, the essential matrix was then computed and decomposed to obtain the camera extrinsic matrices. Solving this system with SVD leads to four possible solutions for the rotation matrix. To disambiguate between them, the two rotation matrices with determinant equal to +1 were retained. Between the remaining two solutions, the rotation that, when paired with the positive or negative translation, placed the largest number of triangulated points in front of both cameras was selected.

With the estimated extrinsic matrices and the provided camera intrinsics, a system of equations was set up from the known image coordinates of the corresponding points. This system was solved in a least-squares sense, since we require the world coordinate that reprojects onto the image planes as close to the original 2D points as possible.
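A sketch of linear triangulation for a single correspondence is shown below, where C1 and C2 are the 3×4 camera matrices; the homogeneous solution is the right singular vector associated with the smallest singular value.

import numpy as np

def triangulate_point(C1, pt1, C2, pt2):
    """Linear (DLT) triangulation of one correspondence; returns a 3-vector."""
    x1, y1 = pt1
    x2, y2 = pt2
    A = np.stack([
        x1 * C1[2] - C1[0],
        y1 * C1[2] - C1[1],
        x2 * C2[2] - C2[0],
        y2 * C2[2] - C2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                                  # de-homogenize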

In the following portion of the assignment, the actual 3D reconstruction was implemented. Given a selection of pixel coordinates in the first image, an SSD metric was used to compare small windows in the second image along the epipolar line corresponding to each selected point. The selected and identified points are shown below.

Selected Points from the Left Image and the Corresponding Points on the Epipolar Line

Once the correspondences were determined, the fundamental matrix and then the essential matrix were solved for, after which the camera matrices were estimated. Finally, the world coordinates corresponding to the matches were estimated and plotted, as shown in the next image.

Reconstructed World Points Shown in Different Views

Finally, a bundle adjustment algorithm was used to re-estimate the camera matrices and world coordinates by jointly minimizing the reprojection error.
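One common way to set this up is a nonlinear least-squares solve over the second camera's pose and the 3D points. The sketch below uses scipy; holding the first camera fixed and parametrizing the rotation as a Rodrigues vector are assumptions of this sketch, not necessarily the exact formulation used.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, K1, K2, pts1, pts2):
    """Residuals for bundle adjustment: params packs [r2 (3), t2 (3), X_1 ... X_N]."""
    r2, t2 = params[:3], params[3:6]
    P = params[6:].reshape(-1, 3)
    C1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera held fixed
    R2 = Rotation.from_rotvec(r2).as_matrix()
    C2 = K2 @ np.hstack([R2, t2.reshape(3, 1)])
    Ph = np.hstack([P, np.ones((len(P), 1))])
    proj1 = (C1 @ Ph.T).T
    proj2 = (C2 @ Ph.T).T
    proj1 = proj1[:, :2] / proj1[:, 2:3]
    proj2 = proj2[:, :2] / proj2[:, 2:3]
    return np.concatenate([(proj1 - pts1).ravel(), (proj2 - pts2).ravel()])

# Example call (x0 packs an initial pose and the triangulated points):
# result = least_squares(reprojection_residuals, x0, args=(K1, K2, pts1, pts2))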


Neural Networks for Recognition

This assignment involved creating a single-hidden-layer neural network for handwritten character recognition. The network was designed to classify characters belonging to 36 classes (26 letters and 10 digits). The dataset used for training, validation and testing was the NIST-36 dataset. Some sample handwritten characters from the dataset are shown below.

The hidden layer was designed with 64 hidden units. Each 32×32 image is unrolled into a 1024-element vector, so the two weight matrices were of size 1024×64 and 64×36 respectively. A sigmoid activation, which works well for probabilistic outputs, was used for the hidden layer, and a softmax activation was used for the final layer to obtain the 36-element class-probability vector. The weight matrices were initialized with Xavier initialization.
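A numpy sketch of the initialization and forward pass is given below; the variable names and the uniform form of the Xavier initialization are illustrative.

import numpy as np

def xavier_init(in_size, out_size, rng=np.random.default_rng(0)):
    """Uniform Xavier initialization for an (in_size, out_size) weight matrix plus zero bias."""
    bound = np.sqrt(6.0 / (in_size + out_size))
    return rng.uniform(-bound, bound, size=(in_size, out_size)), np.zeros(out_size)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))         # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """X: (batch, 1024) unrolled images -> (batch, 36) class probabilities."""
    hidden = sigmoid(X @ W1 + b1)                        # (batch, 64)
    return softmax(hidden @ W2 + b2)                     # (batch, 36)

W1, b1 = xavier_init(1024, 64)
W2, b2 = xavier_init(64, 36)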

The total dataset consisted of 10,800 training images, 3,600 validation images and 1,800 test images. Mini-batch gradient descent was used with a batch size of 5 images, and training was performed over 50 epochs with a cross-entropy loss function. The accuracy and loss on the training and validation sets are shown below.

An accuracy of around 74% was observed on the validation dataset. The confusion matrix shown below demonstrates some expected trends of the network. The high intensities along the main diagonal of the confusion matrix indicate correct predictions, while the off-diagonal entries indicate incorrectly identified characters. Commonly confused character pairs include “O” and “0”, “1” and “I”, “6” and “G”, “5” and “S”, and “2” and “Z”.

Confusion Matrix

The next part of the assignment involved performing character segmentation on four sample images and then using the trained neural network to read the text. Each character was segmented, cropped, resized to 32×32 and passed through the trained network for identification. The segmentation was implemented using a simple image-processing algorithm that identifies connected groups of pixels in the image and determines a bounding box for each character. The results of the character segmentation are shown below.
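A sketch of such a pipeline with scikit-image is shown below; the Otsu thresholding, morphological closing, minimum-area filter and padding are illustrative choices.

import numpy as np
from skimage.measure import label, regionprops
from skimage.morphology import closing
from skimage.filters import threshold_otsu
from skimage.transform import resize

def find_letters(image, min_area=100):
    """Return (32x32 character crops, bounding boxes) for dark characters on a light page."""
    gray = image.mean(axis=2) if image.ndim == 3 else image
    binary = gray < threshold_otsu(gray)                 # characters become foreground
    binary = closing(binary, np.ones((5, 5), dtype=bool))  # bridge small gaps in strokes
    crops, boxes = [], []
    for region in regionprops(label(binary)):
        if region.area < min_area:                       # drop speckles
            continue
        r1, c1, r2, c2 = region.bbox
        crop = binary[r1:r2, c1:c2].astype(float)
        crop = np.pad(crop, 10)                          # margin around the glyph
        crops.append(resize(crop, (32, 32)))
        boxes.append((r1, c1, r2, c2))
    return crops, boxes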

The text after identification and classification is shown below.

ABCs after Character Identification
“Deep Learning” after Character Identification

Although not 100% accurate, the performance is still quite satisfactory, with a few expected character pairs being misidentified, such as “5” and “S”, and “2” and “Z”.

The third part of the assignment involved implementing an autoencoder to learn compressed, lower-dimensional feature representations of the character images. The autoencoder used the ReLU activation function, with an encoder composed of 1024×32 and 32×32 layers feeding a decoder composed of 32×32 and 32×1024 layers. The loss function used was the total squared error between the input and output images.

The following images show the result of encoding and decoding certain sample characters.

The average PSNR over the validation set was about 15 dB. This corresponds to fairly low reconstruction quality due to the noise introduced by the compression, but the important features that have been captured can be seen in the decoded figures.


Photometric Stereo

Photometric stereo is a method for extracting the shape of an object from images taken under different lighting conditions. In the first part of this assignment, the n-dot-l model was used to render a sphere under three different lighting conditions. The sphere was assumed to be a fully reflective Lambertian sphere, with lighting directions from the bottom left, top right and top left. The results of this rendering are shown below.
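A sketch of how such a rendering can be produced under the n-dot-l model is shown below, assuming an orthographic view of a unit-albedo sphere; the radius and image resolution are placeholders.

import numpy as np

def render_ndotl_sphere(light_dir, radius=0.75, res=256):
    """Render a Lambertian sphere lit by a unit light direction (3-vector)."""
    xs = np.linspace(-1, 1, res)
    X, Y = np.meshgrid(xs, xs)
    Z2 = radius ** 2 - X ** 2 - Y ** 2
    inside = Z2 > 0
    normals = np.zeros((res, res, 3))
    normals[inside] = np.stack([X[inside], Y[inside], np.sqrt(Z2[inside])], axis=1)
    normals[inside] /= np.linalg.norm(normals[inside], axis=1, keepdims=True)
    image = np.clip(normals @ np.asarray(light_dir, float), 0, None)  # n . l, clamped at 0
    image[~inside] = 0
    return image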

The next portion of the assignment involved inverting the lighting model to extract the shape of a face imaged under seven different known lighting directions. The images are unrolled and stacked into a 7×P matrix, where P is the number of pixels in a single image. The seven known lighting directions are stacked into a 3×7 matrix. This results in the following equation.

N-dot-L Model Equation

Here, B is a 3×P matrix holding the normal of each pixel per column. Since the albedo (a measure of the reflectance of the surface) also has to be considered, it is folded into the unit normal of each pixel, so the magnitude of each solved column of B gives the surface albedo. This relation was converted into a linear system of equations and solved using a sparse linear system solver due to the large sizes of the matrices.
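A dense least-squares version of the same solve is sketched below for illustration; the assignment's sparse formulation stacks a larger block system, but numpy's lstsq captures the idea.

import numpy as np

def estimate_pseudonormals(I, L):
    """Solve L^T B = I for B in the least-squares sense.

    I: (7, P) stacked image intensities; L: (3, 7) lighting directions.
    Returns albedos (P,) and unit normals (3, P).
    """
    B, _, _, _ = np.linalg.lstsq(L.T, I, rcond=None)     # (3, P) pseudo-normals
    albedos = np.linalg.norm(B, axis=0)
    normals = B / (albedos + 1e-12)                      # avoid division by zero
    return albedos, normals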

The resulting pseudo-normals (the combined albedo-normal vectors) were separated into normals and albedos and visualized. The following figures show them.

From the image visualizing the normals, it can be observed that the surfaces shown in blue have normals facing out of the screen plane, those in red face horizontal directions, and those in green face vertical directions within the screen plane. The normals were then integrated using the Frankot-Chellappa algorithm to obtain the surface up to an ambiguity. The resulting surface maps are shown below.

The final portion of the assignment involved estimating the pseudo-normals without any information about the lighting directions. Starting from the earlier expression relating pixel intensities to surface normals, the problem becomes one of matrix factorization: given the pixel intensity matrix I, factor it into a lighting-direction matrix and a pseudo-normal matrix. This was done by decomposing I using SVD into U·S·Vᵀ, enforcing the rank-3 condition by zeroing all diagonal elements of S except the top three, and re-composing the matrix. A new decomposition was then computed to obtain the pseudo-normals and the lighting directions (from U and Vᵀ).
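A sketch of this factorization is shown below; here the square root of the singular values is absorbed into both factors, which corresponds to the correction described in the next paragraph.

import numpy as np

def factorize_intensities(I):
    """Rank-3 factorization I ~ L^T B via SVD, without known lighting directions."""
    u, s, vt = np.linalg.svd(I, full_matrices=False)
    u3, s3, vt3 = u[:, :3], s[:3], vt[:3]                # keep the top three components
    sqrt_s = np.diag(np.sqrt(s3))
    L_hat = (u3 @ sqrt_s).T                              # (3, 7) estimated lighting directions
    B_hat = sqrt_s @ vt3                                 # (3, P) pseudo-normals, up to a 3x3 ambiguity
    return L_hat, B_hat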

The initially estimated lighting directions were compared to the provided ground-truth values and found to be off; they were recomputed by composing the square root of the sigma matrix obtained during SVD with U and Vᵀ. After enforcing the integrability conditions on the pseudo-normals and integrating using the Frankot-Chellappa algorithm, the following surface maps were obtained.

The difference in quality between the two approaches (with and without the known lighting directions) can be seen by comparing the two reconstructions.

All code related to these assignments is maintained in a private GitHub repository in order to honor academic integrity.