Aligning images – an engineer’s solution

Recently I was struggling with a dataset that contained the same images in several versions that were not correctly aligned with each other. Since one of them had the location annotations that I used for training a Music Object Detector, I had to align them somehow.

To get an impression of the alignment error, look at the following images:

The top-left image is the binarized image, which serves as the reference. The top-right image is the original gray-scale image, which is misaligned so slightly that you can’t notice it just by looking at the two. So I generated the bit-wise difference between the two images, shown at the bottom. There you can almost read the full score, because the two versions are slightly shifted against each other.

Generating such a diff-image from two images in Python with the Pillow library basically boils down to:

from PIL import Image, ImageChops
# image_1 and image_2 are placeholder names for the two loaded PIL images
diff_image = ImageChops.difference(image_1, image_2)

allowing me to visually verify whether or not the images were aligned correctly.
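
For more than a handful of images, a number is easier to check than a picture. The following is a minimal sketch, not taken from the original script, that reports the fraction of array elements that differ between two images. The helper name mismatch_ratio and the synthetic test images are made up for illustration; real images could be converted with np.asarray first.

```python
import numpy as np


def mismatch_ratio(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Return the fraction of array elements that differ between two images."""
    if reference.shape != candidate.shape:
        raise ValueError("images must have the same shape to be compared")
    # Cast to a signed type so the subtraction cannot wrap around for uint8 input.
    difference = np.abs(reference.astype(np.int16) - candidate.astype(np.int16))
    return float(np.count_nonzero(difference)) / difference.size


# Tiny synthetic example: a vertical bar, and the same bar shifted by one pixel.
reference = np.zeros((4, 6), dtype=np.uint8)
reference[:, 2] = 255
shifted = np.roll(reference, 1, axis=1)

print(mismatch_ratio(reference, reference))  # identical images
print(mismatch_ratio(reference, shifted))    # shifted images
```

A ratio close to zero suggests that two images line up; where to put the threshold depends on the data and is left open here.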

It turned out that almost every image in the dataset had been transformed a little bit. Since the dataset contains 1000 images in multiple flavors, I needed some automation. As you can see, the images are not very far apart from each other. So while searching for a clever solution, I found a nice blog entry that aligns the slightly misaligned color channels of an image by applying an iterative algorithm to find an affine transformation (which is generally a very hard task). Luckily, that algorithm is readily available in OpenCV as cv2.findTransformECC. Using it is almost newbie-friendly:

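import cv2
import numpy as np

im1 = cv2.imread(path_to_desired_image)
im2 = cv2.imread(path_to_image_to_warp)

# findTransformECC expects single-channel images, so convert to gray-scale
im1_gray = cv2.cvtColor(im1, cv2.COLOR_BGR2GRAY)
im2_gray = cv2.cvtColor(im2, cv2.COLOR_BGR2GRAY)

# Search for an affine transformation, starting from the identity
warp_mode = cv2.MOTION_AFFINE
warp_matrix = np.eye(2, 3, dtype=np.float32)

# Specify the number of iterations.
number_of_iterations = 100

# Specify the threshold of the increment in the correlation 
# coefficient between two iterations
termination_eps = 1e-7

# Stop as soon as either the iteration count or the threshold is reached
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
            number_of_iterations, termination_eps)

# Run the ECC algorithm. The results are stored in warp_matrix.
# Note: some OpenCV 4.x releases additionally require the inputMask
# and gaussFiltSize arguments.
(cc, warp_matrix) = cv2.findTransformECC(im1_gray, im2_gray, warp_matrix, 
                                         warp_mode, criteria)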
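
To make the role of warp_matrix a bit more concrete: for cv2.MOTION_AFFINE it is a 2x3 matrix whose left 2x2 part covers rotation, scale and shear, and whose last column is a translation. The sketch below is only an illustration with plain NumPy; the helper apply_affine and the example shift are invented here and are not part of the original script.

```python
import numpy as np


def apply_affine(warp_matrix: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Apply a 2x3 affine matrix to an (N, 2) array of (x, y) points."""
    # Append a column of ones so the translation column takes part in the product.
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return homogeneous @ warp_matrix.T


points = np.array([[0.0, 0.0], [10.0, 5.0]])

# The identity is the starting guess that is handed to the ECC algorithm.
identity = np.eye(2, 3, dtype=np.float32)

# A pure translation: x' = x + 3, y' = y - 2.
shift = np.array([[1.0, 0.0, 3.0],
                  [0.0, 1.0, -2.0]])

print(apply_affine(identity, points))  # points stay where they are
print(apply_affine(shift, points))     # points move by (3, -2)
```

For images that are only slightly misaligned, as in this dataset, one would expect the estimated matrix to stay close to the identity.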

Lastly, one “only” needs to warp the image with the found affine transformation:

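# The output size is given as (width, height) of the reference image
sz = im1.shape
aligned_image = cv2.warpAffine(im2, warp_matrix, (sz[1], sz[0]),
                               flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)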

cv2.imwrite(destination_path, aligned_image)

The final result is remarkable. Can you still see the difference?

Diff of aligned images
Bitwise difference between the aligned images. A black pixel appears wherever the images differ. Since the images are aligned, the image is almost completely white.

Just a few pixels remain, and these stem from errors during the binarization of the image, which is necessarily a lossy operation. A cool side effect is that the images are now not only aligned but also have the same size.

The only things I needed to tweak a little were the two parameters number_of_iterations and termination_eps. Both belong to the termination criteria of cv2.findTransformECC: the first caps the number of iterations, and the second is the threshold for the increment of the correlation coefficient between two iterations, below which the result counts as good enough. When either is satisfied, the algorithm stops and returns the solution it has found. Letting the algorithm run for a few hours yielded a perfectly aligned dataset, which now allows me to go back to training my networks to detect musical objects.
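
The interplay of the two stopping conditions can be illustrated with a toy loop. This is not the ECC algorithm itself, just a sketch of the same "iteration budget or small enough change" rule, applied to Newton's iteration for the square root of 2. The function names are made up for this example, and the change is measured on the estimate rather than on a correlation coefficient.

```python
def iterate_until_converged(step, initial, max_iterations, eps):
    """Repeat `step` until the change drops below eps or the budget runs out."""
    value = initial
    for iteration in range(1, max_iterations + 1):
        new_value = step(value)
        if abs(new_value - value) < eps:
            # The change between two iterations is small enough: stop early.
            return new_value, iteration, "eps"
        value = new_value
    # The iteration budget is used up: return the best estimate so far.
    return value, max_iterations, "count"


def newton_sqrt2(x):
    """One Newton step towards the square root of 2."""
    return 0.5 * (x + 2.0 / x)


generous = iterate_until_converged(newton_sqrt2, 1.0, max_iterations=100, eps=1e-7)
tight = iterate_until_converged(newton_sqrt2, 1.0, max_iterations=2, eps=1e-7)

print(generous)  # stops because the change fell below eps
print(tight)     # stops because the iteration budget ran out
```

With a generous budget the threshold decides when to stop; with a tight budget the loop returns a rougher estimate. That is the trade-off behind tuning the two parameters.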

If you are interested in the full source code, you can find it in this GitHub repository.

The score images depicted in this article are from the CVC-MUSCIMA dataset by Alicia Fornés, Anjan Dutta, Albert Gordo and Josep Lladós, licensed under CC BY-NC-SA 4.0. More information on the dataset can be found here as well as in their original paper.