Aligning images – an engineer’s solution

Recently I was struggling with the fact that one of the datasets, that I was working with had the same images, but they were not correctly aligned. Since one of them had location annotations that I used for training a Music Object Detector, I had to align them somehow.

For getting an impression on the alignment-error, look at the following images:

The top-left image is the binarized image which serves as a reference. The top-right image is the original gray-scale image that is misaligned a tiny little bit, which you can’t even notice from just looking at them. So I’ve generated the bit-wise difference between the two images which is shown at the bottom and there you can almost read the full scores because they are slightly shifted and misplaced.

Generating such a diff-image from two images in Python with the Pillow library basically boils down to:

from PIL import Image, ImageChops
diff_image = ImageChops.difference(,

allowing me to visually verify whether or not the images were aligned correctly.

Turns out, that almost every image in the dataset was transformed a little bit. Since the dataset contains 1000 images in multiple flavors, I needed some automation. As you can notice, the images are not very far apart from each other. So upon searching for a clever solution, I found a nice blog entry which attempts to align color channels of images, that are slightly misaligned, by applying an iterative algorithm to find an affine transformation (which is generally a very hard task). Luckily, that algorithm is readily implemented in OpenCV and is called cv2.findTransformECC. Using it is almost newbie-friendly:

from cv2 import cv2, countNonZero, cvtColor

im1 = cv2.imread(path_to_desired_image)
im2 = cv2.imread(path_to_image_to_warp)

warp_mode = cv2.MOTION_AFFINE
warp_matrix = np.eye(2, 3, dtype=np.float32)

# Specify the number of iterations.
number_of_iterations = 100

# Specify the threshold of the increment in the correlation 
# coefficient between two iterations
termination_eps = 1e-7

            number_of_iterations, termination_eps)

# Run the ECC algorithm. The results are stored in warp_matrix.
(cc, warp_matrix) = cv2.findTransformECC(im1, im2, warp_matrix, 
                                         warp_mode, criteria)

Lastly, one “only” needs to warp the image with the found affine transformation:

# Get the target size from the desired image
target_shape = im1.shape

aligned_image = cv2.warpAffine(
                          (target_shape[1], target_shape[0]), 
                          flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP,

cv2.imwrite(destination_path, aligned_image)

The final result is remarkable. Can you still see the difference?

Diff of aligned images
Bitwise difference between the aligned images. A black pixel appears, where the images are not the same. Since the images are aligned, the image is almost completely white.

Just a few pixels remain, and these are because of errors during binarization of the image, which necessarily is a lossy operation. A cool side-effect is that the images are now not only aligned but also have the same size.

The only things, I needed to tweak a little bit where the two parameters number_of_iterations and termination_eps. Both are required for the cv2.findTransformECC algorithm and specify the maximum time that it tries to find a solution and the required quality before stopping. When either is satisfied, the algorithm stops and returns the found solution. Letting the algorithm run for a few hours, yielded a perfectly aligned the dataset, which allows me now to go back to train my networks to detect musical objects.

If you are interested in the full source-code, you can find it in this Github repository.

The score images depicted in this article are from the CVC-MUSCIMA dataset by Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós, licensed under CC BY-NC-SA 4.0. More information on the dataset can also be found here as well as in their original paper.


6 thoughts on “Aligning images – an engineer’s solution

  1. Interesting but unfortunately not working as published. You should mention that we need gray images and you use sz[0] and sz[1] but what are these???


    1. Hi Juerg, thanks for your feedback. I’ve just updated the post to clarify what this sz is. I think I mentioned it more than once, that I’m working with grayscale images. Please refer to Github for the whole source-code:
      And what do you mean by “not working as published”?
      BTW: If you are just interested in the aligned images of the CVC-MUSCIMA dataset, I’ve uploaded them to Github:

      Hope this helps.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s