Recently I was struggling with the fact that a dataset I was working with contained the same images in multiple versions, but they were not correctly aligned. Since one of them had location annotations that I used for training a Music Object Detector, I had to align them somehow.
To get an impression of the alignment error, look at the following images:
The top-left image is the binarized image, which serves as a reference. The top-right image is the original gray-scale image, which is misaligned so slightly that you can't even notice it from just looking at the two. So I generated the bit-wise difference between the images, shown at the bottom, and there you can almost read the full scores, because they are slightly shifted and misplaced.
Generating such a diff image from two images in Python with the Pillow library basically boils down to:
from PIL import Image, ImageChops

diff_image = ImageChops.difference(Image.open(image1_path), Image.open(image2_path))
diff_image.save(output_path)
This allowed me to visually verify whether the images were aligned correctly.
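Checking a thousand diff images by eye obviously does not scale, so pairs can also be flagged automatically by counting how many pixels differ. The following is just a minimal sketch of that idea, assuming both images have the same size and mode; the is_misaligned helper and the 1% threshold are made up for illustration:

from PIL import Image, ImageChops
import numpy as np

def is_misaligned(image1_path, image2_path, threshold=0.01):
    # Pixel-wise absolute difference of the two images
    diff_image = ImageChops.difference(Image.open(image1_path), Image.open(image2_path))
    difference = np.asarray(diff_image)
    # Flag the pair if more than the given fraction of pixels differs at all
    return np.count_nonzero(difference) / difference.size > threshold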
It turns out that almost every image in the dataset was transformed a little bit. Since the dataset contains 1000 images in multiple flavors, I needed some automation. As you can see, the images are not very far apart from each other. While searching for a clever solution, I found a nice blog entry that attempts to align the color channels of slightly misaligned images by applying an iterative algorithm to find an affine transformation (in general a very hard task). Luckily, that algorithm is readily implemented in OpenCV and is called cv2.findTransformECC. Using it is almost newbie-friendly:
import cv2
import numpy as np

# findTransformECC expects single-channel images, so read them as gray-scale
im1 = cv2.imread(path_to_desired_image, cv2.IMREAD_GRAYSCALE)
im2 = cv2.imread(path_to_image_to_warp, cv2.IMREAD_GRAYSCALE)

warp_mode = cv2.MOTION_AFFINE
warp_matrix = np.eye(2, 3, dtype=np.float32)

# Specify the maximum number of iterations.
number_of_iterations = 100

# Specify the threshold of the increment in the correlation
# coefficient between two iterations
termination_eps = 1e-7

criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
            number_of_iterations, termination_eps)

# Run the ECC algorithm. The results are stored in warp_matrix.
(cc, warp_matrix) = cv2.findTransformECC(im1, im2, warp_matrix, warp_mode, criteria)
Lastly, one “only” needs to warp the image with the found affine transformation:
# Get the target size from the desired image
target_shape = im1.shape

aligned_image = cv2.warpAffine(
    im2,
    warp_matrix,
    (target_shape[1], target_shape[0]),
    flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP,
    borderMode=cv2.BORDER_CONSTANT,
    borderValue=0)

cv2.imwrite(destination_path, aligned_image)
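One detail worth noting: as far as I understand it, the cv2.WARP_INVERSE_MAP flag is passed because findTransformECC estimates a warp that maps coordinates of the reference image to coordinates of the misaligned image, so bringing the misaligned image back onto the reference requires warpAffine to apply the mapping in the inverse direction.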
The final result is remarkable. Can you still see the difference?

Just a few pixels remain, and those are due to errors during binarization of the image, which is necessarily a lossy operation. A cool side effect is that the images are now not only aligned but also have the same size.
The only things I needed to tweak a little bit were the two parameters number_of_iterations and termination_eps. Both are required by the cv2.findTransformECC algorithm and specify its termination criteria: the maximum number of iterations the algorithm may run, and the minimum increment of the correlation coefficient between two iterations that still counts as progress. When either criterion is met, the algorithm stops and returns the solution it found. Letting the algorithm run for a few hours yielded a perfectly aligned dataset, which now allows me to go back to training my networks to detect musical objects.
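To give an idea of what the automation over the whole dataset could look like, here is a minimal sketch that ties the steps from above together. The directory layout, the align_image helper, and the error handling are assumptions for illustration, not the actual code from the repository:

import os
import cv2
import numpy as np

def align_image(path_to_desired_image, path_to_image_to_warp, destination_path):
    # Estimate the affine transformation between the two images ...
    im1 = cv2.imread(path_to_desired_image, cv2.IMREAD_GRAYSCALE)
    im2 = cv2.imread(path_to_image_to_warp, cv2.IMREAD_GRAYSCALE)
    warp_matrix = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-7)
    cc, warp_matrix = cv2.findTransformECC(im1, im2, warp_matrix,
                                           cv2.MOTION_AFFINE, criteria)
    # ... and warp the second image onto the first one
    aligned_image = cv2.warpAffine(im2, warp_matrix, (im1.shape[1], im1.shape[0]),
                                   flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP,
                                   borderMode=cv2.BORDER_CONSTANT, borderValue=0)
    cv2.imwrite(destination_path, aligned_image)

# Hypothetical layout: reference and misaligned images in two folders,
# matched by file name.
os.makedirs("aligned_images", exist_ok=True)
for file_name in os.listdir("reference_images"):
    try:
        align_image(os.path.join("reference_images", file_name),
                    os.path.join("misaligned_images", file_name),
                    os.path.join("aligned_images", file_name))
    except cv2.error:
        # findTransformECC throws if it cannot converge within the given criteria
        print(f"Could not align {file_name}")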
If you are interested in the full source code, you can find it in this GitHub repository.
The score images depicted in this article are from the CVC-MUSCIMA dataset by Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós, licensed under CC BY-NC-SA 4.0. More information on the dataset can also be found here as well as in their original paper.