Color Research and Application, Vol. 46, No. 2, 319-331, 2021
Subjective evaluation of colourized images with different colorization models
Two psychophysical experiments were conducted to evaluate the performance of grayscale image colorization models and to verify the objective image quality metrics adopted in grayscale image colorization. Twenty representative grayscale images were colourized by four colorization models, and three typical metrics, root mean square error (RMSE), peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), were used to characterize the objective quality of the colourized images. Forty observers evaluated those images according to their subjective preference in a pair-comparison experiment, and rated the perceived similarity between the generated and reference colour images on a seven-point scale. The experimental results indicate that the colorization models and the objective metrics perform differently in different scenarios. Each colorization method has its own advantages and disadvantages, and none of the tested models performed well for all images. For preference, the model proposed by Iizuka et al. based on ImageNet performed better, while for perceived similarity the models proposed by Zhang et al. and Iizuka et al., also based on ImageNet, outperformed the models of Larsson et al. and Iizuka et al., which were based on the Places dataset. Because many objects can occur in several distinct colours, a colorization algorithm cannot correctly reconstruct the ground-truth image for most grayscale images; nevertheless, the observers' perceived similarity and preference ratings were found to be correlated. In addition, the tested objective metrics correlated poorly with the subjective judgments of the human observers, and their performance varied significantly with image content. These findings demonstrate the limitations of current image colorization studies, and it is suggested that due consideration be given to human visual perception when evaluating the performance of colorization models.
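As an illustration of the kind of objective metrics named in the abstract, the following is a minimal sketch of RMSE and PSNR for 8-bit pixel values, together with a Pearson correlation that could relate a metric to mean observer ratings. It is not the authors' implementation: the function names, the flat-list image representation and the peak value of 255 are assumptions made for this example, and SSIM is omitted because it requires windowed local statistics.

```python
import math


def rmse(reference, generated):
    """Root mean square error between two equally sized flat lists of pixel values."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((r - g) ** 2 for r, g in zip(reference, generated))
    return math.sqrt(total / len(reference))


def psnr(reference, generated, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` assumes 8-bit data (assumption)."""
    err = rmse(reference, generated)
    if err == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 20.0 * math.log10(peak / err)


def pearson(xs, ys):
    """Pearson correlation, e.g. between metric values and mean observer ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A low absolute value of `pearson(metric_values, mean_ratings)` across images would correspond to the kind of weak agreement between objective metrics and observer judgments that the abstract reports; the variable names here are hypothetical.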