The paper outlines two easy-to-implement tips to improve your image classification test results:
- Do your inference on the test set at a higher resolution than your train set
- Fine-tune the last layers of your CNN classifier (i.e. the linear layer(s) after your pooling layer) at the higher test resolution
This article is a quick summary of 'Fixing the train-test resolution discrepancy' by Hugo Touvron, Andrea Vedaldi, Matthijs Douze and Hervé Jégou of Facebook AI Research, presented at NeurIPS 2019, with additional data from the follow-up note 'Fixing the train-test resolution discrepancy: FixEfficientNet' by the same authors
- Facebook AI Research (FAIR) used this technique to achieve a new SOTA result on ImageNet (88.5% top-1 accuracy) using EfficientNet (with extra training data)
- The authors also claim that it can enable faster training: train at a lower resolution while still attaining similar or better results
Our FixEfficientNet-L2 obtains a new state-of-the-art performance on ImageNet! You can find all our new results in the FixRes additional note (https://t.co/mvY3EkGucR) and also on @paperswithcode and @sotabench. (In case you missed the FixRes paper: https://t.co/2NgQcrGDk5)
— Hugo Touvron (@HugoTouvron), March 23, 2020
Typical training transforms such as RandomResizedCrop result in objects in training images appearing larger than they do in the test set. Have a look at the example from the paper below.
Our original image is resized to 224 x 224 before it is shown to the model.
RandomResizedCrop is used to resize our training image (and add a little regularisation), while for the test image a simple center crop is taken. As a result of these different resizing methods, the white horse in the top-left training image appears much larger than it would to the model at test time. It is this difference in object (e.g. horse) size that the authors say their FixRes technique addresses.
In other words:
...resizing the input images in pre-processing changes the distribution of objects sizes. Since different pre-processing protocols are used at training and testing time, the size distribution differs in the two cases.
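To get a feel for the size of this effect, here is a back-of-the-envelope sketch (my own simplification, not from the paper: square crops only, aspect-ratio jitter ignored, and torchvision's default RandomResizedCrop scale range of 8%-100% of the image area):

```python
import random

def train_magnification(img_side=500, out_side=224, scale=(0.08, 1.0)):
    """RandomResizedCrop-style transform: a crop covering a random fraction
    of the image area is resized to out_side, so objects inside it are
    magnified by out_side / crop_side."""
    area_frac = random.uniform(*scale)
    crop_side = img_side * area_frac ** 0.5
    return out_side / crop_side

def test_magnification(img_side=500, resize=256):
    """Test pipeline: resize the short side to `resize`, then center-crop.
    Center-cropping does not rescale, so magnification is resize / img_side."""
    return resize / img_side

random.seed(0)
avg_train = sum(train_magnification() for _ in range(10_000)) / 10_000
print(f"avg train magnification: {avg_train:.2f}")   # ~0.70
print(f"test magnification:      {test_magnification():.2f}")   # 0.51
```

On average the training pipeline magnifies objects noticeably more than the test pipeline does, which is exactly the apparent-size mismatch FixRes corrects.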
- Test at a Higher Resolution
Simply testing at a higher resolution should yield a performance improvement. Here the authors show ImageNet top-1 test accuracy for a model trained at 224 x 224; the optimal test resolution turned out to be 288 x 288:
(This behaviour was previously shown in 2016 in "Identity Mappings in Deep Residual Networks".) Alternatively, if you don't want to (or cannot) test at a higher resolution, training at a lower resolution is said to deliver the same accuracy while enabling faster training, as smaller images allow a larger batch size.
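In practice this tip is just a change to the evaluation transform. A torchvision-style sketch (the 256/224 values are the conventional ImageNet recipe; keeping the same resize-to-crop ratio at the higher resolution is my assumption, while the paper tunes the test resolution directly):

```python
from torchvision import transforms

# Standard evaluation pipeline for a model trained at 224 x 224:
eval_224 = transforms.Compose([
    transforms.Resize(256),        # resize the short side
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# FixRes-style evaluation: same recipe at a higher resolution
# (288 was the optimum reported for a 224-trained model).
eval_288 = transforms.Compose([
    transforms.Resize(int(288 * 256 / 224)),  # keep the same resize/crop ratio
    transforms.CenterCrop(288),
    transforms.ToTensor(),
])
```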
- Fine-tuning of later (classifier) layers of your CNN model
For the convolutional part of the CNN, comprising linear convolution, subsampling, ReLU, and similar layers, changing the input crop size is approximately transparent because the receptive field is unaffected by the input size. However, for classification the network must be terminated by a pooling operator (usually average pooling) in order to produce a fixed-size vector. Changing the size of the input crop strongly affects the activation statistics of this layer.
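A quick way to see this in code (a toy trunk with random weights, assuming PyTorch): the convolutional layers and adaptive average pooling accept any input resolution and always emit a fixed-size vector, but the spatial map being averaged grows with the input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Illustrative trunk only: one conv + ReLU, then global average pooling.
trunk = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU())
pool = nn.AdaptiveAvgPool2d(1)

for res in (224, 288):
    fmap = trunk(torch.randn(1, 3, res, res))
    vec = pool(fmap).flatten(1)
    print(f"input {res}: feature map {tuple(fmap.shape[2:])}, pooled {tuple(vec.shape)}")
# input 224: feature map (112, 112), pooled (1, 8)
# input 288: feature map (144, 144), pooled (1, 8)
```

The pooled vector has the same length either way, but at 288 it averages over a 144 x 144 map instead of 112 x 112, which is why the activation statistics seen by the classifier shift.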
When fine-tuning, the authors recommend using the test-time augmentation rather than the original training augmentation, as it is simplest and performs well; using the training augmentations gave only slightly better results.
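The fine-tuning step can be sketched as follows (a minimal toy model of my own, not the authors' code; the paper also fine-tunes the batch norm just before pooling, which I omit here):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Toy stand-in for a ResNet-style classifier: conv trunk,
    global average pool, then a linear classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # fixed-size output at any resolution
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.fc(self.pool(self.features(x)).flatten(1))

model = TinyCNN()

# FixRes-style fine-tuning: freeze the convolutional trunk,
# update only the classifier layer(s) after pooling.
for p in model.features.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One fine-tuning step at the *test* resolution (e.g. 288 instead of 224).
x = torch.randn(4, 3, 288, 288)          # higher-resolution batch
y = torch.randint(0, 10, (4,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

Because only the classifier is updated, this step is cheap even at the higher resolution.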
Interestingly, this technique is somewhat similar to Progressive Resizing, first espoused in the fast.ai deep learning course. The idea behind Progressive Resizing is that you first train at a lower resolution before increasing the resolution and training again, although there you always train the entire network rather than fine-tuning only the classifier layers as described above. Nevertheless, it makes me wonder whether both FixRes and Progressive Resizing work by correcting for the same train/test object-size mismatch.
Any thoughts, comments, suggestions I'd love to hear from you @mcgenergy on Twitter 😃