Non-Maximum Suppression

The concept of Non-Maximum Suppression (NMS) is pervasive in Deep Learning solutions, yet I have found it to be rarely discussed, and even less often formalized. This is an attempt at demystifying and formalizing it as well as I can. It is an undertaking I had already started in my thesis, Section 2.2.5, but now is an opportunity to make it more accessible.

Basically, NMS is useful when we want to obtain a sparse set of outputs from a model architecture that, for one or more reasons, does not produce it naturally. Convolutional neural networks, for instance, are good at implementing all-to-one or all-to-all functions, but usually struggle to output pointy maps. Noise in the annotations, upsampling layers in the architecture, or simply the various kinds of uncertainties in the task at hand can all introduce smoothness into the output maps and prevent abrupt transitions. Putting a threshold on such maps will inevitably lead to redundant predictions, unless of course there are none above it. NMS comes in as the ingredient that filters these dense predictions into a sparse set, with the redundant predictions killed.

A formula

I like to see NMS as very similar to the $\mathrm{argmax}$ operator, except that it outputs the locations of all the local maxima of some score function that lie above a certain threshold. It therefore makes sense to use similar notation.

In what I think to be the beautiful, general case, the NMS operator

$${\rm NMS}_{\mathcal{D},\mathcal{N},T}\ S = \{\, x \in \mathcal{D} \mid S(x) \geq T \ \text{ and }\ S(x) = \max_{x' \in \mathcal{N}(x)} S(x') \,\}$$

retains, across a domain $\mathcal D$, the locations $x \in \mathcal D$ with maximal score $S : \mathcal D \to \mathbb R$ inside some neighborhood $\mathcal N : \mathcal D \to 2^{\mathcal D}$, and whose scores are above some threshold $T \in \mathbb R$.
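
To make the notation concrete, here is a naive Python sketch of this definition. The names (`nms`, `score_fn`, `neighborhood_fn`, `threshold`) are mine, standing for $\mathcal D$, $S$, $\mathcal N$ and $T$; this is not the code of any particular library.

```python
def nms(domain, score_fn, neighborhood_fn, threshold):
    """Keep the elements of `domain` whose score passes `threshold`
    and is maximal within their own neighborhood."""
    kept = []
    for x in domain:
        s = score_fn(x)
        if s < threshold:
            continue
        # x survives only if no neighbor scores higher than it
        if all(score_fn(n) <= s for n in neighborhood_fn(x)):
            kept.append(x)
    return kept

# Toy 1-D example: five scores, neighborhoods of radius 1, threshold 0.5.
scores = {0: 0.2, 1: 0.9, 2: 0.8, 3: 0.1, 4: 0.7}
peaks = nms(
    domain=scores.keys(),
    score_fn=lambda x: scores.get(x, float("-inf")),
    neighborhood_fn=lambda x: [x - 1, x, x + 1],
    threshold=0.5,
)
# peaks == [1, 4]: location 2 is killed by its higher-scoring neighbor 1
```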

I like this very generic definition because it works with $\mathcal D$ being a domain in very different modalities: it can be the pixel space of a map, it can be the coordinates of bounding boxes, or, more generically, embeddings.

In the object detection literature, NMS has often referred to killing boxes that have a certain Intersection-over-Union (IoU), e.g. 70%, with higher-scoring ones. With embeddings, for instance in instance segmentation, the neighborhood rather takes the form of a ball. Panoptic-Deeplab, meanwhile, implements NMS over a heatmap (in the pixel space) as a simple x == maxpool(x) operation, the neighborhood being a square max-pooling kernel.
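
In the pixel-space case, the whole operator boils down to a couple of tensor operations. Here is a sketch of what such a max-pooling-based NMS can look like in PyTorch; the kernel size and threshold are arbitrary values of mine, not the ones of any particular paper.

```python
import torch
import torch.nn.functional as F

def heatmap_nms(heatmap, kernel_size=7, threshold=0.1):
    """heatmap: tensor of shape (B, 1, H, W). Returns the indices of the peaks."""
    # A location survives if it equals the maximum over its square neighborhood...
    pooled = F.max_pool2d(heatmap, kernel_size, stride=1, padding=kernel_size // 2)
    # ...and if its score passes the threshold.
    keep = (heatmap == pooled) & (heatmap >= threshold)
    return keep.nonzero()
```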

Implementation details

Digging into implementation details, max-pooling (in pixel spaces) is highly efficient and allows cancelled predictions to still inhibit lower-scoring predictions in their neighborhood. Other spaces (embeddings and bounding boxes) most often rely on a sequential, inefficient algorithm in which cancelled predictions do not cancel other predictions themselves, hence requiring a larger neighborhood. Interestingly, these subtleties are almost never mentioned in the literature.
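
For comparison, the sequential algorithm used for bounding boxes typically looks like the following sketch: a plain greedy loop, not the optimized version of any given framework. Note how a box that has already been suppressed never suppresses anything itself.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def sequential_nms(boxes, scores, iou_threshold=0.7):
    order = np.argsort(scores)[::-1]  # visit boxes by decreasing score
    kept = []
    for i in order:
        # keep box i only if it does not overlap too much with an already-kept box
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return kept
```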

Feature or liability?

While some papers try really hard to remove NMS, my take is that we can perhaps keep it if we can make it efficient and sound. Efficiency can come, for instance, from NMSing the output maps before running the sequential algorithm, given that CNNs exhibit spatially coherent behavior anyway. As for soundness, it is possible to achieve a degree of synergy between the emergent embedding space and the NMS. Using arbitrary neighborhoods such as 70% IoU between overlapping objects is quite probably undesired, as it hardcodes the impossibility of outputting strongly overlapping objects at the heart of the method.

Instead, we can help the network separate the “peaks” of overlapping objects as much as the NMS requires to detect them. We can do so spatially, as in Panoptic-Deeplab, i.e. by pushing “centers of mass” away from each other, but we can also change the meaning of the embeddings, e.g. by moving towards more abstract embeddings, or keep the same meaning but add an extra dimension.

In doing so, however, we have to ensure that our scoring function $S$ stays adequate, in the sense that it remains friendly for the architecture to approximate. For instance, Panoptic-Deeplab's seed map consists of Gaussian shapes at objects' centers of mass. Pushing them away from each other makes the pattern less recognizable by convolutional neural networks.
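
To illustrate what such a seed map looks like, here is roughly how a Gaussian-bump target can be rendered from the objects' centers of mass. This is a sketch of mine, with an arbitrary sigma, not Panoptic-Deeplab's exact recipe.

```python
import numpy as np

def seed_map(centers, height, width, sigma=8.0):
    """Render one Gaussian bump per (cy, cx) center, keeping the max where they overlap."""
    ys, xs = np.mgrid[0:height, 0:width]
    target = np.zeros((height, width), dtype=np.float32)
    for cy, cx in centers:
        bump = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        target = np.maximum(target, bump)
    return target
```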

Oh, and sometimes we can actually even relieve the network from having to output such maps at all. If this kind of mathematical gymnastics sounds interesting to you and you'd like to know more, don't hesitate to read Chapter 5 of my thesis!

TODOs

This article will be improved over time, mostly with citations and illustrations. Yet, I thought that putting it out there, as-is, could already generate interest. ;-)