When we think about how to parse an image, we imagine a two-dimensional input array to which clever manipulations have to be applied. So we try to come up with strategies for extracting edges, grouping them, using the groups for figure-ground segmentation, and so forth. Sometimes (often?), however, this unnecessarily complicates things.
Consider the image on the right. Even performing some trivial figure-ground segmentation on it might not be a straightforward matter (I wouldn’t know where to start, actually). In their 2009 paper (pdf) and recently at VSS (thanks Johan), Ruth Rosenholtz and Nat Twarog put forward the idea that it is sometimes sufficient to blow up an image into a higher-dimensional space in which the relevant features are represented explicitly. In the case of the bananas, we could, for example, convert the image to greyscale and use the resulting luminance values as a third dimension of the image, as seen in the image below. Now you can easily see the figure-ground segmentation, with the (yellow) bananas at the top, the (brown) background at the bottom, and the table somewhere in between (red, green).
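To make the "luminance as a third dimension" step concrete, here is a minimal sketch in Python with NumPy. It is not code from the paper: the function names `to_luminance` and `lift_to_3d`, the bin count, and the toy 6 x 6 "photo" are all my own assumptions, and the greyscale conversion uses the common Rec. 601 luma weights simply because some conversion had to be picked.

```python
import numpy as np

def to_luminance(rgb):
    """Convert an H x W x 3 RGB array (values in [0, 1]) to greyscale."""
    weights = np.array([0.299, 0.587, 0.114])  # Rec. 601 luma weights (an assumption)
    return rgb @ weights

def lift_to_3d(lum, n_bins=32):
    """Place each pixel in an H x W x n_bins boolean volume at its luminance bin."""
    h, w = lum.shape
    # Quantize luminance into n_bins levels along the new third axis.
    bins = np.clip((lum * n_bins).astype(int), 0, n_bins - 1)
    volume = np.zeros((h, w, n_bins), dtype=bool)
    rows, cols = np.indices((h, w))
    volume[rows, cols, bins] = True  # exactly one occupied voxel per pixel
    return volume

# Hypothetical toy stand-in for the photo: dark wall, mid-grey table, bright bananas.
toy = np.zeros((6, 6, 3))
toy[:, :] = [0.2, 0.1, 0.05]     # brownish background
toy[4:, :] = [0.5, 0.4, 0.3]     # table in the bottom two rows
toy[1:3, 2:5] = [0.9, 0.8, 0.1]  # banana-like yellow patch

lum = to_luminance(toy)
volume = lift_to_3d(lum)
```

In `volume`, the banana pixels end up in a high luminance bin, the table in a middle one, and the background near the bottom, which is the layering the plot shows. Nothing has been segmented yet; the same pixels have only been spread out along one more axis.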
This is not stunning news, of course: it does not magically parse objects from the image. It is simply a different way of representing the same information. But notice how it simplifies the problem of parsing an image and improves our understanding of the information present in it. We can then think more clearly about ways to process this information.
Moreover, this idea quite naturally allows for grouping at multiple levels (hierarchical organization). Typically, there is no single grouping present in an image: is it bananas as the foreground and table plus wall as the background? Or is it bananas and table versus the wall? In the higher-dimensional representation, the bananas go together if we only look at the top of the plot (grouping over luminance with a narrow filter). The bananas and the table are grouped together when we use a wider filter. Grouping scales emerge naturally in this representation and, as Rosenholtz et al. (2009) show, this very simple approach can lead to quite accurate parsing.
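The narrow-versus-wide filter idea can be sketched in a few lines of Python with NumPy. This is only an illustration under my own assumptions, not the procedure from Rosenholtz et al. (2009): a simple box window over luminance stands in for the filter, the helper `group_by_luminance` is hypothetical, and the luminance values of the toy image are made up.

```python
import numpy as np

def group_by_luminance(lum, seed_value, half_width):
    """Boolean mask of pixels whose luminance lies within half_width of seed_value."""
    return np.abs(lum - seed_value) <= half_width

# Hypothetical toy luminance map: wall (dark), table (mid), bananas (bright).
lum = np.full((6, 6), 0.12)  # wall / background
lum[4:, :] = 0.42            # table in the bottom two rows
lum[1:3, 2:5] = 0.75         # banana patch

top = lum.max()  # start from the top of the luminance axis

narrow = group_by_luminance(lum, top, half_width=0.1)  # bananas only
wide = group_by_luminance(lum, top, half_width=0.4)    # bananas plus table

print(int(narrow.sum()), int(wide.sum()))
```

With the narrow window only the six banana pixels are grouped; with the wide window the twelve table pixels join them, while the wall stays out. The two groupings come from the same representation and differ only in the width of the window along the luminance axis.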
I think GMin should represent images in higher-dimensional spaces from the start, allowing grouping mechanisms to act over various feature dimensions depending on the task at hand. Moreover, this representation looks like an early step in Jim DiCarlo’s “manifold untangling” framework. First we could blow up the space using the immediately available features (such as color and orientation); then, using further transformations, we would gradually turn it into an untangled representation with perceptually relevant features (such as objects and their poses).