In this poster, I describe a conceptual framework of mid-level vision built on three key ideas. First, I argue that the visual system relies on intermediate representations of its inputs to perform multiple tasks. These representations are based on a largely pre-semantic (prior to categorization) segmentation of the image into surfaces, which may partially overlap to compensate for occlusions, are ordered in depth, and are refined iteratively across the visual hierarchy. Second, I propose that intermediate representations could be formed by computing similarity between features in local image patches and then pooling highly similar units, with this operation repeated several times across the visual hierarchy. Finally, I suggest using datasets composed of realistically rendered artificial objects and surfaces to better understand a model's behavior and obtain informative feedback. To support this approach, I also present results from several experiments in our lab.
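The second idea, computing similarity between local features and then pooling highly similar units, can be illustrated with a minimal sketch. The abstract does not specify the similarity metric or the grouping rule, so cosine similarity, a fixed threshold, and greedy grouping are used here purely as placeholder assumptions; the function name `similarity_pooling` is likewise hypothetical.

```python
import numpy as np

def similarity_pooling(features, threshold=0.9):
    """One illustrative stage of similarity-then-pooling.

    features: (n_units, dim) array of local feature vectors.
    Units whose cosine similarity to a seed unit exceeds `threshold`
    are grouped and averaged, yielding a coarser representation.
    Cosine similarity and greedy grouping are assumptions, not the
    mechanism claimed in the poster.
    """
    # Normalize rows so dot products give cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.clip(norms, 1e-8, None)
    sim = normed @ normed.T

    # Greedy grouping: each unassigned unit seeds a group that
    # absorbs all still-unassigned units similar to it.
    assigned = np.full(len(features), -1)
    groups = []
    for i in range(len(features)):
        if assigned[i] >= 0:
            continue
        members = np.where((sim[i] >= threshold) & (assigned == -1))[0]
        assigned[members] = len(groups)
        groups.append(members)

    # Pool highly similar units by averaging their features.
    return np.stack([features[g].mean(axis=0) for g in groups])
```

Applying such a stage repeatedly, each time over a larger spatial neighborhood, would correspond to the proposed repetition across the visual hierarchy.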