"What I cannot create, I do not understand." – R. Feynman
Kohitij Kar, Jonas Kubilius, Elias B. Issa, Kailyn Schmidt, James J. DiCarlo

Does the primate ventral stream need cortical feedback to compute rapid online image-by-image object identity?

SfN, Washington, DC (USA), 2017-11-13 08:00 program link Cosyne 2017

Object identities across different images are represented in the pattern of neural responses in primate inferior temporal (IT) cortex. The algorithms that best approximate these neural responses in macaque IT belong to the family of hierarchical convolutional neural networks (HCNNs) with predominantly feedforward architectures. However, there is strong anatomical evidence of both local recurrent and long-range feedback connections within the primate ventral visual cortex. We hypothesized that the impact of these feedback connections would be most relevant at later time points in the stimulus-driven IT responses. Therefore, image representations that critically rely on these feedback computations will require additional processing time (beyond the initial evoked response at 70-100 ms, i.e. the feedforward pass) to emerge in IT. To test this hypothesis, we measured neural activity (chronically implanted multielectrode arrays; 288 electrodes/monkey) from IT cortex in two monkeys while they performed an image-by-image object identity estimation task (~3000 images, each containing 1 of 10 possible objects, randomly interleaved to neutralize attention). We first observed that monkeys outperform most HCNNs (e.g. AlexNet, VGG, GoogLeNet) on a significant number of images ('challenge images'). Consistent with previous results, we observed that the top layers of performance-optimized HCNNs predict ~50% of IT neural variance during the feedforward pass. However, their predictions significantly worsened (<20% explained variance) at later time points (140-200 ms) from image onset. Taken together, these observations suggest that, during these later time points, monkeys might be benefiting from additional computations carried out by feedback and lateral connections. These connections are unavailable in feedforward HCNNs, which would explain the models' poor prediction of late IT responses, and they may help boost the monkeys' object identification performance over that of HCNNs.
Consistent with this hypothesis, we also observed that object identity decodes from IT neural populations for the challenge images took ~20-30 ms longer to emerge (peaking around 150-180 ms from stimulus onset) than those for images on which monkeys and HCNNs perform equally ('control images'). These neural decoding latency differences were not explained by individual neural response latencies or by differences in low-level image properties such as contrast, luminance or spatial frequency. These results point to the importance of feedback in ventral stream object inference, and the observed image-by-image differences constrain the next generation of ventral stream models.
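
The abstract does not state how decode latency was quantified. As a minimal sketch of one possible criterion, the Python snippet below takes a time-resolved decoding accuracy trace and reports the first time bin in which accuracy reaches a chosen fraction of its peak above chance. The function name `decode_latency_ms`, the 90% threshold, the 10 ms bins and the accuracy values are all assumptions made for illustration; the numbers are invented and are not data from the study.

```python
import numpy as np


def decode_latency_ms(accuracy, bin_centers_ms, chance=0.1, frac_of_peak=0.9):
    """Centre of the first time bin where accuracy reaches `frac_of_peak`
    of its peak value above chance; None if accuracy never exceeds chance.

    Hypothetical criterion: the abstract does not specify the one it used.
    """
    accuracy = np.asarray(accuracy, dtype=float)
    bin_centers_ms = np.asarray(bin_centers_ms, dtype=float)
    if accuracy.shape != bin_centers_ms.shape:
        raise ValueError("accuracy and bin_centers_ms must have the same shape")

    above_chance = accuracy - chance
    peak = above_chance.max()
    if peak <= 0:
        return None

    threshold = frac_of_peak * peak
    # np.argmax on a boolean array returns the index of the first True.
    first_bin = int(np.argmax(above_chance >= threshold))
    return float(bin_centers_ms[first_bin])


# Invented accuracy traces in 10 ms bins (70-200 ms after image onset).
# Chance is 0.1 because the task has 10 possible objects.
bins = np.arange(70, 210, 10)
control = np.array([0.10, 0.20, 0.40, 0.60, 0.70, 0.72, 0.72,
                    0.71, 0.70, 0.70, 0.69, 0.68, 0.68, 0.67])
challenge = np.array([0.10, 0.12, 0.20, 0.30, 0.42, 0.52, 0.60,
                      0.66, 0.70, 0.71, 0.70, 0.69, 0.68, 0.67])

latency_control = decode_latency_ms(control, bins)
latency_challenge = decode_latency_ms(challenge, bins)
print(latency_control, latency_challenge)
```

A threshold defined relative to each trace's own peak keeps the latency comparison from being driven by differences in peak accuracy between the two image sets. Other criteria, such as time of peak accuracy, would be equally reasonable.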