How do we recognize what we see? Despite the deceptive ease of perceiving things, explaining how we see turns out to be a supremely difficult task. Only recently have advances in computer vision produced a class of models, known as deep neural networks, that are capable of matching human and non-human primate performance in several visual perception tasks. Our present aim is to develop these artificial systems further so that they simultaneously (i) predict primate neural and behavioral responses at a given level of analysis, (ii) map well onto brain anatomy, and (iii) generalize to novel stimuli similarly to primates. I will first introduce Brain-Score, our composite benchmark for an extensive comparison of deep nets to the primate ventral visual stream. Building on the insights gained from this benchmarking, I will describe the CORnet family of models, which are designed to commit to the biological realities of the visual cortex. I will then extend our benchmarking to a much wider set of images, including cartoons and paintings, to test the limits of generalization in humans and machines. Taken together, our approach establishes a good baseline deep neural network that could serve as a building block for developing capable artificial cognitive agents.