Training neural networks to recognize speech increased their correspondence to the human auditory pathway but did not yield a shared hierarchy of acoustic features
Thompson JAF., Bengio Y., Formisano E., Schönwiesner M.
Abstract
The correspondence between the activity of artificial neurons in convolutional neural networks (CNNs) trained to recognize objects in images and neural activity collected throughout the primate visual system has been well documented. Shallower layers of CNNs are typically more similar to early visual areas and deeper layers tend to be more similar to later visual areas, providing evidence for a shared representational hierarchy. This phenomenon has not been thoroughly studied in the auditory domain. Here, we compared the representations of CNNs trained to recognize speech (triphone recognition) to 7-Tesla fMRI activity collected throughout the human auditory pathway, including subcortical and cortical regions, while participants listened to speech. We found no evidence for a shared representational hierarchy of acoustic speech features. Instead, all auditory regions of interest were most similar to a single layer of the CNNs: the first fully-connected layer. This layer sits at the boundary between the relatively task-general intermediate layers and the highly task-specific final layers. This suggests that alternative architectural designs and/or training objectives may be needed to achieve fine-grained layer-wise correspondence with the human auditory pathway.

Highlights
Trained CNNs more similar to auditory fMRI activity than untrained
No evidence of a shared representational hierarchy for acoustic features
All ROIs were most similar to the first fully-connected layer
CNN performance on speech recognition task positively associated with fMRI similarity