This paper examines the extent to which large language models (LLMs) can perform tasks that require higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g., I think that you believe that she knows). We build on prior work by introducing a handwritten test suite, Multi-Order Theory of Mind Q&A, and using it to compare the performance of five LLMs of varying sizes and training paradigms to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on our ToM tasks overall, and that GPT-4 exceeds adult performance on 6th-order inferences. Our results suggest that model size and finetuning interact in determining higher-order ToM performance, and that the linguistic abilities of large models may support more complex ToM inferences. Given the important role that higher-order ToM plays in group social interaction and relationships, these findings have significant implications for the development of a broad range of social, educational, and assistive LLM applications.
Journal article
2025-01-01T00:00:00+00:00
19
AI, large language models, mentalizing, social AI, social cognition, theory of mind