Over the years, a variety of visual reasoning tasks have been proposed to evaluate machines’ ability to understand and reason about visual scenes. However, these benchmarks mostly focus on classifying the objects and items that exist in a scene. Common sense reasoning – an understanding of what might happen next, or what gave rise to the scene – is often absent from these benchmarks. Humans, on the other hand, are highly versatile and adept at numerous high-level, cognition-related visual reasoning tasks that go beyond pattern recognition and require common sense (e.g., physics, causality, functionality, and psychology).

In order to design systems with human-like visual understanding of the world, we would like to emphasize benchmarks and tasks that evaluate common sense reasoning across a variety of domains, including but not limited to:

Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen, The Lame Can’t Go Far: Visual Stream Limits The Video Question Answering

Alice Hein, Klaus Diepold, Winning Solution of the BIB MVCS Challenge 2022

Xin Huang, Jung Jae Kim, Hui Li Tan, Comparing classification and generation approaches to situated reasoning with vision-language pre-trained models

Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, Animesh Garg, SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

This track is on the Physion dataset, accepted at the NeurIPS 2021 benchmark track. The Physion dataset measures machines' ability to make predictions about commonplace real-world physical events, and covers a wide variety of physical phenomena: rigid and soft-body collisions, stable multi-object configurations, rolling and sliding, and projectile motion.

This track is on the PTR dataset, accepted at NeurIPS 2021. The PTR dataset focuses on common sense reasoning about parts and objects. It includes five types of questions: concept, relation (geometric & spatial), analogy, arithmetic, and physics. PTR requires machines to answer these questions based on synthetic RGB-D scenes.

This track is on the AGENT benchmark, accepted at ICML 2021. AGENT (Action, Goal, Efficiency, coNstraint, uTility) is a benchmark for Machine Social Common Sense. It consists of a large dataset of procedurally generated 3D animations structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs) that probe key concepts of core intuitive psychology.

This track is the challenge on the CLEVRER dataset, accepted at ICLR 2020, and the ComPhy dataset, accepted at ICLR 2022. CLEVRER is a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning tasks. Motivated by the theory of human causal judgment, CLEVRER includes four types of questions: descriptive (e.g., “what color”), explanatory (“what’s responsible for”), predictive (“what will happen next”), and counterfactual (“what if”). ComPhy takes a step further and requires machines to learn new compositional visible and hidden physical properties from only a few examples. ComPhy includes three types of questions: factual questions about the composition of visible and hidden physical properties, counterfactual questions about objects’ physical properties such as mass and charge, and predictive questions about objects’ future movement.

This track is on the BIB benchmark, accepted at NeurIPS 2021. The Baby Intuitions Benchmark (BIB) challenges machines to predict the plausibility of an agent's behavior based on the underlying causes of its actions.

This track is the challenge on the STAR benchmark, accepted at NeurIPS 2021. Reasoning in the real world is not divorced from situations. A key challenge is to capture the knowledge present in the surrounding situation and reason accordingly. STAR is a novel benchmark for Situated Reasoning that provides challenging question-answering tasks, symbolic situation descriptions, and logic-grounded diagnosis based on real-world video situations.

