Over the years, a variety of visual reasoning tasks have been proposed to evaluate machines' ability to understand and reason about visual scenes. However, these benchmarks mostly focus on classifying the objects and items present in a scene. Common sense reasoning – an understanding of what might happen next, or what gave rise to the scene – is often absent from these benchmarks. Humans, on the other hand, are highly versatile, adept at numerous high-level cognitive visual reasoning tasks that go beyond pattern recognition and require common sense (e.g., about physics, causality, functionality, and psychology).

In order to design systems with human-like visual understanding of the world, we would like to emphasize benchmarks and tasks that evaluate common sense reasoning across a variety of domains, including but not limited to:

Challenge Winners & Papers

Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen, The Lame Can’t Go Far: Visual Stream Limits The Video Question Answering

Alice Hein, Klaus Diepold, Winning Solution of the BIB MVCS Challenge 2022

Xin Huang, Jung Jae Kim, Hui Li Tan, Comparing classification and generation approaches to situated reasoning with vision-language pre-trained models

Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, Animesh Garg, SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Challenge Tracks

There will be six tracks in the machine vision common sense challenge:


This track is on the Physion dataset, accepted by the NeurIPS 2021 benchmark track. Physion measures machines' ability to make predictions about commonplace real-world physical events, covering a wide variety of physical phenomena – rigid- and soft-body collisions, stable multi-object configurations, rolling and sliding, and projectile motion.

Download Link
Evaluation Server


This track is on the PTR dataset, accepted by NeurIPS 2021. The PTR dataset focuses on common sense reasoning about parts and objects. It includes five types of questions: concept, relation (geometric and spatial), analogy, arithmetic, and physics. PTR requires machines to answer these questions based on synthetic RGBD scenes.

Download Link
Evaluation Server


This track is on the AGENT benchmark, accepted by ICML 2021. AGENT (Action, Goal, Efficiency, coNstraint, uTility) is a dataset for machine social common sense: a large collection of procedurally generated 3D animations structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs) that probe key concepts of core intuitive psychology.

Download Link
Evaluation Server


This track is the challenge on the CLEVRER dataset, accepted by ICLR 2020, and the ComPhy dataset, accepted by ICLR 2022. CLEVRER is a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning tasks. Motivated by the theory of human causal judgment, CLEVRER includes four types of questions: descriptive (e.g., "what color"), explanatory ("what's responsible for"), predictive ("what will happen next"), and counterfactual ("what if"). ComPhy takes a step further and requires machines to learn compositional visible and hidden physical properties from only a few examples. ComPhy includes three types of questions: factual questions about the composition of visible and hidden physical properties, counterfactual questions about objects' physical properties such as mass and charge, and predictive questions about objects' future movement.

Download Link (CLEVRER)
Download Link (ComPhy)
Evaluation Server (CLEVRER)
Evaluation Server (ComPhy)
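As an illustration only (not the official evaluation protocol of this track), per-question-type accuracy for a CLEVRER-style benchmark can be tallied as in the sketch below. The data layout and field names ("type", "answer") are hypothetical; the actual submission format is defined by the evaluation server.

```python
from collections import defaultdict

def accuracy_by_type(predictions, ground_truth):
    """Compute per-question-type accuracy for QA predictions.

    predictions: dict mapping question id -> predicted answer.
    ground_truth: dict mapping question id -> {"type": ..., "answer": ...},
    where "type" is one of "descriptive", "explanatory", "predictive",
    or "counterfactual" (hypothetical field names).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, truth in ground_truth.items():
        qtype = truth["type"]
        total[qtype] += 1
        if predictions.get(qid) == truth["answer"]:
            correct[qtype] += 1
    return {t: correct[t] / total[t] for t in total}

# Toy example in the hypothetical format:
gt = {
    "q1": {"type": "descriptive", "answer": "red"},
    "q2": {"type": "counterfactual", "answer": "yes"},
}
pred = {"q1": "red", "q2": "no"}
print(accuracy_by_type(pred, gt))  # {'descriptive': 1.0, 'counterfactual': 0.0}
```

Reporting accuracy per question type, rather than a single aggregate, is what lets diagnostic benchmarks like CLEVRER separate descriptive scene understanding from causal and counterfactual reasoning.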


This track is on the BIB benchmark, accepted by NeurIPS 2021. The Baby Intuitions Benchmark (BIB) challenges machines to predict the plausibility of an agent's behavior based on the underlying causes of its actions.

Download Link
Evaluation Server


This track is the challenge on the STAR benchmark, accepted by NeurIPS 2021. Reasoning in the real world is not divorced from situations: a key challenge is to capture the present knowledge from surrounding situations and reason accordingly. STAR is a novel benchmark for situated reasoning, providing challenging question-answering tasks, symbolic situation descriptions, and logic-grounded diagnosis via real-world video situations.

Download Link
Evaluation Server

Invited Speakers

Jiajun Wu

Leslie Kaelbling

Nick Haber

Tao Gao

Moira Dillon

Jitendra Malik


Yining Hong

Fish Tung

Kevin Smith

Zhenfang Chen

Tianmin Shu

Elias Wang

Kanishk Gandhi

Bo Wu

Qinhong Zhou

Senior Organizers

Joshua B. Tenenbaum

Antonio Torralba

Dan Yamins

Judith Fan

Chuang Gan

Contact Info

E-mail: yininghong@cs.ucla.edu