Multimedia Grand Challenges

The purpose of the Multimedia Grand Challenge is to engage the multimedia research community by establishing well-defined and objectively judged challenge problems intended to exercise state-of-the-art methods and inspire future research directions. The key criteria are that a Grand Challenge should be useful and interesting, and that its solution should involve a series of research tasks over a long period of time, with pointers towards longer-term research.

The Multimedia Grand Challenge proposals accepted for the ACM Multimedia 2020 edition are the following:

Social Media Prediction (SMP) Challenge

Continuing the series of Social Media Prediction (SMP) Challenges, the 2020 edition seeks research teams to propose new approaches to forecasting problems that meaningfully improve people’s social lives and business scenarios. Social media popularity prediction is a massive-scale, multimodal, time-series forecasting problem that is central to scenarios such as online advertising, social recommendation, and demand forecasting. We released a large-scale Social Media Prediction Dataset (SMPD) with over 680K posts, 80K users, and rich information including user profiles, images, texts, times, locations, and categories. This year the challenge will again be hosted by a joint team from multiple research organizations.

Deep Video Understanding Challenge

Deep video understanding is a difficult task that requires computer vision systems to analyze and understand the relationships between different entities in a video, to use known information to reason about other, more hidden information, and to populate a knowledge graph (KG) with all acquired information. The challenge for participating researchers is: given a long-duration movie, generate a knowledge base about the characters and their relations (family, work, social, etc.). To work on this challenge, a system should take all available modalities into consideration, pushing the limits of multimedia analysis to analyze long-duration videos holistically and to extract knowledge that can be used to answer different kinds of queries.
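To make the idea of a character knowledge base concrete, the following is a minimal sketch of one way such a graph could be stored and queried. It is not the challenge's required format: the class name CharacterGraph, the relation labels, and the character names are all hypothetical, and relations are assumed here to be directed (subject, relation, object) triples.

```python
from collections import defaultdict, deque


class CharacterGraph:
    """Hypothetical store of directed (subject, relation, object) triples."""

    def __init__(self):
        # Maps each subject to the set of (relation, object) pairs leaving it.
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        """Record one relation between two characters."""
        self._edges[subject].add((relation, obj))

    def related(self, subject, relation):
        """Return all objects linked to `subject` by `relation`, sorted."""
        return sorted(o for r, o in self._edges.get(subject, ()) if r == relation)

    def connected(self, start, goal):
        """Breadth-first search: is `goal` reachable from `start`?"""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                return True
            for _, neighbor in self._edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return False


# Illustrative usage with made-up characters and relation labels.
graph = CharacterGraph()
graph.add("Anna", "sibling_of", "Ben")
graph.add("Ben", "works_with", "Cara")
print(graph.related("Anna", "sibling_of"))
print(graph.connected("Anna", "Cara"))
```

Storing relations as triples keeps simple queries ("who is related to X by relation R?") and multi-hop queries ("is X connected to Y at all?") on the same structure.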

BioMedia: Multimedia in Medicine

The 2020 BioMedia Grand Challenge tackles the problem of predicting specific quality characteristics of human sperm using a multimodal dataset consisting of microscopic video recordings of human semen, associated sensor data, and participant-related data. The challenge presents four tasks, two of which are optional, each targeting a different aspect of sperm quality assessment with the help of multimodal data analysis. The first two tasks ask participants to predict the motility (movement) and morphology (shape and structure) of spermatozoa (living sperm). The third task, which is unsupervised, asks participants to examine individual spermatozoa and determine which one moves the fastest. The fourth task challenges participants to create a multimedia application that aids medical experts in picking promising spermatozoa. With this novel use case, we hope to motivate researchers to look into the field of medical multimedia and contribute to the world of assisted reproduction.

AI Meets Beauty

The "AI Meets Beauty" Challenge 2020 provides a large-scale dataset of over half a million images of beauty and personal care products, the Perfect-500K dataset, and asks participants to solve a challenging task: beauty and personal care product recognition. Specifically, given a real-world image containing one beauty or personal care item, the task is to match it to the same item in the Perfect-500K dataset. This is a practical but extremely challenging task, because Perfect-500K contains only a limited number of images from e-commerce sites and no real-world examples will be provided in advance.

Video Relation Understanding Challenge

The purpose of the Video Relation Understanding (VRU) challenge is to push video content analysis to the relational and structural level, which is expected to be a key direction for the next generation of AI-powered multimedia systems. This year’s challenge encourages participants to explore and develop innovative models and algorithms to detect object entities and the relationships between each pair of them in a given video. Specifically, all participants will join the main task of Visual Relation Detection and can optionally submit to Video Object Detection. Submissions will be evaluated with mean Average Precision (mAP) metrics on the large-scale video dataset VidOR. We hope that these tasks can advance the foundation of future systems capable of performing complex inferences, and further bridge the gap between vision and language.
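As a rough illustration of how a mean Average Precision score is formed, the sketch below computes average precision over a ranked list of predictions and then averages across videos. It is a simplification under stated assumptions, not the official VidOR evaluation: each prediction is assumed to have already been marked as a hit or a miss against the ground truth (the matching step is not shown), and the function names are hypothetical.

```python
def average_precision(hits, num_ground_truth):
    """Average precision for one ranked prediction list.

    hits: booleans in descending-confidence order; True means the
          prediction was matched to a ground-truth item (assumed given).
    num_ground_truth: number of ground-truth items for this video.
    """
    if num_ground_truth == 0:
        return 0.0
    true_positives = 0
    precision_sum = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            # Precision measured at the rank of each correct prediction.
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth


def mean_average_precision(per_video):
    """Mean of the per-video AP values.

    per_video: list of (hits, num_ground_truth) pairs, one per video.
    """
    if not per_video:
        raise ValueError("at least one video is required")
    scores = [average_precision(hits, n) for hits, n in per_video]
    return sum(scores) / len(scores)


# Illustrative usage with two made-up videos.
videos = [
    ([True, False, True], 2),
    ([False, True], 1),
]
print(round(mean_average_precision(videos), 4))
```

Dividing by the number of ground-truth items, rather than by the number of hits, penalizes a system that misses ground-truth items entirely.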

Pre-training for Video Captioning Challenge

The goal of this challenge is to offer fertile ground for designing vision-language pre-training techniques that facilitate vision-language downstream tasks (video captioning this year). To further motivate and challenge the multimedia community, we provide a large-scale video-language pre-training dataset (namely “Auto-captions on GIF”) for contestants to solve this challenging and emerging task. Specifically, contestants are asked to develop a video captioning system based on the Auto-captions on GIF dataset (as pre-training data) and the public MSR-VTT benchmark (as training data for the downstream task). For evaluation, a contesting system is asked to produce at least one sentence for each test video. The accuracy will be evaluated against pre-generated human sentence(s).

CitySCENE Anomaly Detection Challenge

The CitySCENE Challenge provides a large-scale city anomalous event detection dataset for researchers to benchmark their algorithms and thus contribute towards reproducible research. Participants are encouraged to address two tasks: (1) general anomaly detection, which classifies all anomalies in one group and all normal events in the other group; and (2) specific anomaly detection, which recognizes each of the anomalous activities. The frame-based receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) are used to evaluate the performance of each method. We hope the algorithms developed in this challenge can be applied to solving real-world problems in city management, public safety, traffic control, and environmental protection.
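The frame-based AUC used for the general anomaly detection task can be sketched as follows. This is an illustrative implementation, not the challenge's official scoring script: it assumes every frame has a binary label (1 for anomalous, 0 for normal) and a real-valued anomaly score, and it uses the rank-sum formulation of AUC, which equals the area under the ROC curve. The function name is hypothetical.

```python
def frame_level_auc(labels, scores):
    """Area under the ROC curve over frames, via the rank-sum formulation.

    labels: per-frame ground truth, 1 = anomalous, 0 = normal (assumed).
    scores: per-frame anomaly scores; higher means more anomalous.
    Frames with tied scores share the average of their ranks.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pairs = sorted(zip(scores, labels))
    total = len(pairs)
    num_pos = sum(1 for _, label in pairs if label == 1)
    num_neg = total - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("both anomalous and normal frames are required")

    rank_sum_pos = 0.0
    i = 0
    while i < total:
        # Find the run [i, j) of frames sharing the same score.
        j = i
        while j < total and pairs[j][0] == pairs[i][0]:
            j += 1
        # 1-based ranks i+1 .. j have this average.
        average_rank = (i + 1 + j) / 2.0
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += average_rank * positives_in_run
        i = j

    # Mann-Whitney U statistic, normalized to [0, 1].
    u_statistic = rank_sum_pos - num_pos * (num_pos + 1) / 2.0
    return u_statistic / (num_pos * num_neg)


# Illustrative usage with four made-up frames.
print(frame_level_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to scores that carry no information about the labels, and 1.0 to scores that rank every anomalous frame above every normal frame.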

Large-scale Human-centric Video Analysis in Complex Events

In this grand challenge, we focus on challenging and realistic tasks of human-centric analysis in various crowded and complex events (such as earthquake escape, dining in a canteen, and getting off a train). We present the largest existing dataset for understanding human motion, pose, and action in a variety of realistic events, which includes the currently largest number of poses (>1M), the largest number of complex-event action labels (>65k), and one of the largest numbers of long-term trajectories (>1M, with average trajectory length >500). Three tasks are established on our dataset: multi-person motion tracking, crowd pose estimation and tracking, and person-level action recognition. This challenge will encourage researchers to address difficult, realistic problems in human-centric analysis.