Benchmark for Anonymous Video Analytics - Dataset

Download Dataset

The dataset was collected in settings that mimic real-world signage-camera setups used for AVA. It comprises 16 videos recorded at different locations such as airports, malls, subway stations, and pedestrian areas. Outdoor videos were recorded at different times of day (morning, afternoon, and evening). The videos were captured with fixed Internet Protocol or USB cameras fitted with wide and narrow lenses to mimic real-world use cases, at 1920x1080 resolution and 30 fps. Video durations range from 2 min 30 s to 6 min 26 s, totaling over 78 minutes and over 141,000 frames. The videos feature 34 professional actors.

A sample frame from each location is shown below. For the mall location, pairs of videos were recorded in two settings: indoors (Mall-1/2) and outdoors (Mall-3/4).


[Sample frames: Airport-1, Airport-2, Airport-3, Airport-4, Mall-1/2, Mall-3/4, Pedestrian-1, Pedestrian-2, Pedestrian-3, Pedestrian-4, Pedestrian-5, Subway-1, Subway-2, Subway-3]

Annotations

A professional team of annotators used the Computer Vision Annotation Tool (CVAT) to fully annotate all videos with the following attributes: face and body bounding boxes, identity, age, gender, attention, pose, orientation, and occlusions (figure below). Annotations are provided as XML files in CVAT format.

Dataset annotations
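Because the annotations use the CVAT XML format, they can be loaded with Python's standard library alone. The sketch below parses a minimal CVAT-style snippet into per-track box lists; the field names follow CVAT's video-annotation schema, but the exact attribute set in the released files may differ, so treat this as an illustrative starting point.

```python
import xml.etree.ElementTree as ET

# Minimal CVAT-style XML snippet (illustrative; the real files carry many
# more attributes such as pose, attention, and occlusion flags).
CVAT_XML = """
<annotations>
  <track id="0" label="person">
    <box frame="0" xtl="100.0" ytl="50.0" xbr="220.0" ybr="400.0" outside="0">
      <attribute name="gender">female</attribute>
    </box>
  </track>
</annotations>
"""

def load_tracks(xml_text):
    """Return {track_id: [(frame, (xtl, ytl, xbr, ybr)), ...]} for visible boxes."""
    root = ET.fromstring(xml_text)
    tracks = {}
    for track in root.iter("track"):
        tid = int(track.get("id"))
        boxes = []
        for box in track.iter("box"):
            if box.get("outside") == "1":  # person is outside the field of view
                continue
            coords = tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr"))
            boxes.append((int(box.get("frame")), coords))
        tracks[tid] = boxes
    return tracks

tracks = load_tracks(CVAT_XML)
print(tracks)  # {0: [(0, (100.0, 50.0, 220.0, 400.0))]}
```

For real files, replace `ET.fromstring` with `ET.parse(path).getroot()`.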

To prevent the analytics from focusing on very small (far from the signage) people, who are likely to have no OTS, and to simplify the annotation process, we define a region in some scenarios where people are omitted and thus not annotated. We refer to these regions as ignore areas; they are shown with white shading in the sample frames. Further information is available in the paper.
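A simple way to honor ignore areas when evaluating a detector is to drop any detection that falls inside them. The sketch below uses a box-centre test against a rectangular ignore region; both the rectangle and the centre-point criterion are illustrative assumptions here, so check the paper for the exact rule.

```python
def inside_ignore(box, ignore):
    """True if the centre of `box` falls inside the `ignore` rectangle.
    Both are (xtl, ytl, xbr, ybr) tuples in pixels. The centre-point
    criterion is an assumption, not necessarily the benchmark's rule."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return ignore[0] <= cx <= ignore[2] and ignore[1] <= cy <= ignore[3]

# Hypothetical far-field strip at the top of a 1920x1080 frame.
IGNORE = (0, 0, 1920, 200)
detections = [(100, 50, 200, 180), (400, 500, 520, 900)]
kept = [b for b in detections if not inside_ignore(b, IGNORE)]
print(kept)  # [(400, 500, 520, 900)]
```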

The annotations maintain the identity of each person throughout a video, even if the person exits and re-enters the field of view, as well as across videos.

Each video includes between 11 and 158 unique people. In total, the annotations cover 785 unique people and over 748,000 bounding boxes of people.


Dataset summary:

Name | Daytime | Illuminance [lux] | Length [min:sec] | Frames | Unique people (All) | Unique people (OTS) | People boxes | People per frame | Face boxes | Faces per frame
Airport-1 | - | 500 | 5:21 | 9629 | 37 | 29 | 22062 | 2.4 ± 1.0 | 12832 | 1.4 ± 1.1
Airport-2 | - | 500 | 5:34 | 10008 | 35 | 29 | 23600 | 2.7 ± 1.4 | 14214 | 1.6 ± 1.2
Airport-3 | - | 500 | 6:26 | 11578 | 47 | 44 | 26704 | 2.4 ± 1.2 | 17849 | 1.6 ± 1.0
Airport-4 | - | 500 | 5:08 | 9247 | 61 | 56 | 43685 | 4.7 ± 2.0 | 17792 | 1.9 ± 1.2
Mall-1 | - | 300 | 4:38 | 8344 | 158 | 111 | 106852 | 12.8 ± 2.3 | 45835 | 5.5 ± 1.8
Mall-2 | - | 300 | 3:41 | 6626 | 145 | 105 | 95417 | 14.4 ± 3.7 | 42779 | 6.5 ± 2.3
Mall-3 | - | 800 | 5:25 | 9740 | 33 | 30 | 37120 | 3.8 ± 1.5 | 18906 | 1.9 ± 1.2
Mall-4 | - | 800 | 6:04 | 10931 | 53 | 50 | 47113 | 4.3 ± 1.6 | 32038 | 2.9 ± 1.3
Pedestrian-1 | Afternoon | 60000 | 5:40 | 10202 | 18 | 17 | 39680 | 4.0 ± 1.7 | 19859 | 2.0 ± 1.4
Pedestrian-2 | Afternoon | 40000 | 6:15 | 11262 | 56 | 40 | 58477 | 5.2 ± 1.7 | 25042 | 2.2 ± 1.6
Pedestrian-3 | Midday-overcast | 7000 | 5:41 | 10220 | 27 | 25 | 22738 | 2.3 ± 1.2 | 13915 | 1.4 ± 1.0
Pedestrian-4 | Midday-shade | 5500 | 4:32 | 8166 | 27 | 25 | 33031 | 4.0 ± 1.4 | 16248 | 2.0 ± 1.0
Pedestrian-5 | Evening | 250 | 2:58 | 5350 | 11 | 11 | 24476 | 4.6 ± 1.6 | 13504 | 2.5 ± 1.8
Subway-1 | - | 180 | 3:13 | 5795 | 17 | 17 | 36828 | 6.5 ± 3.1 | 25884 | 4.6 ± 2.7
Subway-2 | - | 180 | 2:32 | 4549 | 29 | 28 | 45125 | 9.9 ± 2.8 | 24248 | 5.3 ± 2.4
Subway-3 | - | 200 | 5:45 | 10342 | 31 | 29 | 85358 | 8.5 ± 2.9 | 35460 | 3.6 ± 1.7
Overall | - | [180, 60000] | 78:53 | 141989 | 785 | 646 | 748266 | 5.4 ± 3.9 | 376405 | 2.7 ± 2.1
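The per-frame columns above (e.g. 2.4 ± 1.0 people per frame for Airport-1) can be reproduced from the annotations by counting boxes per frame and taking the mean and standard deviation. A minimal sketch with hypothetical (frame, person) pairs; whether the benchmark uses the population or sample standard deviation is an assumption here:

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical (frame, person_id) pairs extracted from one video's annotations.
observations = [(0, "a"), (0, "b"), (1, "a"), (1, "b"), (1, "c"), (2, "a")]

per_frame = Counter(f for f, _ in observations)     # boxes per frame
counts = [per_frame[f] for f in sorted(per_frame)]  # [2, 3, 1]
print(f"{mean(counts):.1f} ± {pstdev(counts):.1f}")  # 2.0 ± 0.8
```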

Intel is committed to respecting human rights and avoiding complicity in human rights abuses. See Intel's Global Human Rights Principles. Intel's products and software are intended only to be used in applications that do not cause or contribute to a violation of an internationally recognized human right.