Benchmark for Anonymous Video Analytics - Dataset

Download Dataset

The dataset was collected in settings that mimic real-world signage-camera setups used for AVA. It comprises 16 videos recorded at different locations such as airports, malls, subway stations, and pedestrian areas. Outdoor videos were recorded at different times of day (morning, afternoon, and evening). The videos were captured with fixed Internet Protocol or USB cameras fitted with wide and narrow lenses to mimic real-world use cases, at 1920x1080 resolution and 30 fps. Video durations range from 2 min 30 s to 6 min 26 s, totaling over 78 minutes and over 141,000 frames. The videos feature 34 professional actors.

A sample frame from each location is shown below. For the mall location, pairs of videos were recorded in two settings: indoors (Mall-1/2) and outdoors (Mall-3/4).


[Sample frames: Airport-1, Airport-2, Airport-3, Airport-4, Mall-1/2, Mall-3/4, Pedestrian-1, Pedestrian-2, Pedestrian-3, Pedestrian-4, Pedestrian-5, Subway-1, Subway-2, Subway-3]

Annotations

A professional team of annotators used the Computer Vision Annotation Tool (CVAT) to fully annotate all videos with the following attributes: face and body bounding boxes, identity, age, gender, attention, pose, orientation, and occlusions (figure below). Annotations are provided as XML files in CVAT format.

Dataset annotations
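Because the annotations use the CVAT XML format, they can be loaded with Python's standard library alone. The sketch below parses a minimal CVAT-style snippet into per-track box lists; the field names follow CVAT's video-annotation schema, but the exact attribute set in the released files may differ, so treat this as an illustrative starting point.

```python
import xml.etree.ElementTree as ET

# Minimal CVAT-style XML snippet (illustrative; the real files carry many
# more attributes such as pose, attention, and occlusion flags).
CVAT_XML = """
<annotations>
  <track id="0" label="person">
    <box frame="0" xtl="100.0" ytl="50.0" xbr="220.0" ybr="400.0" outside="0">
      <attribute name="gender">female</attribute>
    </box>
  </track>
</annotations>
"""

def load_tracks(xml_text):
    """Return {track_id: [(frame, (xtl, ytl, xbr, ybr)), ...]} for visible boxes."""
    root = ET.fromstring(xml_text)
    tracks = {}
    for track in root.iter("track"):
        tid = int(track.get("id"))
        boxes = []
        for box in track.iter("box"):
            if box.get("outside") == "1":  # person is outside the field of view
                continue
            coords = tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr"))
            boxes.append((int(box.get("frame")), coords))
        tracks[tid] = boxes
    return tracks

tracks = load_tracks(CVAT_XML)
print(tracks)  # {0: [(0, (100.0, 50.0, 220.0, 400.0))]}
```

For real files, replace `ET.fromstring` with `ET.parse(path).getroot()`.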

To prevent the analytics from focusing on very small (far from the signage) people, who are likely to have no OTS, and to simplify the annotation process, we define a region in some scenarios where people are omitted and thus not annotated. We refer to these regions as ignore areas; they are shown with white shading in the sample frames. Further information is available in the paper.
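A simple way to honor ignore areas when evaluating a detector is to drop any detection that falls inside them. The sketch below uses a box-centre test against a rectangular ignore region; both the rectangle and the centre-point criterion are illustrative assumptions here, so check the paper for the exact rule.

```python
def inside_ignore(box, ignore):
    """True if the centre of `box` falls inside the `ignore` rectangle.
    Both are (xtl, ytl, xbr, ybr) tuples in pixels. The centre-point
    criterion is an assumption, not necessarily the benchmark's rule."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return ignore[0] <= cx <= ignore[2] and ignore[1] <= cy <= ignore[3]

# Hypothetical far-field strip at the top of a 1920x1080 frame.
IGNORE = (0, 0, 1920, 200)
detections = [(100, 50, 200, 180), (400, 500, 520, 900)]
kept = [b for b in detections if not inside_ignore(b, IGNORE)]
print(kept)  # [(400, 500, 520, 900)]
```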

The annotations maintain the identity of each person throughout a video, even if the person exits and re-enters the field of view, as well as across videos.

Each video includes between 11 and 158 unique people. In total, the annotations cover 785 unique people and over 748,000 bounding boxes of people.


Dataset summary:

Name | Daytime | Illuminance [lux] | Length [min:sec] | Frames | Unique people (All) | Unique people (OTS) | People boxes | People per frame | Face boxes | Faces per frame
Airport-1 | - | 500 | 5:21 | 9629 | 37 | 29 | 22062 | 2.4 ± 1.0 | 12832 | 1.4 ± 1.1
Airport-2 | - | 500 | 5:34 | 10008 | 35 | 29 | 23600 | 2.7 ± 1.4 | 14214 | 1.6 ± 1.2
Airport-3 | - | 500 | 6:26 | 11578 | 47 | 44 | 26704 | 2.4 ± 1.2 | 17849 | 1.6 ± 1.0
Airport-4 | - | 500 | 5:08 | 9247 | 61 | 56 | 43685 | 4.7 ± 2.0 | 17792 | 1.9 ± 1.2
Mall-1 | - | 300 | 4:38 | 8344 | 158 | 111 | 106852 | 12.8 ± 2.3 | 45835 | 5.5 ± 1.8
Mall-2 | - | 300 | 3:41 | 6626 | 145 | 105 | 95417 | 14.4 ± 3.7 | 42779 | 6.5 ± 2.3
Mall-3 | - | 800 | 5:25 | 9740 | 33 | 30 | 37120 | 3.8 ± 1.5 | 18906 | 1.9 ± 1.2
Mall-4 | - | 800 | 6:04 | 10931 | 53 | 50 | 47113 | 4.3 ± 1.6 | 32038 | 2.9 ± 1.3
Pedestrian-1 | Afternoon | 60000 | 5:40 | 10202 | 18 | 17 | 39680 | 4.0 ± 1.7 | 19859 | 2.0 ± 1.4
Pedestrian-2 | Afternoon | 40000 | 6:15 | 11262 | 56 | 40 | 58477 | 5.2 ± 1.7 | 25042 | 2.2 ± 1.6
Pedestrian-3 | Midday-overcast | 7000 | 5:41 | 10220 | 27 | 25 | 22738 | 2.3 ± 1.2 | 13915 | 1.4 ± 1.0
Pedestrian-4 | Midday-shade | 5500 | 4:32 | 8166 | 27 | 25 | 33031 | 4.0 ± 1.4 | 16248 | 2.0 ± 1.0
Pedestrian-5 | Evening | 250 | 2:58 | 5350 | 11 | 11 | 24476 | 4.6 ± 1.6 | 13504 | 2.5 ± 1.8
Subway-1 | - | 180 | 3:13 | 5795 | 17 | 17 | 36828 | 6.5 ± 3.1 | 25884 | 4.6 ± 2.7
Subway-2 | - | 180 | 2:32 | 4549 | 29 | 28 | 45125 | 9.9 ± 2.8 | 24248 | 5.3 ± 2.4
Subway-3 | - | 200 | 5:45 | 10342 | 31 | 29 | 85358 | 8.5 ± 2.9 | 35460 | 3.6 ± 1.7
Overall | - | [180, 60000] | 78:53 | 141989 | 785 | 646 | 748266 | 5.4 ± 3.9 | 376405 | 2.7 ± 2.1
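The per-frame columns above (e.g. 2.4 ± 1.0 people per frame for Airport-1) can be reproduced from the annotations by counting boxes per frame and taking the mean and standard deviation. A minimal sketch with hypothetical (frame, person) pairs; whether the benchmark uses the population or sample standard deviation is an assumption here:

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical (frame, person_id) pairs extracted from one video's annotations.
observations = [(0, "a"), (0, "b"), (1, "a"), (1, "b"), (1, "c"), (2, "a")]

per_frame = Counter(f for f, _ in observations)     # boxes per frame
counts = [per_frame[f] for f in sorted(per_frame)]  # [2, 3, 1]
print(f"{mean(counts):.1f} ± {pstdev(counts):.1f}")  # 2.0 ± 0.8
```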

Intel is committed to respecting human rights and avoiding complicity in human rights abuses. See Intel's Global Human Rights Principles. Intel's products and software are intended only to be used in applications that do not cause or contribute to a violation of an internationally recognized human right.