This document outlines the sample dataset sequences from Parallel Domain (PD).
Both the 3D world and the scenario featured in these sequences are generated 100% automatically from OpenStreetMap data. Parallel Domain's procedural world generator creates the environment and initial scenario with no manual intervention; from there, PD's agent simulation system takes over, driving traffic behavior automatically. The sample sequences are taken from a 25,000-frame dataset, itself automatically generated, simulated, and rendered by PD's systems with no human intervention.
What's Included In the Sample Dataset
Location: San Francisco-style urban environment with a mix of one-way and two-way streets and signalized intersections.
Camera Settings: The camera settings on the sample sequences are calibrated to reflect the characteristics we typically see in our customers' real-world cameras. All of these settings are fully customizable, but by default we use a color look-up table that emulates the desaturated, slightly over-exposed settings that perception teams tend to use. We have found that matching these color settings to the real-world sensors is important for achieving the best results when training and testing computer vision models that need to perform on real-world data. As a result, the images more closely approximate the way a camera sees the world rather than chasing the most colorful, shiny graphics that please the human eye.
Environmental Conditions: Mix of overcast, clear, and partially cloudy lighting environments, from mid-morning to mid-afternoon. Camera exposure settings range from slightly under- to slightly over-exposed, roads have 3-dimensional crowning, and our dynamic agents (e.g. cars) have directional variation, all to represent a range of potential real-world conditions. Depending on your use case, these conditions can be heightened or removed.
Vehicles: Representative mix of common road vehicles (including light trucks, shipping vans, and buses). No motorcycles or bicycles are included in the sample dataset.
Pedestrians: A variety of stationary animated pedestrians in various configurations on sidewalks.
Sensors: Forward-facing 1080p stereo camera pair, with a separation distance of 0.8 m.
Annotations: Per-pixel labels for semantic ids, instance ids (vehicle and pedestrian classes), depth, optical flow, and surface normals. 2D and 3D bounding boxes for instanced classes (vehicle and pedestrian classes). Sensor calibration and pose information for the ego vehicle and all sensors.
1) Small Sample Dataset (~800 frames), accessible via Google Drive: 2 contiguous sequences of 199 frames each, sampled at 10 Hz. Please note: this sample dataset is smaller and will not show the full breadth of conditions noted above. Please get in touch with us to request access to the Larger Sample Dataset via AWS.
2) Larger Sample Dataset (~5,200 frames), accessible via AWS: 13 contiguous sequences of 199 frames each, sampled at 10 Hz.
The above conditions have been chosen to give you a preview of Parallel Domain's data generation capabilities and are configurable when you engage with Parallel Domain.
The Parallel Domain sample dataset is provided in a format called DGP (Dataset Governance Protocol). Its goal is to provide a common format for the interchange of computer vision datasets, with a focus on autonomous vehicle applications. The protocol makes it simple to access and index the data within a dataset. As a dataset provider, Parallel Domain seeks to provide an easy-to-use format for consuming our data – if your company has specific dataset format needs, please let our team know.
DGP is an open-source schema distributed under the GPLv3 license, and is developed by a consortium of groups including Parallel Domain and Toyota Research Institute. For more information on licensing, please see the LICENSE file included with dgp_proto.zip, attached at the bottom of the page.
The schema is defined by a set of protobuf-based schema definitions, which can be used to generate interfaces for a variety of programming languages. If you are unfamiliar with protobuf, please see the Google protobuf documentation referenced at the bottom of this document for more details. The schema files themselves are also linked at the bottom of this document. DGP is nearing its first public release; once that goes live, a suite of tools will be available for easily visualizing data in DGP format and incorporating it into various standard ML frameworks.
DGP is structured in a hierarchical fashion to enable easy interaction with the data in a dataset across a variety of applications. Below, we walk through the primary classes in the schema that are present in the PD sample dataset.
The top level structure in a DGP dataset is the Dataset object. This object encapsulates the common metadata for all data within the set and provides information on the provenance of the data itself.
A DGP dataset is made up of a set of Scenes. Each Scene represents a slice of data sourced from a single capture run, whether captured in the real world or generated virtually. The Scene objects contain references to all the datums within the dataset and can be used to quickly index any data element required.
A DGP Scene contains a list of samples that make up the Scene. Each Sample represents a sampling of the included sensor data at a specific point in time. The Sample stores references to individual Datum objects produced by each sensor at this timestep.
Additionally, each Sample record provides a key to a calibration file containing the calibration data for all sensors active in that sample. The calibration data is stored under the calibration folder, in files named <key name>.json. Each of these files holds a serialized SampleCalibration object.
A Datum object contains a single data point from a single sensor in the sequence. Each Datum object can represent either an Image sourced from a camera sensor or a PointCloud produced by a LiDAR or other sensor, along with references to any associated annotations for the Datum. The current PD sample dataset includes only camera data; however, our systems are also capable of producing LiDAR data in various configurations.
Within each Datum, we store the ego vehicle pose at the relevant timestep inside the pose member. This record contains both the translation and rotation of the ego vehicle (relative to a fixed world coordinate frame). The rotation is stored as a quaternion, which can be easily converted to a matrix representation if desired. This ego pose information, along with the camera extrinsic pose stored in the calibration records, will enable the user to reconstruct the full pose of the relevant sensor in the world coordinate frame.
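As a minimal sketch (assuming a (w, x, y, z) component ordering, which should be checked against the serialized Pose record), the quaternion-to-matrix conversion can be done directly in numpy:

import numpy as np

def quaternion_to_matrix(qw, qx, qy, qz):
    # Standard unit-quaternion to 3x3 rotation matrix conversion
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])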
Each Datum also provides links to the previous and next datums in the sequence, which is useful for iterating over the data in sequence order.
The SampleCalibration object stores intrinsic and extrinsic calibration information for each sensor active in the Scene. Generally this calibration data is invariant over a sequence, so only a single calibration file will be present per sequence. However, it is possible for sequences to contain calibration data that shifts over time, in which case multiple files will be present.
The extrinsic calibration information is stored inside a Pose object, containing the translation and rotation of the sensor relative to the vehicle coordinate frame. When combined with the pose information stored per sample, this data can be used to reconstruct the full sensor pose in the world coordinate frame.
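The composition itself is a matrix product. A minimal sketch using 4x4 homogeneous transforms (the translation and quaternion values below are placeholders, and scipy is one convenient way to handle the quaternion conversion):

import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(translation, quaternion_wxyz):
    # Build a 4x4 homogeneous transform from a translation and a quaternion
    w, x, y, z = quaternion_wxyz
    t = np.eye(4)
    t[:3, :3] = Rotation.from_quat([x, y, z, w]).as_matrix()  # scipy expects scalar-last
    t[:3, 3] = translation
    return t

# world_T_ego comes from the Datum pose; ego_T_sensor from the calibration extrinsics
world_T_ego = pose_to_matrix([10.0, 2.0, 0.0], [1.0, 0.0, 0.0, 0.0])
ego_T_sensor = pose_to_matrix([1.5, 0.0, 1.2], [1.0, 0.0, 0.0, 0.0])
world_T_sensor = world_T_ego @ ego_T_sensor  # full sensor pose in the world frame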
The Image object stores a reference to the file path of the image data on disk (currently all images are stored in PNG format). The Image object also stores file references to various annotation data objects that are attached to the image. We support a variety of annotation types within our system, and the current sample dataset contains the following annotations for each image:
- BOUNDING_BOX_2D - 2d screen space bounding boxes for all instanced objects, including object class, instance id, visibility and truncation information.
- BOUNDING_BOX_3D - 3d camera space bounding boxes for all instanced objects, including object class, instance id, visibility and truncation information.
- SEMANTIC_SEGMENTATION_2D - Semantic segmentation labels for each pixel in the image. The ontology definition is included within the Dataset.
- INSTANCE_SEGMENTATION_2D - Instance segmentation labels for each instanced object class in the image. The current PD sample dataset includes instance labels for all vehicle and pedestrian objects.
- DEPTH - View space depth data for all pixels in the image. The depth data is stored as a compressed numpy array of 32 bit floating point values for easy consumption in Python.
- NORMALS - World space surface normal data for each pixel in the image. The normals are encoded into a 32 bit PNG image.
- MOTION_VECTORS - Screen space motion vectors (optical flow) for each pixel in the image. Motion vectors are encoded into a 32 bit PNG image.
Each annotation type will be discussed in more detail below.
The data within a DGP dataset is organized into a file tree for easy consumption by both humans and machines. The root folder contains the dataset definition file, along with a folder containing preview videos for each scene, followed by a single folder for each scene in the dataset.
Within a scene folder, there are a variety of folders for each data type, including image data, calibration data and various forms of annotation data, with each subfolder containing a folder for each sensor in the scene that is relevant to the specific data type.
This results in a file structure as shown below:
--> <scene name>
    --+ <scene preview videos>
    --> rgb
        --> <sensor 1 name>
            --+ <rgb camera image data>
        --> <sensor 2 name>
    --> <sensor calibration data>
    --> <annotation type>
        --> <sensor 1 name>
            --+ <annotation specific data>
        --> <sensor 2 name>
Working with DGP Data
It is generally recommended to use the dataset.json and scene.json files to index the data contained within a DGP dataset rather than iterating over the file tree directly.
To iterate over the data within a specific scene, first parse the scene.json file for the scene of interest using the protobuf interfaces produced for the language of your choice. The top level object within the scene.json file will be of type Scene.
To begin iterating over the data, a user should first look to the samples member of the Scene object, which will contain an ordered list of all Samples within the Scene. Each Sample provides a list of keys to Datum objects containing the data for each sensor in the Scene (there will be one Datum key per sensor). Additionally, the Sample also provides a reference key to the calibration data for this Sample in the "calibration_key" member. This key can be used to index the data in the calibration folder, which provides intrinsic and extrinsic data for each sensor in the scene.
Using the Datum keys within each Sample, the user can then index the individual per-sensor Datum objects containing the relevant image data and annotation data of interest.
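In Python, this flow looks roughly as follows. Note this is a sketch: the generated module path and the field names (samples, datum_keys, calibration_key) are assumptions based on the description above, and should be checked against the attached schema files:

from google.protobuf import json_format
from dgp.proto.scene_pb2 import Scene  # assumed module path for generated bindings

with open("scene_01/scene.json") as f:  # hypothetical scene folder name
    scene = json_format.Parse(f.read(), Scene())

for sample in scene.samples:
    calibration = sample.calibration_key  # indexes calibration/<key name>.json
    for datum_key in sample.datum_keys:   # one key per active sensor
        pass  # look up the Datum for this sensor, then load its image and annotations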
Currently, the top level dataset.json file only provides information that is common to the entire dataset, predominantly the ontology used to define the semantic and instance labels. The Dataset object will shortly be expanded to index the Scenes within the Dataset; until then, the user should iterate over Scenes by walking the scene folders in the filesystem.
Image & Annotation Data Details
In this section, we will provide details on interpreting each specific data and annotation type.
RGB Image Data
RGB images are stored in the rgb folder, and represent fully processed camera data frames. Images are stored as 32 bit PNG files, and should be easily loadable with any image processing package.
2d bounding box data is stored in a single json file per frame of camera data in the bounding_box_2d folder. The json file contains a list of bounding box records, with a single bounding box for each object instance present in the frame. The box member of each annotation record provides the screen space bounds for the annotation, with additional members providing details on the object class id, instance id and vehicle type (if the object is a vehicle).
3d bounding box data is stored in a single json file per frame of camera data in the bounding_box_3d folder. As above, the json file contains a list of bounding box records. In this case, each record defines the pose of the 3d bounding box in the view space of the relevant camera. This pose information defines the translation and orientation of the bounding box, along with information on the width/height/length of the box, occlusion/truncation information and the relevant class/instance ids of the object.
Note that for DGP camera sensors, view space is defined as x-right, y-down and z increasing into the frame, and for DGP vehicle instances, their local coordinate frame is defined as x-forward, y-left, z-up.
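As a sketch of reading these records (the file path and the field names "annotations", "box", "class_id", and "instance_id" are illustrative assumptions; consult the attached schema for the authoritative layout):

import json

with open("bounding_box_2d/camera_01/000000.json") as f:  # hypothetical path
    records = json.load(f)

for record in records["annotations"]:
    # box holds the screen space bounds; class/instance ids identify the object
    print(record["class_id"], record["instance_id"], record["box"])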
Semantic segmentation labels are provided for each pixel in each camera image. Semantic labels are stored as 32 bit, 4 channel PNG files, stored in the segmentation folder. The semantic labels themselves are unsigned 16 bit integers, with the red channel storing the low byte and the green channel storing the high byte of the label. Details on the semantic labels are provided within the ontology definition contained in the Dataset object.
Instance segmentation labels are provided for each pixel in each camera image occupied by an instanced object class. As for semantic labels, instance labels are stored as 32 bit, 4 channel PNG files, stored in the instance folder. The instance labels are stored as unsigned 16 bit integers, with the red channel storing the low byte and the green channel storing the high byte of the label. This limits the number of instances in a scene to a maximum of 65,535.
As for semantic labels, please refer to the ontology contained within the Dataset for definitions of which specific object classes are instanced within the dataset. For our sample set, instances are limited to vehicles and pedestrians.
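Decoding the packed 16 bit labels is identical for semantic and instance images. A minimal sketch using Pillow and numpy (the file path is hypothetical):

import numpy as np
from PIL import Image

def load_labels(path):
    # red channel holds the low byte, green channel the high byte
    rgba = np.asarray(Image.open(path), dtype=np.uint16)
    return rgba[..., 0] + (rgba[..., 1] << 8)

labels = load_labels("segmentation/camera_01/000000.png")  # or the instance folder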
Depth data is stored as a compressed 2d numpy array of floating point values in the depth folder. The depth data is stored in meters, and the specific value stored is the "camera depth", which is equivalent to the z coordinate of an object projected into the view space of the camera. Note that this is different from the distance to the camera's center of projection for all pixels except the center pixel of the image.
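Assuming the arrays were written with numpy's compressed .npz format (the file extension and path below are assumptions), loading a depth map is straightforward:

import numpy as np

with np.load("depth/camera_01/000000.npz") as archive:  # hypothetical path
    depth = archive[archive.files[0]]  # 2d float32 array of depths in meters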
Surface normal information is encoded in 32 bit, 4 channel PNG files for easy access and stored in the normals folder. The information represents the surface normal direction in world space of each pixel in the image. To convert from 32 bit pixel data to normal vectors, use the following transformation (for pixel values r/g/b in the range [0,255]):
x = (r/255.0)*2.0 - 1.0
y = (g/255.0)*2.0 - 1.0
z = (b/255.0)*2.0 - 1.0
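A vectorized version of this decode, as a sketch (hypothetical file path, loading via Pillow):

import numpy as np
from PIL import Image

def load_normals(path):
    # Map each r/g/b channel from [0, 255] to a normal component in [-1, 1]
    rgba = np.asarray(Image.open(path), dtype=np.float32)
    return rgba[..., :3] / 255.0 * 2.0 - 1.0

normals = load_normals("normals/camera_01/000000.png")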
Motion vectors represent the screen space motion of every pixel in the image. The motion vectors are stored as two 16 bit integer values encoded into a single 32 bit, 4 channel PNG image, stored in the motion_vectors folder. To decode the motion vector (dx, dy) for a specific pixel in the image (for pixel values r/g/b/a in the range [0,255]), use the following transformation:
dx_i = r + g*256
dy_i = b + a*256
dx = ((dx_i / 65535.0)*2.0 - 1.0)*width
dy = ((dy_i / 65535.0)*2.0 - 1.0)*height
Given a motion vector (dx, dy) for a pixel (i, j) in an image at frame n, the same world space point will be visible at frame n+1 at pixel (i+dx, j+dy).
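Putting the decode together as a vectorized sketch (hypothetical file path, loading via Pillow):

import numpy as np
from PIL import Image

def load_motion_vectors(path):
    rgba = np.asarray(Image.open(path), dtype=np.uint16)
    height, width = rgba.shape[:2]
    dx_i = rgba[..., 0] + rgba[..., 1] * 256  # r + g*256
    dy_i = rgba[..., 2] + rgba[..., 3] * 256  # b + a*256
    dx = (dx_i / 65535.0 * 2.0 - 1.0) * width
    dy = (dy_i / 65535.0 * 2.0 - 1.0) * height
    return dx, dy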
1. Google protobuf documentation: https://developers.google.com/protocol-buffers
2. DGP protobuf schema definition: [see attached file: dgp_proto.zip]