Main Entities

All of Pipelime's functionality is built on top of three basic concepts:

  • Items

  • Samples

  • Samples Sequences

Items

An item is a container for a single, generic data unit. It can hold whatever type of data you need and automatically handles the following for you:

  • Validation

  • Serialization

  • File or Remote I/O

  • Caching

Currently, Pipelime supports the following item types:

  • Images, supporting some of the most commonly used formats: BmpImageItem, PngImageItem, JpegImageItem, TiffImageItem

  • Structured metadata: JsonMetadataItem, YamlMetadataItem, TomlMetadataItem

  • NumPy tensors: NpyNumpyItem, TxtNumpyItem

  • Generic pickle-encoded Python objects: PickleItem

  • Non-structured binary data: BinaryItem

Note that TiffImageItem can in fact handle any kind of multi-dimensional NumPy array, not only images. We plan to extend the list of supported item types and formats in the future; in the meantime, you are free to create and register your own items.
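To make the four responsibilities listed above more concrete, here is a minimal, standard-library-only sketch of what an item-like container might look like. ToyJsonItem, its constructor arguments, and its serialize method are hypothetical names chosen for illustration; they are not Pipelime's actual classes or API, and real items are likely to differ in the details.

```python
import json
import tempfile
from pathlib import Path


class ToyJsonItem:
    """Hypothetical item-like container; NOT Pipelime's real item class."""

    def __init__(self, value=None, *, path=None):
        # An item is created either from an in-memory value or from a file path.
        self._path = Path(path) if path is not None else None
        self._cached = path is None
        self._value = self._validate(value) if self._cached else None
        self.disk_reads = 0  # counts how many times the file was actually read

    @staticmethod
    def _validate(value):
        # Validation: json.dumps raises TypeError for values JSON cannot represent.
        json.dumps(value)
        return value

    def __call__(self):
        # File I/O + caching: read from disk only on first access.
        if not self._cached:
            self._value = self._validate(json.loads(self._path.read_text()))
            self._cached = True
            self.disk_reads += 1
        return self._value

    def serialize(self, path):
        # Serialization: write the wrapped value to disk as JSON.
        Path(path).write_text(json.dumps(self()))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "label.json"
    ToyJsonItem({"label": "cat"}).serialize(target)

    lazy = ToyJsonItem(path=target)  # nothing is read yet
    first = lazy()                   # triggers the only disk read
    second = lazy()                  # served from the cache

# A set is not JSON-serializable, so validation rejects it.
rejected = False
try:
    ToyJsonItem({"bad": {1, 2}})
except TypeError:
    rejected = True
```

The point of the sketch is the division of labor: the caller only asks for the value, while the container decides when to touch the disk and whether the data is acceptable.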

Samples

Usually, a dataset comprises multiple types of items for each observation. For example, consider a visual segmentation dataset with RGB images, ground-truth binary masks and classification labels. When you access an RGB image, you may also need its corresponding binary mask or classification label, so it makes sense to treat that triplet as a single entity containing the three items. We call this entity a Sample.

Samples are collections of items. They behave like a Python dictionary, mapping string keys to their corresponding items. Besides the plain mapping methods, they provide utilities for, e.g.:

  • Validation

  • Deep access (in pydash fashion)

  • Key manipulation (change / rename / duplicate)
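The dictionary-like behavior, deep access, and key manipulation described above can be sketched with a small toy class. ToySample and its methods deep_get, rename_key and duplicate_key are hypothetical names used only for this illustration; they are assumptions, not a statement of Pipelime's Sample API. Plain strings stand in for items to keep the example self-contained.

```python
from collections import UserDict


class ToySample(UserDict):
    """Hypothetical dict-like sample; NOT Pipelime's real Sample class."""

    def deep_get(self, path, default=None):
        # Walk a dotted path such as "metadata.label.name" through
        # nested dictionaries and lists (pydash-style deep access).
        current = self.data
        for part in path.split("."):
            if isinstance(current, dict) and part in current:
                current = current[part]
            elif (
                isinstance(current, list)
                and part.isdigit()
                and int(part) < len(current)
            ):
                current = current[int(part)]
            else:
                return default
        return current

    def rename_key(self, old, new):
        # Return a new sample; the original is left untouched.
        return ToySample({new if k == old else k: v for k, v in self.data.items()})

    def duplicate_key(self, src, dst):
        # Return a new sample where dst refers to the same item as src.
        copy = ToySample(self.data)
        copy[dst] = self.data[src]
        return copy


sample = ToySample(
    {
        "image": "rgb.png",
        "mask": "mask.png",
        "metadata": {"label": {"name": "cat", "id": 3}},
    }
)

label_name = sample.deep_get("metadata.label.name")
missing = sample.deep_get("metadata.missing", "n/a")
renamed = sample.rename_key("mask", "segmentation")
duplicated = sample.duplicate_key("image", "image_copy")
```

In this sketch the manipulation methods return new samples instead of mutating in place, which keeps the original observation intact when several processing steps share it. Whether a real library makes the same choice is a design decision to check in its documentation.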

Samples Sequences

A sample sequence is the entity representing a full dataset, consisting of an ordered sequence of samples. It behaves like a Python list, with additional utility methods for, e.g.:

  • Validation

  • Disk or Remote I/O

  • Manipulation

  • Data pipelining
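As a rough illustration of list-like behavior combined with data pipelining, the sketch below chains per-sample processing stages that are applied lazily, only when a sample is accessed. ToySequence and its map method are hypothetical names invented for this example; they are not Pipelime's actual sequence class or methods, and plain dictionaries stand in for samples.

```python
from collections.abc import Sequence


class ToySequence(Sequence):
    """Hypothetical list-like dataset; NOT Pipelime's real sequence class."""

    def __init__(self, samples, stages=()):
        self._samples = list(samples)
        self._stages = tuple(stages)  # functions applied on access, in order

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns a new sequence that keeps the same stages.
            return ToySequence(self._samples[index], self._stages)
        sample = self._samples[index]
        for stage in self._stages:
            sample = stage(sample)
        return sample

    def map(self, stage):
        # Pipelining: return a new sequence with one more stage appended.
        return ToySequence(self._samples, self._stages + (stage,))


dataset = ToySequence(
    {"index": i, "label": "cat" if i % 2 else "dog"} for i in range(4)
)

pipeline = dataset.map(
    lambda s: {**s, "label": s["label"].upper()}
).map(
    lambda s: {**s, "is_cat": s["label"] == "CAT"}
)

second_sample = pipeline[1]
window = pipeline[1:3]
all_labels = [s["label"] for s in pipeline]
```

Because each stage builds a new dictionary, the source dataset is never modified, and stages can be stacked without copying the underlying samples up front.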