Underfolder¶

By default, pipelime reads and writes data in the Underfolder format, a flexible file-system based dataset storage format. The main benefits of using the underfolder format are:

Flexibility: one single format for pretty much all the datasets that you will ever use.
Readability: no need to use special viewers to access the dataset, all data is stored as separate files with common extensions, making it easy to manually inspect.
Low disk usage: pipelime takes advantage of hard-links and item sharing to minimize disk usage.

An Underfolder dataset is a collection of samples. A sample is a collection of items. An item is a unitary block of data, i.e., a multi-channel image, a python object, a dictionary and more. The file system structure is here summarized:

underfolder structure

Any valid underfolder dataset must contain a subfolder named data with samples and items. Also, global shared items can be stored in the root folder. All items are stored as separate files, with a specific naming rule:

naming convention

Where:

$ID is the sample index, must be a unique integer for each sample.
ITEM is the item name.
EXT is the item extension.

We currently support many common file formats and others can be added by users:

.png, .jpeg/.jpg/.jfif/.jpe, .bmp for images
.tiff/.tif for multi-page images and multi-dimensional numpy arrays
.yaml/.yml, .json and .toml/.tml for metadata
.txt for numpy 2D matrix notation
.npy for general numpy arrays
.pkl/.pickle for picklable python objects
.bin for generic binary data

Root files follow the same convention but they lack the sample identifier part, i.e., $ITEM.$EXT