Common Tasks
A quick reference to common tasks that can be performed with the pipelime CLI.
Get Help
How to get help and debug your configuration.
| Command | Description |
|---|---|
| | Base command help |
| | List built-in commands/operators/stages |
| | List built-in commands/operators/stages defined in a python package or module ( |
| | Show help for a command, an operation, etc., possibly from an external module |
| | Inspect the input configuration and context, showing, e.g., imports, variables and symbols, as well as configuration errors such as missing definitions. |
Short Options
Most complex command options can be written in a compact form, as specified in the help of each command. Here is a general reference:
| Option Type | Usual Compact Form | Corresponding Extended Definition | Notes |
|---|---|---|---|
| Input Dataset | | | |
| Output Dataset | | | when |
| Multiprocessing | | | Both values should be positive integers. |
| Dataset Splits | | | |
Data Processing
Common operations on datasets. To get the most out of any command, check its help with pipelime help <command>.
Copy, Zip, Concatenate, Sort
| Description | Command |
|---|---|
| Deep copy (no links, no remotes) | |
| Reset indexes (remove missing samples) | |
| Dataset concatenation | |
| Sample zipping (items are merged) | |
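The difference between concatenation and zipping in the table above can be sketched in plain Python. This is only an illustration: samples are modeled as dicts mapping item keys to values, and `concatenate` and `zip_samples` are hypothetical helper names, not pipelime functions.

```python
# Samples are modeled as plain dicts (an assumption for illustration;
# this is not pipelime's own sample class).
dataset_a = [{"image": "a0.png"}, {"image": "a1.png"}]
dataset_b = [{"label": 0}, {"label": 1}]


def concatenate(*datasets):
    """Chain datasets one after another: the lengths add up."""
    return [sample for dataset in datasets for sample in dataset]


def zip_samples(*datasets):
    """Merge the items of the samples that share the same position."""
    merged = []
    for samples in zip(*datasets):
        combined = {}
        for sample in samples:
            combined.update(sample)
        merged.append(combined)
    return merged


print(len(concatenate(dataset_a, dataset_b)))
print(zip_samples(dataset_a, dataset_b))
```

Concatenation yields more samples with the same items, while zipping yields the same number of samples with more items each.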
Sorting and Filtering
| Description | Command |
|---|---|
| Sort by classification score [1] | |
| | |
| Filter by dictquery match | |
| Filter according to a callable | |
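Conceptually, sorting needs a key function and filtering needs a predicate, both evaluated on each sample. The sketch below shows the idea in plain Python; the dict-based samples, the `score` item and the helper names are assumptions made for illustration, not part of the pipelime API.

```python
# Hypothetical samples carrying a classification score.
samples = [
    {"name": "cat", "score": 0.7},
    {"name": "dog", "score": 0.2},
    {"name": "fox", "score": 0.9},
]


def sort_by(samples, key_fn):
    """Return the samples ordered by the value computed by key_fn."""
    return sorted(samples, key=key_fn)


def filter_by(samples, predicate):
    """Keep only the samples for which predicate returns True."""
    return [sample for sample in samples if predicate(sample)]


by_score = sort_by(samples, lambda sample: sample["score"])
confident = filter_by(samples, lambda sample: sample["score"] > 0.5)
print([sample["name"] for sample in by_score])
print([sample["name"] for sample in confident])
```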
Data Manipulation
| Description | Command |
|---|---|
| Change the names of the item keys | |
| Format the names of the item keys | |
| Change image type to jpeg and metadata to toml | |
| Leave only the image item in the dataset | |
| Remove a list of keys from every sample | |
| Apply a user-callable to each sample [3] | |
| Apply random augmentation on | |
| Shuffle, but keep a reference to the original index [5] | |
When operations become too complex for the command line, it is easier to write a configuration file. The following examples can be run with pipelime exec -c <config.yaml>.
Change the names of the item keys
map:
  input: <input>
  output: <output>
  stage:
    remap-key:
      remap:
        <old_key_1>: <new_key_1>
        <old_key_2>: <new_key_2>
        ...
      remove_missing: false
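To make the effect of the configuration above concrete, here is a minimal plain-Python sketch of key remapping. It assumes that `remove_missing` controls whether keys absent from the `remap` mapping are dropped from the output sample; check `pipelime help remap-key` for the authoritative behavior. The `remap_keys` helper and the dict-based sample are hypothetical.

```python
def remap_keys(sample, remap, remove_missing=False):
    """Rename item keys according to remap.

    Assumption: when remove_missing is True, keys that do not appear
    in remap are dropped; otherwise they are kept unchanged.
    """
    result = {}
    for key, value in sample.items():
        if key in remap:
            result[remap[key]] = value
        elif not remove_missing:
            result[key] = value
    return result


sample = {"image": "0.png", "meta": {"id": 0}, "mask": "0_mask.png"}
remap = {"image": "rgb", "meta": "metadata"}

kept = remap_keys(sample, remap, remove_missing=False)
dropped = remap_keys(sample, remap, remove_missing=True)
print(sorted(kept))
print(sorted(dropped))
```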
Change image type to jpeg and metadata to toml
map:
  input: <input>
  output: <output>
  stage:
    replace-item:
      key_item_map:
        image: JpegImageItem
        metadata: TomlMetadataItem
Remove a list of keys from every sample
map:
  input: <input>
  output: <output>
  stage:
    filter-keys:
      key_list:
        - <key_1>
        - <key_2>
        ...
      negate: true
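The role of `negate` in the configuration above can be sketched as follows: without it the listed keys are the ones kept, with it they are the ones removed. This is a plain-Python illustration on dict-based samples; `filter_keys` is a hypothetical helper, not the pipelime stage itself.

```python
def filter_keys(sample, key_list, negate=False):
    """Keep the keys in key_list, or remove them when negate is True."""
    keys = set(key_list)
    return {
        key: value
        for key, value in sample.items()
        if (key in keys) != negate
    }


sample = {"image": "0.png", "mask": "0_mask.png", "metadata": {"id": 0}}

only_image = filter_keys(sample, ["image"])
without_mask = filter_keys(sample, ["mask"], negate=True)
print(only_image)
print(without_mask)
```

The same helper covers both "leave only the image item" (no negation) and "remove a list of keys" (negation).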
Apply a user-callable to each sample [3]
map:
  input: <input>
  output: <output>
  stage:
    lambda:
      func: $symbol("user.py:my_callable")
Apply random augmentation on
pipe:
  input: <input>
  output: <output>
  operations:
    repeat:
      count: <N>
    map:
      stage:
        albumentations:
          transform: transformation.yaml
          keys_to_targets:
            image: image
Shuffle, but keep a reference to the original index [5]
pipe:
  input: <input>
  output: <output>
  operations:
    - enumerate
    - shuffle
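The order of the two operations matters: the index must be attached before shuffling, otherwise the original position is lost. A minimal plain-Python sketch of the idea follows; the `original_index` key name and both helper functions are assumptions for illustration and do not reflect how pipelime names the item.

```python
import random


def enumerate_samples(samples, index_key="original_index"):
    """Attach the current position to every sample under index_key."""
    return [
        {**sample, index_key: position}
        for position, sample in enumerate(samples)
    ]


def shuffle_samples(samples, seed=None):
    """Return a shuffled copy, leaving the input list untouched."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


dataset = [{"name": f"sample_{i}"} for i in range(5)]
shuffled = shuffle_samples(enumerate_samples(dataset), seed=0)

# Every shuffled sample still points back to its source position.
for sample in shuffled:
    print(sample["name"], "came from index", sample["original_index"])
```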
Split
| Description | Command |
|---|---|
| Train/test/val splits | |
| Dataset shuffling, subsampling (1 every 3) and reduction (half of the length) | |
| Split by (a boolean) query | |
| Split by value (a new dataset for each value of a given item) | |
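The splitting strategies listed above can be illustrated in plain Python. This sketch assumes dict-based samples and fractional split sizes where the last split takes whatever remains; the helper names are hypothetical and the rounding rule may differ from what pipelime does.

```python
from collections import defaultdict


def split_by_fractions(samples, fractions):
    """Cut a dataset into consecutive chunks; the last one takes the rest."""
    splits, start = [], 0
    for position, fraction in enumerate(fractions):
        is_last = position == len(fractions) - 1
        stop = len(samples) if is_last else start + round(len(samples) * fraction)
        splits.append(samples[start:stop])
        start = stop
    return splits


def split_by_value(samples, key):
    """Build one dataset for each distinct value of the given item."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    return dict(groups)


samples = [{"id": i, "label": "even" if i % 2 == 0 else "odd"} for i in range(10)]

train, test, val = split_by_fractions(samples, [0.8, 0.1, 0.1])
every_third = samples[::3]                 # subsampling: 1 every 3
first_half = samples[: len(samples) // 2]  # reduction: half of the length
by_label = split_by_value(samples, "label")

print(len(train), len(test), len(val))
print([sample["id"] for sample in every_third])
print(sorted(by_label))
```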
Piper
Here we assume the context file is named context*.[yaml|yml|json] and placed in the same folder as the configuration file.
| Description | Command |
|---|---|
| Run a dag from config and context files | |
| Run only a subset of nodes (must be included and not excluded) | |
| Show a dag (needs | |
| Show a dag using the Mermaid backend | |
| Show a dag with ( | |
| Show a dag anonymizing paths | |
| Show a dag at high resolution ( | |
| Save a dag to png (any backend) | |
| Save a dag to svg or pdf ( | |
| Save a dag to markdown (Mermaid only) | |
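The node-subset rule stated in the table above (a node runs only if it is included and not excluded) amounts to a simple set check. The sketch below is a plain-Python illustration with hypothetical node names; it does not model how pipelime treats an empty include list.

```python
def select_nodes(nodes, include, exclude):
    """Keep the nodes that are included and, at the same time, not excluded."""
    include, exclude = set(include), set(exclude)
    return [node for node in nodes if node in include and node not in exclude]


# Hypothetical node names, used only for this example.
nodes = ["load", "augment", "train", "report"]
selected = select_nodes(
    nodes,
    include=["load", "augment", "train"],
    exclude=["augment"],
)
print(selected)
```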
Utilities
| Description | Command |
|---|---|
| Generate a toy dataset of | |
| Generate a toy dataset of | |
| Generate a toy dataset of | |
| Measure the time to get an image out of a complex data pipeline [6] | |
| Run a general shell command, e.g., | |
Schema Validation
First, generate a new schema from a dataset:
$ pipelime validate +i <input>
Then, copy the output YAML into your configuration file.
Remote Data Lakes
| Description | Command |
|---|---|
| Upload to an S3 remote bucket (user and password) | |
| Upload to an S3 remote bucket (using aws config files) | |
| Upload to a (shared/mounted) folder remote (Linux) | |
| Upload to a (shared/mounted) folder remote (Windows) | |
| Upload only a set of item keys | |
| Upload only the last | |
| Upload every | |
| Remove a remote reference from a dataset. If no other source is available, items are downloaded. The remote data lake is not touched. | |
| Remove a remote reference only from a set of item keys. | |