# Common Tasks

A quick reference to common tasks that can be performed with the pipelime CLI.

## Get Help

How to get help and debug your configuration.

| Command | Description |
| ---- | ---- |
| `pipelime` | Base command help |
| `pipelime lc/lo/lst/` | List built-in commands/operators/stages |
| `pipelime -m module.path lc/lo/lst/` | List commands/operators/stages defined in a Python package or module (a `.py` file works as well) |
| `pipelime [-m module.path] help ` | Show help for a command, an operation, etc., possibly from an external module |
| `pipelime audit -c cfg.yaml ...` | Inspect the input configuration and context, showing, e.g., imports, variables and symbols, as well as configuration errors such as missing definitions. |
| `pipelime -dv ...` | `-v` prints the configuration and the context as loaded from files and overridden by the CLI, while `-d` **skips the execution** (dry-run). |

## Short Options

Most complex command options can be written in a compact form, as specified in every command help. Here is a general reference:

| Option Type | Usual Compact Form | Corresponding Extended Definition | Notes |
| ---- | ---- | ---- | ---- |
| Input Dataset | `+i [,]` | `++input.folder [++input.skip_empty ]` | The optional second value is a flag to skip missing samples. |
| Output Dataset | `+o [,[,]]` | `++output.folder [++output.exists_ok [++output.serialization.override.DEEP_COPY null]]` | When the optional third value is true, the output dataset contains neither remote references nor hard/soft links. |
| Multiprocessing | `+g [,]` | `++grabber.num_workers [++grabber.prefetch ]` | Both values should be positive integers. |
| Dataset Splits | `+s [,]`<br>`+s [,]` | `++splits.fraction [++splits.output.folder ]`<br>`++splits.length [++splits.output.folder ]` | The fraction must be between 0 and 1, while the length is a positive integer. One split may have `null` length to get all remaining samples. |

## Data Processing

Common operations on datasets. To get the most out of any command, check its help with `pipelime help `.

### Copy, Zip, Concatenate, Sort

| Description | Command |
| ---- | ---- |
| Deep copy (no links, no remotes) | `pipelime clone +i +o ,false,true` |
| Reset indexes (remove missing samples) | `pipelime clone +i ,true +o ` |
| Dataset concatenation | `pipelime cat +o +i +i +i ...` |
| Sample zipping (items are merged) | `pipelime zip +o +i +i +i ...` |
| Sort by classification score [^cscore] | `pipelime sort +i +o +k metadata.classification.score` |

[^cscore]: Here we assume a metadata item such as

    ```yaml
    classification:
      score: 0.9
      ...
    ...
    ```

### Data Manipulation

| Description | Command |
| ---- | ---- |
| Change the names of the item keys | `pipelime map +i +o +s.remap-key.remap. +s.remap-key.remap. ... +s.remap-key.remove_missing false` |
| Format the names of the item keys | `pipelime map +i +o +s.format-key.key_format prefix_*_suffix` |
| Change image type to jpeg and metadata to toml | `pipelime map +i +o +s.replace-item.key_item_map.image JpegImageItem +s.replace-item.key_item_map.metadata TomlMetadataItem` |
| Leave only the image item in the dataset | `pipelime map +i +o +s.filter-keys.key_list image` |
| Remove a list of keys from every sample | `pipelime map +i +o +s.filter-keys.key_list +s.filter-keys.key_list ... +s.filter-keys.negate` |
| Apply a user-callable to each sample [^call] | `pipelime map +i +o +s.lambda.func $symbol(\"user.py:my_callable\")` |
| Apply random augmentation on repetitions of the dataset [^alb] | `pipelime pipe +i +o +op.repeat.count +op.map.stage.albumentations.transform transformation.yaml +op.map.stage.albumentations.keys_to_targets.image image` |
| Shuffle, but keep a reference to the original index [^pipe] | `pipelime pipe +i +o +op[0] enumerate +op[1] shuffle` |

However, when operations become too complex, it's easier to write a configuration file. The following examples can be run with `pipelime exec -c `.

_Change the names of the item keys_

```yaml
map:
  input:
  output:
  stage:
    remap-key:
      remap:
        ...
      remove_missing: false
```

_Change image type to jpeg and metadata to toml_

```yaml
map:
  input:
  output:
  stage:
    replace-item:
      key_item_map:
        image: JpegImageItem
        metadata: TomlMetadataItem
```

_Remove a list of keys from every sample_

```yaml
map:
  input:
  output:
  stage:
    filter-keys:
      key_list:
        - ...
      negate: true
```

_Apply a user-callable to each sample [^call]_

```yaml
map:
  input:
  output:
  stage:
    lambda:
      func: $symbol("user.py:my_callable")
```

_Apply random augmentation on repetitions of the dataset [^alb]_

```yaml
pipe:
  input:
  output:
  operations:
    repeat:
      count:
    map:
      stage:
        albumentations:
          transform: transformation.yaml
          keys_to_targets:
            image: image
```

_Shuffle, but keep a reference to the original index [^pipe]_

```yaml
pipe:
  input:
  output:
  operations:
    - enumerate
    - shuffle
```

[^call]: If you want to run on multiple processes, the callable must be picklable and referenced through a Python class path, e.g. `my_module.my_callable`.

[^alb]: The `transformation.yaml` file must be a valid [Albumentations pipeline](https://albumentations.ai/docs/examples/serialization/#serializing-an-augmentation-pipeline-to-a-json-or-yaml-file).

[^pipe]: The original index is stored in the `~idx` key as `TxtNumpyItem`.

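
To make the effect of the `enumerate` then `shuffle` pipeline above more concrete, here is a minimal plain-Python sketch. It is only an illustration of the idea, not pipelime's implementation: samples are modeled as plain dictionaries, the helper names `enumerate_samples` and `shuffle_samples` are hypothetical, and the index is stored as a bare integer rather than a `TxtNumpyItem`.

```python
import random


def enumerate_samples(samples, idx_key="~idx"):
    """Return copies of the samples with their position stored under idx_key."""
    return [{**sample, idx_key: i} for i, sample in enumerate(samples)]


def shuffle_samples(samples, seed=None):
    """Return a shuffled copy of the sample list, leaving the input untouched."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return shuffled


# A toy "dataset": five samples, each with a single (hypothetical) image item.
dataset = [{"image": f"img_{i}.png"} for i in range(5)]

# Enumerate first, then shuffle: the order changes, the stored index does not.
shuffled = shuffle_samples(enumerate_samples(dataset), seed=0)

for sample in shuffled:
    # The stored index still points back to the original position.
    assert dataset[sample["~idx"]]["image"] == sample["image"]
```

The order of the two steps matters: enumerating after shuffling would record the new positions instead of the original ones.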
### Split

| Description | Command |
| ---- | ---- |
| Train/test/val splits | `pipelime split +i +s 0.8,train +s 0.1,test +s null,val` |
| Dataset shuffling, subsampling (1 every 3) and reduction (half of the length) | `pipelime split +i +s 0.5, +shf +ss 3` |
| Split by (a boolean) query | `pipelime split-query +i +q +os +od ` |
| Split by value (a new dataset for each value of a given item) | `pipelime split-value +i +k +o ` |

### Piper

Here we assume the context file is named `context*.[yaml|yml|json]` and placed in the same folder as the configuration file.

| Description | Command |
| ---- | ---- |
| Run a dag from config and context files | `pipelime run -c ` |
| Run only a subset of nodes (must be included and not excluded) | `pipelime run -c +i node_1 +i node_2 ... +e node_2 +e node_3 ...` |
| Show a dag (needs `Graphviz`, see [installation instructions](../get_started/installation.md)) | `pipelime draw -c ` |
| Show a dag using the [Mermaid](https://mermaid-js.github.io/mermaid/) backend | `pipelime draw -c +b mermaid` |
| Show a dag with (`+c`) full command names and (`+m`) limited data names' width (any backend) | `pipelime draw -c +c +m 30` |
| Show a dag anonymizing paths | `pipelime draw -c +m "/" +ep start` |
| Show a dag at high resolution (`Graphviz` only, see [installation instructions](../get_started/installation.md)) | `pipelime draw -c +x.G dpi=300` |
| Save a dag to png (any backend) | `pipelime draw -c +o dag.png` |
| Save a dag to svg or pdf (`Graphviz` only, see [installation instructions](../get_started/installation.md)) | `pipelime draw -c +o [dag.svg, dag.pdf]` |
| Save a dag to markdown ([Mermaid](https://mermaid-js.github.io/mermaid/) only) | `pipelime draw -c +o dag.md +b mermaid` |

## Utilities

| Description | Command |
| ---- | ---- |
| Generate a toy dataset of a given length | `pipelime toy_dataset +o +t.length ` |
| Generate a toy dataset of a given length with 64x64 images | `pipelime toy_dataset +o +t.length +t.image_size 64` |
| Generate a toy dataset of 10 elements, then shuffle it | `pipelime pipe +o +op[0]toy_dataset.length 10 +op[1] shuffle` |
| Measure the time to get an image out of a complex data pipeline [^timeit] | `pipelime timeit +i +o +op pipeline.yaml` |
| Run a general shell command, e.g., `paste` two files together | `pipelime shell +c "paste {f0} {f1} > {fout}" +i.f0 +i.f1 +o.fout ` |

[^timeit]: Note that:

    1. the output dataset is optional
    2. the `pipeline.yaml` file contains the usual `mapping` or `list of mappings` defining the pipeline in the `pipe` command, e.g.:

        ```yaml
        - enumerate
        - shuffle
        ...
        ```

## Schema Validation

First, generate a new schema from a dataset:

```bash
$ pipelime validate +i 
```

Then, copy-paste the output yaml into your configuration file.

## Remote Data Lakes

| Description | Command |
| ---- | ---- |
| Upload to an S3 remote bucket (user and password) | `pipelime remote-add +i +o +r s3://user:password@host:port/bucket` |
| Upload to an S3 remote bucket (using aws config files) | `pipelime remote-add +i +o +r s3://host:port/bucket` |
| Upload to a (shared/mounted) folder remote (linux) | `pipelime remote-add +i +o +r file://localhost/path/to/folder` |
| Upload to a (shared/mounted) folder remote (windows) | `pipelime remote-add +i +o +r file://localhost/x:/path/to/folder` |
| Upload only a set of item keys | `pipelime remote-add +i +o +r s3://host:port/bucket +k key_1 +k key_2 ...` |
| Upload only the last samples | `pipelime remote-add +i +o +r s3://host:port/bucket +start ` |
| Upload a slice of the samples (start, stop and step) | `pipelime remote-add +i +o +r s3://host:port/bucket +start +stop +step ` |
| Remove a remote reference from a dataset. If no other source is available, items are downloaded. The remote data lake is not touched. | `pipelime remote-remove +i +o +r s3://host:port/bucket` |
| Remove a remote reference only from a set of item keys. | `pipelime remote-remove +i +o +r s3://host:port/bucket +k key_1 +k key_2 ...` |
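
The remote addresses in the table above follow the usual URL layout. As a rough illustration only, the sketch below shows how such addresses decompose using Python's standard `urllib.parse`. This is not pipelime's own parser, which may behave differently; the helper `describe_remote` is hypothetical, and the port `9000` is an arbitrary stand-in for the `port` placeholder.

```python
from urllib.parse import urlsplit


def describe_remote(url):
    """Split a remote address into its URL components (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # e.g. "s3" or "file"
        "user": parts.username,      # None when no credentials are given
        "password": parts.password,  # None when no credentials are given
        "host": parts.hostname,
        "port": parts.port,          # None when no port is given
        "path": parts.path,          # bucket or folder, with a leading "/"
    }


# S3 remote with explicit credentials (9000 is a made-up port).
s3 = describe_remote("s3://user:password@host:9000/bucket")

# Shared folder remote written in the Windows form used in the table.
shared = describe_remote("file://localhost/x:/path/to/folder")

print(s3)
print(shared)
```

With this standard-library parser, the Windows drive letter ends up in the path component rather than in the host, which matches the `file://localhost/x:/path/to/folder` form shown in the table.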