# Convert Data To Underfolder

The [Underfolder](../get_started/underfolder.md) format is a general data representation introduced by Pipelime, so you will often need to convert data coming from other sources to the Underfolder format.
In this tutorial, you will write a samples sequence generator to convert the `iris` dataset and you will learn how to call it from the command line.

## The Iris Dataset

The [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) is a classic dataset comprising 3 species of Iris with 50 instances each.
For each instance, 4 features are provided: sepal length, sepal width, petal length, petal width.

The scikit-learn package provides a function to load the iris dataset as a dictionary-like `Bunch` object:

```python
from sklearn.datasets import load_iris

dataset = load_iris()
```

The output mapping contains, among others, the following keys:
* `data`: a numpy array of shape `(n_samples, n_features)` containing the features
* `target`: a numpy array of shape `(n_samples,)` containing the labels
* `feature_names`: a list of strings containing the feature names
* `target_names`: an array of strings containing the label names, i.e., _setosa_, _versicolor_, _virginica_

Mapping such keys to the Underfolder format is straightforward:
* `data`: each row of the array is a sample
* `target`: each value is the label ID of a sample
* `feature_names`: the names of the features can be saved once for all samples, so we will put them into a shared item
* `target_names`: the names of the labels can be saved once for all samples, so we will put them into a shared item

## A Samples Sequence Generator

To write a samples sequence generator for the Iris dataset, we need to:
1. subclass `SamplesSequence` with the `source_sequence` decorator
1. give it a _title_ which will be the name of the method to call (see below)
1. implement the methods `def size(self) -> int` and `def get_sample(self, idx: int) -> Sample`

Note that `SamplesSequence` is a pydantic model, so we must follow the [pydantic rules](https://docs.pydantic.dev/) when defining a class:

```python
from pydantic import Field, PrivateAttr
import numpy as np
from sklearn.datasets import load_iris

from pipelime.sequences import SamplesSequence, Sample, source_sequence
import pipelime.items as pli


@source_sequence
class IrisDataset(SamplesSequence, title="iris"):
    """Samples sequence generator for the Iris dataset."""

    shared_items: bool = Field(
        True, description="Whether to include the shared items in each sample."
    )

    _shared_sample: Sample = PrivateAttr()
    _data_mtx: np.ndarray = PrivateAttr()
    _target_mtx: np.ndarray = PrivateAttr()

    def __init__(self, **data):
        super().__init__(**data)

        dataset = load_iris()
        self._shared_sample = self._make_shared_sample(dataset)
        self._data_mtx = dataset["data"]
        self._target_mtx = dataset["target"]

    def _make_shared_sample(self, dataset) -> Sample:
        # A sample with shared items is created only once
        return Sample(
            # a mapping of item keys to items
            {
                # a YAML metadata item
                "names": pli.YamlMetadataItem(
                    # the content of the YAML
                    {
                        "features": dataset["feature_names"],
                        "targets": dataset["target_names"].tolist(),
                    },
                    # the item is shared
                    shared=True,
                )
            }
        )

    def size(self) -> int:
        return self._data_mtx.shape[0]

    def get_sample(self, idx: int) -> Sample:
        sample = Sample(
            # a mapping of item keys to items
            {
                # numpy items
                "data": pli.TxtNumpyItem(self._data_mtx[idx]),
                "target": pli.TxtNumpyItem(self._target_mtx[idx]),
            }
        )

        if self.shared_items:
            # the shared sample is merged with the current sample
            sample = self._shared_sample.merge(sample)

        return sample
```

## Iris To Underfolder (Python)

When the module is imported, the `iris` method is automatically added to the `SamplesSequence` class, so you can call it as follows:

```python
from pipelime.sequences import SamplesSequence
from pipelime.cli import pl_print

# create a samples sequence generator
iris_seq = SamplesSequence.iris()

# get the first sample
sample = iris_seq[0]

# print the sample
pl_print(sample)
```

To add a writing operation after the generator, just call the `to_underfolder` method:

```python
iris_seq = iris_seq.to_underfolder("iris_dataset")
```

where `"iris_dataset"` is the path to the Underfolder to create.
Nothing has been written to disk yet, since `to_underfolder` is an operation applied to each sample you _grab_ from the sequence.
To actually write the data, call `run` to iterate over the whole sequence:

```python
iris_seq.run()
```

Do you need more processing power? Let pipelime do the job for you:

```python
iris_seq.run(num_workers=4, prefetch=10)
```

To wrap up, here is the full code:

```python
from pipelime.sequences import SamplesSequence

iris_seq = SamplesSequence.iris().to_underfolder("iris_dataset")
iris_seq.run(num_workers=4, prefetch=10)
```

## Iris To Underfolder (Command Line)

When working with the command line, you can add the new `iris` generator to a `pipe` command to write the full Underfolder to disk.
First, to create a configuration file for the `pipe` command you need to know its arguments:

```bash
$ pipelime pipe help
```

What you get is a table full of information, so let's dig into it:

![](../images/pipe_help.png)

1. The title reports the name of the command and the full signature
1. The table body describes each argument of the command:
    * Fields: the name and its alias, if any
    * Type: the expected type of the argument
    * Piper Port: whether the argument is an input, an output or a parameter
    * Default: whether the argument has a default value or must be provided by the user
1. The footer shows the full class path of the command class

When using the command line, you must prefix each of its arguments with either `++` or `+`.
Instead, when writing a configuration file, those arguments become keys of a dictionary:

```yaml
op: ...
input: ...
output: ...
grabber: ...
```

Want to know more? Get more verbose help with `$ pipelime pipe help -v` and `$ pipelime pipe help -vv`.
However, things get cluttered quickly, since each argument has its own sub-option tree to set, so it's better to show only what we need.
For example, copy-paste the full class path of `output` from the command signature to the help command line:

```bash
$ pipelime help pipelime.commands.interfaces.OutputDatasetInterface
```

What you get is a table with the sub-options of the `output` argument, which can be added to the configuration file:

```yaml
op: ...
input: ...
output:
  folder: path/to/output/folder
grabber: ...
```

Now we put the `iris` generator in the operation chain and remove both the `input` and `grabber` arguments, since they are not needed:

```yaml
op:
  - iris
output:
  folder: path/to/output/folder
```

Finally, run the command:

```bash
$ pipelime -m path/to/iris_module.py pipe -c iris.yaml
```

Here, `path/to/iris_module.py` is the file containing the `IrisDataset` class, while `iris.yaml` is the configuration file.

Though this works, you may want to set the output folder on the command line, instead of hard-coding it in the configuration file.
To do so, you can use the `++` or `+` prefix and the [pydash notation](https://pydash.readthedocs.io/en/latest/deeppath.html) to set the value of a nested key:

```bash
$ pipelime -m path/to/iris_module.py pipe -c iris.yaml ++output.folder path/to/output/folder
```

```{hint}
The `output` argument has a compact form `[,[,]]`, so you can also write `+o path/to/output/folder` or `+o path/to/output/folder,true` to overwrite an existing dataset.
```

```{tip}
You can use as many configuration files as you want: all options will be merged together.
```
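
To picture how a dotted key such as `output.folder` addresses a nested option, and how several configuration files can end up as a single set of options, here is a minimal plain-Python sketch. The helpers `deep_merge` and `set_deep` are hypothetical names written only for this illustration: they are not part of Pipelime, and Pipelime's actual merging rules may differ in the details (for example, in how lists are combined).

```python
from copy import deepcopy
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], update: Mapping[str, Any]) -> dict:
    """Return a new dict where `update` is recursively merged over `base`."""
    merged = deepcopy(dict(base))
    for key, value in update.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # both sides are mappings: merge them key by key
            merged[key] = deep_merge(merged[key], value)
        else:
            # otherwise the later value wins
            merged[key] = deepcopy(value)
    return merged


def set_deep(config: dict, dotted_key: str, value: Any) -> dict:
    """Set `value` at a dot-separated path, creating nested dicts as needed."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Two hypothetical configuration files, already parsed into dictionaries.
base_cfg = {"op": ["iris"]}
extra_cfg = {"output": {"folder": "path/to/output/folder"}}

# All options are merged together...
config = deep_merge(base_cfg, extra_cfg)
# ...and a dotted key overrides a nested value, similar in spirit to
# passing `++output.folder another/folder` on the command line.
set_deep(config, "output.folder", "another/folder")
print(config)
```

Running the sketch prints `{'op': ['iris'], 'output': {'folder': 'another/folder'}}`: the two dictionaries are combined, and the dotted key replaces only the nested `folder` value while leaving `op` untouched.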