# Schema Validation

When you deploy a pipeline in production, you want to make sure that the input data is valid.
Pipelime commands provide a standard way to define such schema both for input and output sequences. For example, this might be a configuration file for the `clone` command:

```yaml
input:
  folder: input/path
  schema:                           # ☚ The "schema" argument of the "clone" command
    sample_schema:                  # ☚ How a sample should look like
      image:                        # ☚ The "image" item key
        class_path: ImageItem       # ☚ The expected item type (can be a base class)
        is_optional: false          # ☚ Whether to raise an error if the key is missing
        is_shared: false            # ☚ Whether the item should be shared
      label:
        class_path: TxtNumpyItem
        is_optional: true
        is_shared: false
      camera:
        class_path: MetaDataItem
        is_optional: true
        is_shared: true
    ignore_extra_keys: false        # ☚ Whether to raise an error if the sample has extra keys
    lazy: false                     # ☚ Whether to check each sample when it is accessed or just once when building the sequence
    max_samples: 0                  # ☚ If "lazy" is False, at most "max_samples" are checked
output: output/path
```

You usually don't have to write such schema by hand. Instead, if you have a dataset fulfilling a target schema, just call `pipelime validate` and copy-paste the output, possibly tweaking some values, for example:

```bash
$ pipelime validate +i input_dataset
```

```bash
>>>
...
📦 output_schema_def: yaml schema definition
input:
  schema:
    sample_schema:
      image:
        class_path: PngImageItem
        is_optional: false
        is_shared: false
      label:
        class_path: TxtNumpyItem
        is_optional: true
        is_shared: false
      camera:
        class_path: YamlMetaDataItem
        is_optional: true
        is_shared: true
    ignore_extra_keys: false
    lazy: true
    max_samples: 0
...
```

## Adding A Custom Validation Function

Though this simple definition may be enough for most cases, more complex validation schemes can be easily developed if you don't mind writing a bit of code. For example, you may want to check that the `image` item has 3 channels, so you write a function that raises an exception if the validation fails and returns the parsed item otherwise:

```python
from pipelime.items import ImageItem

def check_image_channels(image_item: ImageItem) -> ImageItem:
    image = image_item()
    if image.shape[2] != 3:
        raise ValueError(f'Image has {image.shape[2]} channels, but 3 are expected')
    return image_item
```

Then, you can add it to the schema definition:

```yaml
input:
  folder: input/path
  schema:
    sample_schema:
      image:
        class_path: ImageItem
        is_optional: false
        is_shared: false
        validator: class.path.to.check_image_channels
...
```

Where the `class.path.to.check_image_channels` may be a `path/to/script.py:check_image_channels` as well. Note that the other checks, i.e., type check, whether is optional, wheter is shared, are still performed before calling your custom validator.

## Custom Sample Validation

If you feel comfortable with [pydantic](https://pydantic-docs.helpmanual.io/) you can even write a custom sample validator as a pydantic model. To validate a sample, pipelime tries to instantiate the model with the sample's items as input keywords, so the previous example can be rewritten as:

```python
from pydantic import BaseModel, Field, validator
import pipelime.items as pli

class CustomSampleSchema(BaseModel, extra="forbid"):
    image: pli.ImageItem
    label: pli.TxtNumpyItem = Field(default_factory=pli.TxtNumpyItem)
    camera: pli.MetaDataItem = Field(default_factory=pli.MetaDataItem)

    @pyd.validator("image")
    def validate_image(cls, image_item: pli.ImageItem) -> pli.ImageItem:
        if image_item.is_shared:
            raise ValueError('Image must not be shared.')
        image = image_item()
        if image.shape[2] != 3:
            raise ValueError(f'Image has {image.shape[2]} channels, but 3 are expected')
        return image_item

    @pyd.validator("label")
    def validate_label(cls, label_item: pli.TxtNumpyItem) -> pli.TxtNumpyItem:
        if label_item.is_shared:
            raise ValueError('Label must not be shared.')

    @pyd.validator("camera")
    def validate_camera(cls, camera_item: pli.MetaDataItem) -> pli.MetaDataItem:
        if not camera_item.is_shared:
            raise ValueError('Camera must be shared.')
```

Then, replace the `sample_schema` with the model's class path:

```yaml
input:
  folder: input/path
  schema:
    sample_schema: class.path.to.CustomSampleSchema
    ignore_extra_keys: false        # ☚ NB: This is ignored if you use a pydantic model
    lazy: false
    max_samples: 0
output: output/path
```

Of course, this approach becomes useful only when you want to perform complex checks, e.g.,
that the `image` item size is equal to a value declared inside the `camera` metadata item.

## Piped Validation

Validation can be performed as a step of a pipeline as well.
To this end, `SamplesSequence` provides the `validate_samples` method, which takes the `sample_schema` as input and raise an exception if the validation fails:

```python
from pydantic import BaseModel
from pipelime.sequences import SamplesSequence
from pipelime.items import ImageItem, NumpyItem
from pipelime.utils.pydantic_types import SampleValidationInterface

class MiniMNISTSampleValidator(BaseModel):
    image: ImageItem
    label: NumpyItem

seq = SamplesSequence.from_underfolder("datasets/mini_mnist")
seq = seq.validate_samples(
  sample_schema=SampleValidationInterface(
    sample_schema=MiniMNISTSampleValidator, lazy=False, max_samples=1
  )
)
```