Schema Validation
When you deploy a pipeline in production, you want to make sure that the input data is valid. Pipelime commands provide a standard way to define a validation schema for both input and output sequences. For example, this might be a configuration file for the clone command:
input:
  folder: input/path
  schema:                      # ☚ The "schema" argument of the "clone" command
    sample_schema:             # ☚ What a sample should look like
      image:                   # ☚ The "image" item key
        class_path: ImageItem  # ☚ The expected item type (can be a base class)
        is_optional: false     # ☚ Whether to raise an error if the key is missing
        is_shared: false       # ☚ Whether the item should be shared
      label:
        class_path: TxtNumpyItem
        is_optional: true
        is_shared: false
      camera:
        class_path: MetaDataItem
        is_optional: true
        is_shared: true
    ignore_extra_keys: false   # ☚ Whether to raise an error if the sample has extra keys
    lazy: false                # ☚ Whether to check each sample when it is accessed or just once when building the sequence
    max_samples: 0             # ☚ If "lazy" is False, at most "max_samples" are checked
output: output/path
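To make the meaning of these fields concrete, here is a minimal pure-Python sketch of the per-sample checks such a schema describes. It is not Pipelime's implementation: `KeySpec`, `check_sample`, and the representation of a sample as a dict mapping each key to a `(value, is_shared)` pair are hypothetical names and layouts chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class KeySpec:
    """Hypothetical stand-in for one entry of `sample_schema`."""

    item_type: type  # plays the role of `class_path`
    is_optional: bool = False
    is_shared: bool = False


def check_sample(
    sample: Dict[str, Tuple[Any, bool]],
    schema: Dict[str, KeySpec],
    ignore_extra_keys: bool = False,
) -> None:
    """Raise ValueError if `sample` does not match `schema`.

    `sample` maps each key to a (value, is_shared) pair; this layout is an
    assumption made for the sketch, not Pipelime's data model.
    """
    for key, spec in schema.items():
        if key not in sample:
            if spec.is_optional:
                continue
            raise ValueError(f"Missing required key: {key!r}")
        value, shared = sample[key]
        # isinstance() accepts subclasses, mirroring "can be a base class".
        if not isinstance(value, spec.item_type):
            raise ValueError(f"Key {key!r}: expected {spec.item_type.__name__}")
        if shared != spec.is_shared:
            raise ValueError(f"Key {key!r}: is_shared should be {spec.is_shared}")
    if not ignore_extra_keys:
        extra = set(sample) - set(schema)
        if extra:
            raise ValueError(f"Unexpected keys: {sorted(extra)}")


schema = {
    "image": KeySpec(bytes),
    "label": KeySpec(int, is_optional=True),
    "camera": KeySpec(dict, is_optional=True, is_shared=True),
}
# A sample with the required image and a shared camera item passes silently.
check_sample({"image": (b"...", False), "camera": ({}, True)}, schema)
```

Note that the optional `label` key may be absent, whereas a missing `image` or an unexpected extra key raises an error unless `ignore_extra_keys` is set.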
You usually don’t have to write such a schema by hand. Instead, if you have a dataset that fulfills the target schema, just call pipelime validate and copy-paste the output, possibly tweaking some values. For example:
$ pipelime validate +i input_dataset
>>>
...
📦 output_schema_def: yaml schema definition
input:
  schema:
    sample_schema:
      image:
        class_path: PngImageItem
        is_optional: false
        is_shared: false
      label:
        class_path: TxtNumpyItem
        is_optional: true
        is_shared: false
      camera:
        class_path: YamlMetaDataItem
        is_optional: true
        is_shared: true
    ignore_extra_keys: false
    lazy: true
    max_samples: 0
...
Adding A Custom Validation Function
Though this simple definition may be enough for most cases, more complex validation schemes can be easily developed if you don’t mind writing a bit of code. For example, you may want to check that the image item has 3 channels, so you write a function that raises an exception if the validation fails and returns the parsed item otherwise:
from pipelime.items import ImageItem


def check_image_channels(image_item: ImageItem) -> ImageItem:
    image = image_item()
    if image.shape[2] != 3:
        raise ValueError(f'Image has {image.shape[2]} channels, but 3 are expected')
    return image_item
Then, you can add it to the schema definition:
input:
  folder: input/path
  schema:
    sample_schema:
      image:
        class_path: ImageItem
        is_optional: false
        is_shared: false
        validator: class.path.to.check_image_channels
...
Here, class.path.to.check_image_channels may also be given as path/to/script.py:check_image_channels. Note that the other checks (the type check, whether the item is optional, and whether it is shared) are still performed before your custom validator is called.
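One detail worth noting about the validator above: it indexes `image.shape[2]`, which raises an `IndexError` for a two-dimensional (grayscale) array before the intended `ValueError` can be produced. The sketch below shows a slightly more defensive variant. To keep it runnable without Pipelime, it relies on hypothetical stand-ins: `FakeArray` mimics only the `shape` attribute of an array, and `FakeImageItem` mimics an item that returns its data when called. The `expected` parameter is also an addition made for this example.

```python
from typing import Tuple


class FakeArray:
    """Hypothetical stand-in exposing only `shape`, like a NumPy array."""

    def __init__(self, shape: Tuple[int, ...]):
        self.shape = shape


class FakeImageItem:
    """Hypothetical stand-in for an image item: calling it returns the data."""

    def __init__(self, data: FakeArray):
        self._data = data

    def __call__(self) -> FakeArray:
        return self._data


def check_image_channels(image_item, expected: int = 3):
    """Return the item unchanged, or raise ValueError on a channel mismatch."""
    shape = image_item().shape
    # Treat a 2D array as a single-channel image instead of failing on shape[2].
    channels = shape[2] if len(shape) > 2 else 1
    if channels != expected:
        raise ValueError(
            f"Image has {channels} channels, but {expected} are expected"
        )
    return image_item


rgb = FakeImageItem(FakeArray((28, 28, 3)))
assert check_image_channels(rgb) is rgb  # valid items are returned as-is
```

Because `expected` has a default value, the function can still be called with the item as its only argument.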
Custom Sample Validation
If you feel comfortable with pydantic, you can even write a custom sample validator as a pydantic model. To validate a sample, Pipelime tries to instantiate the model with the sample’s items as keyword arguments, so the previous example can be rewritten as:
from pydantic import BaseModel, Field, validator

import pipelime.items as pli


class CustomSampleSchema(BaseModel, extra="forbid"):
    image: pli.ImageItem
    label: pli.TxtNumpyItem = Field(default_factory=pli.TxtNumpyItem)
    camera: pli.MetaDataItem = Field(default_factory=pli.MetaDataItem)

    @validator("image")
    def validate_image(cls, image_item: pli.ImageItem) -> pli.ImageItem:
        if image_item.is_shared:
            raise ValueError('Image must not be shared.')
        image = image_item()
        if image.shape[2] != 3:
            raise ValueError(f'Image has {image.shape[2]} channels, but 3 are expected')
        return image_item

    @validator("label")
    def validate_label(cls, label_item: pli.TxtNumpyItem) -> pli.TxtNumpyItem:
        if label_item.is_shared:
            raise ValueError('Label must not be shared.')
        # A validator must return the value; otherwise the field is set to None.
        return label_item

    @validator("camera")
    def validate_camera(cls, camera_item: pli.MetaDataItem) -> pli.MetaDataItem:
        if not camera_item.is_shared:
            raise ValueError('Camera must be shared.')
        return camera_item
Then, replace the sample_schema with the model’s class path:
input:
  folder: input/path
  schema:
    sample_schema: class.path.to.CustomSampleSchema
    ignore_extra_keys: false  # ☚ NB: This is ignored if you use a pydantic model
    lazy: false
    max_samples: 0
output: output/path
Of course, this approach becomes useful only when you want to perform complex checks, e.g., that the image item size is equal to a value declared inside the camera metadata item.
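As an illustration of such a cross-item check, the sketch below compares an image shape against a size stored in camera metadata. It is written as a plain function over already-loaded values so that it runs without Pipelime or pydantic; inside a sample model, logic of this kind would typically sit in a model-level validator that sees all fields at once (for example, a pydantic v1 `root_validator`). The function name and the metadata layout (`{"size": {"width": ..., "height": ...}}`) are assumptions made for the example.

```python
from typing import Any, Mapping, Sequence


def check_image_matches_camera(
    image_shape: Sequence[int], camera_meta: Mapping[str, Any]
) -> None:
    """Raise ValueError if the image size differs from the declared camera size.

    `image_shape` follows the (height, width, ...) array convention; the
    `camera_meta["size"]` layout is hypothetical.
    """
    declared = camera_meta["size"]
    expected = (declared["height"], declared["width"])
    actual = tuple(image_shape[:2])
    if actual != expected:
        raise ValueError(
            f"Image size {actual} does not match camera size {expected}"
        )


camera = {"size": {"width": 640, "height": 480}}
check_image_matches_camera((480, 640, 3), camera)  # passes silently
```

A check like this cannot be expressed as a per-item validator, because it needs both the image and the camera item at the same time.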
Piped Validation
Validation can be performed as a step of a pipeline as well. To this end, SamplesSequence provides the validate_samples method, which takes the sample_schema and a few other options as input, raising an exception if the validation fails. In this case, the schema must be provided as a pydantic model:
from pydantic import BaseModel

from pipelime.sequences import SamplesSequence
from pipelime.items import ImageItem, NumpyItem


class MiniMNISTSampleValidator(BaseModel):
    image: ImageItem
    label: NumpyItem


seq = SamplesSequence.from_underfolder("datasets/mini_mnist")
seq = seq.validate_samples(
    sample_schema=MiniMNISTSampleValidator, lazy=False, max_samples=1
)
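The difference between `lazy=True` and `lazy=False` can be sketched with a small wrapper around a list of samples. This illustrates the behaviour described by the option comments earlier on this page and is not Pipelime's code: `ValidatedSequence`, its constructor, and `require_label` are hypothetical, and treating `max_samples=0` as "check every sample" is an assumption.

```python
from typing import Any, Callable, List


class ValidatedSequence:
    """Hypothetical wrapper that validates samples eagerly or on access."""

    def __init__(
        self,
        samples: List[Any],
        validate: Callable[[Any], None],
        lazy: bool = True,
        max_samples: int = 0,
    ):
        self._samples = samples
        self._validate = validate
        self._lazy = lazy
        if not lazy:
            # Eager mode: check up front, limited to the first `max_samples`
            # (assumption: 0 means "no limit").
            limit = max_samples if max_samples > 0 else len(samples)
            for sample in samples[:limit]:
                validate(sample)

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, idx: int) -> Any:
        sample = self._samples[idx]
        if self._lazy:
            # Lazy mode: each sample is checked when it is accessed.
            self._validate(sample)
        return sample


def require_label(sample: dict) -> None:
    """Toy validator: a sample must contain a 'label' key."""
    if "label" not in sample:
        raise ValueError("missing 'label'")


data = [{"label": 0}, {"image": "no label here"}]

# Eager check of only the first sample succeeds; the bad one is never inspected.
eager = ValidatedSequence(data, require_label, lazy=False, max_samples=1)

# Lazy mode builds without error and fails only when the bad sample is read.
lazy_seq = ValidatedSequence(data, require_label, lazy=True)
```

The trade-off is visible here: eager validation with a small `max_samples` is cheap but can miss invalid samples further along, while lazy validation checks everything that is actually read, at the cost of failing later in the pipeline.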