
PyDaF stands for PYthon DAta Flows. It is a Python package that implements data-driven processing pipelines.

While programming usually involves writing a sequence of instructions, PyDaF instead revolves around two major elements: data units, which carry actual data as well as some associated metadata, and processors, which match data units depending on their type and associated metadata before processing them.

It is up to the programmer to define data types and processor input specifications; however, the library will then automatically handle finding matching data units, locking them and running the appropriate processors.

The library can also run multiple processors in parallel. For now, this can only be done locally, using threads; support for running PyDaF processing pipelines on multiple computers is planned for the future.

Features

A note about version numbers

PyDaF is currently in a highly unstable state, as it is still under development. However, the version numbers will follow the scheme described below.

Copyright and license

This library and its associated documentation are copyright © by Emmanuel Benoît. It is subject to change without notice. PyDaF comes as-is, without warranty and in no event can Emmanuel Benoît be held liable for any problems resulting from the use of this software. The software is distributed under the MIT license.

Concepts

In order to understand and use the PyDaF library, it is necessary to know the different elements it manipulates. This section describes these various elements.

Data units

Data units are the most basic elements manipulated by PyDaF. A data unit is an instance of a class that includes both actual data and PyDaF-specific metadata. The most important part of a data unit's metadata is its set of flags. A flag is a short string that can be associated with a data unit to indicate that it has undergone some transformation. Other metadata include the data unit's lifetime, as well as its modification or retirement status. The lifetime is an integer that is decremented each time the data unit is processed and must never reach 0, which prevents a job from entering an endless loop.
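
The sketch below illustrates the idea of flags and lifetime in plain Python. It is not PyDaF's actual class: the names DataUnit, payload and mark_processed are hypothetical, and raising an error when the lifetime would reach 0 is only one possible way to enforce the rule described above.

```python
from dataclasses import dataclass, field


@dataclass
class DataUnit:
    """Hypothetical stand-in for a data unit; not PyDaF's real class."""
    payload: object                          # the actual data
    flags: set = field(default_factory=set)  # short strings marking past transformations
    lifetime: int = 10                       # remaining processing steps allowed

    def mark_processed(self, flag):
        """Record a transformation and spend one unit of lifetime."""
        if self.lifetime <= 1:
            # The lifetime must never reach 0; refusing here stops endless loops.
            raise RuntimeError("data unit lifetime exhausted")
        self.lifetime -= 1
        self.flags.add(flag)


unit = DataUnit(payload=42, lifetime=3)
unit.mark_processed("normalized")
unit.mark_processed("validated")
```

After these two calls the unit carries both flags and has one unit of lifetime left, so a third call would be refused.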

Data descriptions

A data description consists of a type, as well as two optional elements: a set of flags the data being described must have, and a set of flags the data being described must not have. Descriptions can be used to retrieve sub-collections of specific data units from collections, or as part of data specifications.
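
As a rough illustration of how such a description could be evaluated, here is a minimal sketch. The Description class, its field names and the representation of data units as (value, flags) pairs are assumptions made for this example, not PyDaF's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """Hypothetical data description: a type plus optional flag constraints."""
    data_type: type
    required: frozenset = frozenset()   # flags the data must have
    forbidden: frozenset = frozenset()  # flags the data must not have

    def matches(self, value, flags):
        flags = set(flags)
        return (
            isinstance(value, self.data_type)
            and self.required <= flags          # every required flag is present
            and not (self.forbidden & flags)    # no forbidden flag is present
        )


parsed_ints = Description(
    int,
    required=frozenset({"parsed"}),
    forbidden=frozenset({"rejected"}),
)
units = [(1, {"parsed"}), (2, {"parsed", "rejected"}), ("3", {"parsed"}), (4, set())]
selected = [value for value, flags in units if parsed_ints.matches(value, flags)]
```

Only the first unit is selected: the second carries a forbidden flag, the third has the wrong type, and the fourth lacks the required flag.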

Collections

A collection is a set of data units. The metadata of the data units in a collection can be manipulated en masse. It is also possible to create named sub-collections that automatically match some descriptions, or to retrieve subsets of the data units using the descriptions directly.

Data specifications

Data specifications are very similar to data descriptions, but they also include indications about the cardinality of a set of data units. They are used to specify a processor's input.
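
The following sketch shows one way a description could be combined with a cardinality. It assumes the cardinality is a minimum and an optional maximum count; the text above does not say how PyDaF expresses it, and the Specification class and its select method are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Specification:
    """Hypothetical specification: flag constraints plus a cardinality range."""
    data_type: type
    required: frozenset = frozenset()
    forbidden: frozenset = frozenset()
    minimum: int = 1
    maximum: Optional[int] = 1  # None means "no upper bound"

    def select(self, units):
        """Return matching (value, flags) pairs, or None if too few exist."""
        matching = [
            (value, flags)
            for value, flags in units
            if isinstance(value, self.data_type)
            and self.required <= flags
            and not (self.forbidden & flags)
        ]
        if len(matching) < self.minimum:
            return None
        return matching if self.maximum is None else matching[: self.maximum]


units = [(1, {"parsed"}), (2, {"parsed"}), (3, set())]
pair = Specification(int, required=frozenset({"parsed"}), minimum=2, maximum=2)
triple = Specification(int, required=frozenset({"parsed"}), minimum=3, maximum=None)
pair_input = pair.select(units)
triple_input = triple.select(units)
```

Here pair is satisfied by the two parsed integers, while triple finds too few matching units and yields no input.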

Processors

Processors are the active elements of a PyDaF processing pipeline. They are executed when data units matching their input specifications are found.

Specificity

Descriptions, specifications and processors can be more or less specific. For example, a description matching the Integer data type is less specific than a description matching the Integer +someflag data type. A specification's specificity is computed from the description it encapsulates and from the cardinality it indicates, while a processor's specificity is computed from its various input specifications.
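
To make the ordering concrete, here is a toy scoring scheme. The formulas are assumptions chosen for illustration only (one point per flag constraint, and a processor score equal to the sum of its inputs' scores); the text above does not give PyDaF's actual computation, and both function names are hypothetical.

```python
def description_specificity(required=(), forbidden=()):
    """Hypothetical score: each flag constraint makes a description more specific."""
    return len(set(required)) + len(set(forbidden))


def processor_specificity(input_scores):
    """Hypothetical score: combine the scores of a processor's inputs by summing."""
    return sum(input_scores)


plain_integer = description_specificity()                          # "Integer"
flagged_integer = description_specificity(required={"someflag"})   # "Integer +someflag"

generic_processor = processor_specificity([plain_integer, plain_integer])
flag_aware_processor = processor_specificity([plain_integer, flagged_integer])
```

Under these assumptions the flagged description ranks above the plain one, and a processor that takes it as input ranks above one that only takes plain integers.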

Pipelines, jobs and tasks

A pipeline is a set of processors associated with a method of execution. A job is a top-level execution element. It is executed through a pipeline from some specific input data units and runs until a dead end is found, an exception is raised or a specific set of data units is produced. A task is a part of a job's execution: it associates a processor with specific data units.
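
The relationship between these three elements can be sketched with a small sequential loop. Everything here is hypothetical: run_job, the representation of processors as (matches, transform) function pairs, the (value, flags) data units and the max_steps guard are assumptions for this example, and the sketch handles only one data unit per task, without locking or threads.

```python
def run_job(processors, units, is_done, max_steps=100):
    """Run tasks until the goal is met or no processor matches (a dead end).

    processors: list of (matches, transform) function pairs.
    units: list of (value, flags) pairs, with flags as frozensets.
    An exception raised by a processor simply propagates and ends the job.
    """
    for _ in range(max_steps):
        if is_done(units):
            return "done", units
        # A task pairs one processor with one specific data unit.
        task = next(
            (
                (transform, index)
                for index, (value, flags) in enumerate(units)
                for matches, transform in processors
                if matches(value, flags)
            ),
            None,
        )
        if task is None:
            return "dead-end", units
        transform, index = task
        units = units[:index] + [transform(*units[index])] + units[index + 1:]
    return "step-limit", units


processors = [
    # Parse strings that have not been parsed yet.
    (lambda v, f: isinstance(v, str) and "parsed" not in f,
     lambda v, f: (int(v), f | {"parsed"})),
    # Double parsed integers exactly once.
    (lambda v, f: isinstance(v, int) and "parsed" in f and "doubled" not in f,
     lambda v, f: (v * 2, f | {"doubled"})),
]


def all_doubled(units):
    return all("doubled" in flags for _, flags in units)


status, result = run_job(processors, [("3", frozenset()), ("4", frozenset())], all_doubled)
```

The flags are what keep the loop finite in this sketch: each processor adds a flag that stops it from matching the same data unit again.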

Installation

The PyDaF distribution archive includes a setup.py script based on Python's distutils. Running the python setup.py install command will install the PyDaF package automatically. However, the documentation and examples will not be installed.

The following sub-directories can be found in the PyDaF distribution: