PyDaF stands for PYthon DAta Flows. It is a Python package that implements data-driven processing pipelines.

While programming usually involves writing a sequence of instructions, PyDaF instead revolves around two major elements: data units, which carry actual data as well as some associated metadata, and processors, which match data units depending on their type and associated metadata before processing them.

It is up to the programmer to define data types and processor input specifications; however, the library will then automatically handle finding matching data units, locking them and running the appropriate processors.

The library is also capable of running multiple processors in parallel. For now, this can only be done locally, using threads. However, running PyDaF processing pipelines on multiple computers will be possible in the future.

Features

Written in pure Python. Python 2.6 is required.
Synchronous or threaded execution modes.
Support for multiple jobs.
Automatic parallelisation of tasks.
Data unit wrappers for common Python types.
MapReduce-like algorithms can be easily implemented.
Dead ends in a processing pipeline are detected automatically.
Exceptions in processors are caught and stored as the cause for a job's failure.

A note about version numbers

PyDaF is currently in a highly unstable state, as it is still under development. However, the version numbers will follow the scheme described below.

The first number corresponds to the stable release number (0 at the moment, as no stable versions have been released).
The second number corresponds to the current API; a stable version will always use 0 here.
The third number corresponds to a patch level; no incompatibilities with other patch levels should be present.
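
As an illustration of the scheme above, the following sketch splits a version string into its three numbers and compares API levels. The helper names and the sample version strings used with them are hypothetical, the reading of "stable" (first number above 0, second number equal to 0) is an assumption drawn from the description, and the code targets a current Python 3 rather than Python 2.6.

```python
def parse_version(text):
    """Split 'release.api.patch' into a tuple of three integers."""
    release, api, patch = (int(part) for part in text.split("."))
    return release, api, patch


def is_stable(text):
    """Assumed reading: stable means release > 0 and API number 0."""
    release, api, _ = parse_version(text)
    return release > 0 and api == 0


def same_api(first, second):
    """Versions that differ only in patch level should be compatible."""
    return parse_version(first)[:2] == parse_version(second)[:2]
```

With these helpers, two versions that differ only in their third number compare as compatible, while a change in the second number signals a different API.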

Copyright and license

This library and its associated documentation are copyright © by Emmanuel Benoît and are subject to change without notice. PyDaF comes as-is, without warranty, and in no event can Emmanuel Benoît be held liable for any problems resulting from the use of this software. The software is distributed under the MIT license.

Concepts

In order to understand and use the PyDaF library, it is necessary to know the different elements it manipulates. This section describes these various elements.

Data units: Data units are the most basic elements manipulated by PyDaF. A data unit is an instance of a class that includes both actual data and PyDaF-specific metadata. The most important part of a data unit's metadata is its set of flags. A flag is a short string that can be associated with a data unit, indicating that it has undergone some transformation. Other metadata include the data unit's lifetime (an integer that is decreased each time the data unit is processed and must never reach 0, which prevents a job from entering an endless loop), as well as its modification or retirement status.
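
To make the idea concrete, here is a minimal sketch of a payload that carries flags and a lifetime. It is not PyDaF's actual class: the names DataUnit and mark_processed, the default lifetime and the choice of exception are all assumptions, and the code is written for a current Python 3.

```python
class DataUnit:
    """Hypothetical stand-in for a data unit: a payload plus metadata."""

    def __init__(self, value, lifetime=10):
        self.value = value        # the actual data
        self.flags = set()        # short strings recording past transformations
        self.lifetime = lifetime  # remaining number of processing steps

    def mark_processed(self, flag):
        """Record a transformation and consume one step of lifetime."""
        if self.lifetime <= 1:
            # The lifetime must never reach 0: treat this as a runaway job.
            raise RuntimeError("data unit lifetime exhausted")
        self.lifetime -= 1
        self.flags.add(flag)


unit = DataUnit(42, lifetime=2)
unit.mark_processed("parsed")
```

After the call, the unit carries the "parsed" flag and has one step of lifetime left; a second call would raise instead of letting the counter reach 0.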

Data descriptions: A data description consists of a type and two optional elements: a set of flags the described data must have, and a set of flags it must not have. Descriptions can be used to retrieve sub-collections of specific data units from collections, or as part of data specifications.
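
The matching rule can be sketched as follows. The Description class, its matches method and the representation of a data unit as a (value, flags) pair are hypothetical simplifications, not PyDaF's API; the code assumes a current Python 3.

```python
class Description:
    """Hypothetical description: a type plus required and forbidden flags."""

    def __init__(self, data_type, required=(), forbidden=()):
        self.data_type = data_type
        self.required = frozenset(required)
        self.forbidden = frozenset(forbidden)

    def matches(self, value, flags):
        """True if the type fits, all required flags are present and no forbidden flag is."""
        flags = set(flags)
        return (
            isinstance(value, self.data_type)
            and self.required <= flags
            and not (self.forbidden & flags)
        )


# Each unit is modelled here as a (value, flags) pair.
units = [(1, {"parsed"}), (2, {"parsed", "done"}), ("x", {"parsed"})]
pending = Description(int, required={"parsed"}, forbidden={"done"})
selected = [value for value, flags in units if pending.matches(value, flags)]
```

Filtering a list with a description in this way mirrors retrieving a sub-collection: only the integer that is flagged "parsed" and not flagged "done" is kept.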

Collections: A collection is a set of data units. The metadata of the data units in a collection can be manipulated en masse. It is also possible to create named sub-collections that automatically match some descriptions, or to retrieve some of the data units using descriptions directly.

Data specifications: Data specifications are very similar to data descriptions, but they also include indications about the cardinality of a set of data units. They are used to specify a processor's input.
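
One way to picture a specification is a description extended with a count range. In the sketch below, the Specification class, the min_count and max_count parameters and the decision to truncate the selection at max_count are all assumptions made for illustration; PyDaF may express cardinality differently. The code assumes a current Python 3.

```python
class Specification:
    """Hypothetical input specification: a type, required flags and a count range."""

    def __init__(self, data_type, required=(), min_count=1, max_count=None):
        self.data_type = data_type
        self.required = frozenset(required)
        self.min_count = min_count
        self.max_count = max_count  # None means "no upper bound"

    def select(self, units):
        """Return the units to feed a processor, or None if too few match."""
        matching = [
            (value, flags) for value, flags in units
            if isinstance(value, self.data_type) and self.required <= set(flags)
        ]
        if len(matching) < self.min_count:
            return None
        if self.max_count is not None:
            matching = matching[:self.max_count]
        return matching


units = [(1, {"parsed"}), (2, {"parsed"}), (3, set())]
pair = Specification(int, required={"parsed"}, min_count=2, max_count=2)
triple = Specification(int, required={"parsed"}, min_count=3)
```

Here pair.select(units) yields the two flagged integers, while triple.select(units) returns None because only two units satisfy the flag requirement.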

Processors: Processors are the active elements of a PyDaF processing pipeline. They are executed when data units matching their input specifications are found.

Specificity: Descriptions, specifications and processors can be more or less specific. For example, a description matching the Integer data type is less specific than a description matching the Integer +someflag data type. A specification's specificity is computed from the description it encapsulates and from the cardinality it indicates, while a processor's specificity is computed from its various input specifications.
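
The ordering in the example can be illustrated with a simple score. The formula below (one point per flag constraint) is an assumption chosen only to reproduce that ordering; the source does not state how PyDaF actually computes specificity, and the function and dictionary names are hypothetical. The code assumes a current Python 3.

```python
def specificity(required=(), forbidden=()):
    """Assumed score: one point per flag constraint."""
    return len(set(required)) + len(set(forbidden))


# Two descriptions of the same type, with and without a flag constraint.
descriptions = {
    "any integer": {"required": set(), "forbidden": set()},
    "integer +someflag": {"required": {"someflag"}, "forbidden": set()},
}

# Most specific description first.
ranked = sorted(
    descriptions,
    key=lambda name: specificity(**descriptions[name]),
    reverse=True,
)
```

Ranking candidates this way lets a dispatcher prefer the handler whose requirements are narrowest when several of them match the same data.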

Pipelines, jobs and tasks: A pipeline is a set of processors associated with a method of execution. A job is a top-level execution element. It is executed through a pipeline from some specific input data units and runs until a dead end is found, an exception is raised or a specific set of data units is produced. A task is a part of a job's execution: it associates a processor with specific data units.
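
The three ways a job can end are easier to see in a small synchronous loop. This is a simplified sketch, not PyDaF's scheduler: run_job, its parameters, the status strings and the max_steps guard (a crude stand-in for data unit lifetimes) are all hypothetical, processors are reduced to (matches, transform) pairs of plain functions, and the code assumes a current Python 3.

```python
def run_job(processors, units, is_done, max_steps=100):
    """Run processors one task at a time until the job ends.

    processors: list of (matches, transform) pairs of callables.
    units: initial values handed to the job.
    is_done: callable returning True once the expected units exist.
    Returns a (status, payload) pair.
    """
    units = list(units)
    for _ in range(max_steps):
        if is_done(units):
            return "done", units
        # A task pairs a processor with a specific unit.
        task = next(
            ((transform, index)
             for index, unit in enumerate(units)
             for matches, transform in processors
             if matches(unit)),
            None,
        )
        if task is None:
            return "dead-end", units  # nothing can make progress
        transform, index = task
        try:
            units[index] = transform(units[index])
        except Exception as exc:
            return "failed", exc  # keep the cause of the failure
    return "failed", RuntimeError("step limit reached")


processors = [
    (lambda u: isinstance(u, str), int),                         # parse text
    (lambda u: isinstance(u, int) and u < 100, lambda u: u * 10),  # scale up
]


def large_enough(units):
    return all(isinstance(u, int) and u >= 100 for u in units)


status, result = run_job(processors, ["7"], large_enough)
```

Starting from "7", the loop parses the string, scales the number twice and stops with the status "done". An input that no processor accepts ends in "dead-end", and an input that makes a processor raise ends in "failed" with the exception as payload.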

Installation

The PyDaF distribution archive includes a setup.py script based on Python's distutils. Running the python setup.py install command will install the PyDaF package automatically. However, the documentation and examples will not be installed.

The following sub-directories can be found in the PyDaF distribution:

pydaf/ contains the PyDaF package itself. If you do not use the supplied installation script, you will have to install it manually. Install this directory somewhere in your Python search path. On most systems, you can use the lib/site-packages directory. Its exact location might vary depending on your specific Python installation. Alternatively, you may copy it to an arbitrary location and manually add this location to your Python search path (e.g. in the PYTHONPATH environment variable).
doc/ contains the TeX source for this documentation. You may use the file as-is, or compile it to a PDF, or generate HTML documentation from it.
examples/ contains a few examples.
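
After a manual installation, a quick check that the package is visible on the search path can save some guesswork. The sketch below uses the standard importlib.util.find_spec function from a current Python 3; the helper name is hypothetical, and the assumption that the package is importable as pydaf comes only from the name of its directory.

```python
import importlib.util


def is_importable(package_name):
    """Return True if a top-level package can be found on the search path."""
    return importlib.util.find_spec(package_name) is not None


# Assumes the package is named after its directory, pydaf/.
pydaf_found = is_importable("pydaf")
```

If pydaf_found is False, the directory is most likely not in a location listed in sys.path or in the PYTHONPATH environment variable.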
