Let’s get started!

Warning

This tutorial was written for Bonobo 0.5, while the current stable version is Bonobo 0.6.

Please be aware that some things changed.

A summary of changes is available in the migration guide from 0.5 to 0.6.

To begin with Bonobo, you need to install it in a working Python 3.5+ environment, and you’ll also need cookiecutter to bootstrap your project.

$ pip install bonobo cookiecutter

See Installation for more options.

Create an empty project

Your ETL code will live in ETL projects, which are essentially a set of files, including Python code, that Bonobo can run.

$ bonobo init tutorial

This will create a tutorial directory (content description here).

To run this project, use:

$ bonobo run tutorial

Write a first transformation

Open tutorial/main.py, and delete all the code in it.

A transformation can be anything Python can call. The simplest transformations are functions and generators.

Let’s write one:

def transform(x):
    return x.upper()

Easy.

Note

This function is very similar to str.upper(), which you can use directly.

Let’s write two more transformations for the “extract” and “load” steps. In this example, we’ll generate the data from scratch, and we’ll use stdout to “simulate” data-persistence.

def extract():
    yield 'foo'
    yield 'bar'
    yield 'baz'

def load(x):
    print(x)

Bonobo does not distinguish between generators (yielding functions) and regular functions. In all cases it iterates over whatever is returned, and a regular function is simply treated as a generator that yields only once.
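The idea can be illustrated with a few lines of plain Python. This is only a sketch of the concept, not Bonobo's actual code: the helper name iter_outputs is hypothetical and was invented for this example.

```python
import inspect


def iter_outputs(func, *args):
    """Yield every output of one call, whether func returns or yields.

    Hypothetical helper for illustration; not part of Bonobo.
    """
    result = func(*args)
    if inspect.isgenerator(result):
        # Generator: forward each yielded value.
        yield from result
    else:
        # Regular function: treat the return value as a single output.
        yield result


def extract():
    yield 'foo'
    yield 'bar'
    yield 'baz'


def transform(x):
    return x.upper()


extracted = list(iter_outputs(extract))
transformed = list(iter_outputs(transform, 'foo'))
```

Both calls produce a list of outputs, so the caller never needs to know which kind of callable it was given.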

Note

Once again, you should use the builtin print() directly instead of this load() function.

Create a transformation graph

Among other features, Bonobo mainly helps you with the following:

  • Execute the transformations in independent threads.

  • Pass the outputs of one thread to the inputs of one or more other threads.

To do this, it needs to know what data-flow you want to achieve, and you’ll use a bonobo.Graph to describe it.

import bonobo

graph = bonobo.Graph(extract, transform, load)

if __name__ == '__main__':
    bonobo.run(graph)

digraph { rankdir = LR; BEGIN [shape="point"]; BEGIN -> "extract" -> "transform" -> "load"; }

Note

The if __name__ == '__main__': section is not required, unless you want to run it directly using the python interpreter.
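To make the "independent threads connected by queues" idea more concrete, here is a minimal standard-library sketch. It is an assumption-laden illustration, not how Bonobo is implemented: run_node, run_pipeline and the END sentinel are hypothetical names, each node handles exactly one output per input, and the main thread plays the role of the extract step.

```python
import queue
import threading

END = object()  # hypothetical sentinel marking the end of a stream


def run_node(func, inbox, outbox):
    """Call func on each item read from inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is END:
            outbox.put(END)  # propagate end-of-stream downstream
            return
        outbox.put(func(item))


def run_pipeline(source, *funcs):
    """Run each function in its own thread, chained by unbounded queues."""
    queues = [queue.Queue()]
    threads = []
    for func in funcs:
        outbox = queue.Queue()
        thread = threading.Thread(target=run_node, args=(func, queues[-1], outbox))
        threads.append(thread)
        queues.append(outbox)
        thread.start()

    # The main thread acts as the extract step and feeds the first queue.
    for item in source:
        queues[0].put(item)
    queues[0].put(END)

    for thread in threads:
        thread.join()

    # Drain the last queue to collect the final outputs.
    results = []
    while True:
        item = queues[-1].get()
        if item is END:
            break
        results.append(item)
    return results


results = run_pipeline(['foo', 'bar', 'baz'], str.upper)
```

Because each node has a single worker thread reading from a first-in, first-out queue, the items come out in the order they went in.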

Execute the job

Save tutorial/main.py and execute your transformation again:

$ bonobo run tutorial

This example is available in bonobo.examples.tutorials.tut01e01, and you can also run it as a module:

$ bonobo run -m bonobo.examples.tutorials.tut01e01

Rewrite it using builtins

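There is a much simpler way to describe an equivalent graph: pass the list ['foo', 'bar', 'baz'], str.upper and print directly to bonobo.Graph, with no custom functions.

The extract() generator has been replaced by a list, as Bonobo interprets non-callable iterables as no-input generators.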
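As a rough illustration of what treating a non-callable iterable as a no-input generator could mean, consider the sketch below. The as_source helper is hypothetical and only mirrors the behavior described above; it is not taken from Bonobo.

```python
def as_source(node):
    """Return a callable for node, wrapping non-callable iterables.

    Hypothetical helper for illustration; not part of Bonobo.
    """
    if callable(node):
        return node  # already usable as a transformation

    def source():
        # No input parameters: just yield the iterable's items in order.
        yield from node

    return source


extract = as_source(['foo', 'bar', 'baz'])
items = list(extract())
```

Callables such as print pass through unchanged, while the list becomes something that behaves like the extract() generator written earlier.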

This example is available in bonobo.examples.tutorials.tut01e02, and you can run it as a module too:

$ bonobo run -m bonobo.examples.tutorials.tut01e02

You can now jump to the next part (Working with files), or read the short summary below of the concepts and definitions introduced here.

Takeaways

① The bonobo.Graph class is used to represent a data-processing pipeline.

It can represent simple list-like linear graphs, like here, but it can also represent much more complex graphs, with forks and joins.

This is what the graph we defined looks like:

digraph { rankdir = LR; BEGIN [shape="point"]; BEGIN -> "iter(['foo', 'bar', 'baz'])" -> "str.upper" -> "print"; }

② Transformations are simple Python callables. Whatever can be called can be used as a transformation. Callables can either return or yield data to send it to the next step. Regular functions (using return) should be preferred if each call is guaranteed to return exactly one result, while generators (using yield) should be preferred if the number of output lines for a given input varies.
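A short plain-Python example of that choice follows. The function names shout and split_words are made up for illustration and do not come from Bonobo.

```python
def shout(line):
    # Exactly one output per input: a regular function using return.
    return line.upper()


def split_words(line):
    # Zero, one or many outputs per input: a generator using yield.
    for word in line.split():
        yield word


one = shout('foo bar')
many = list(split_words('foo bar'))
none = list(split_words(''))
```

Here shout always produces one value, whereas split_words produces as many values as there are words, including none at all for an empty line.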

③ The Graph instance, or transformation graph, is executed using an ExecutionStrategy. You won’t use it directly, but bonobo.run() creates an instance of bonobo.ThreadPoolExecutorStrategy under the hood (the default strategy). The actual behavior of an execution depends on the strategy chosen, but the default should be fine for most cases.

④ Before actually executing the transformations, the ExecutorStrategy instance will wrap each component in an execution context, whose responsibility is to hold the state of the transformation. It enables you to keep the transformations stateless, while allowing you to add an external state if required. We’ll expand on this later.

Concepts and definitions

  • Transformation: a callable that takes input (as call parameters) and returns output(s), either as its return value or by yielding values (a.k.a. returning a generator).

  • Transformation graph (or Graph): a set of transformations tied together in a bonobo.Graph instance, which is a directed acyclic graph (or DAG).

  • Node: a graph element, most probably a transformation in a graph.

  • Execution strategy (or strategy): a way to run a transformation graph. Its responsibility is mainly to parallelize (or not) the transformations, on one or more processes and/or computers, and to set up the right queuing mechanism for transformations’ inputs and outputs.

  • Execution context (or context): a wrapper around a node that holds the state for it. If the node needs state, there are tools available in bonobo to feed it to the transformation using additional call parameters, keeping transformations stateless.
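The execution context idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not Bonobo's API: the NodeContext class and the number_lines function are hypothetical, and the real tools Bonobo provides for injecting state are not shown here.

```python
class NodeContext:
    """Hypothetical wrapper that owns the state for one node."""

    def __init__(self, func, state):
        self.func = func
        self.state = state

    def __call__(self, *args):
        # The state is passed as an additional leading call parameter,
        # so the wrapped function itself keeps no state between calls.
        return self.func(self.state, *args)


def number_lines(counter, line):
    # The function only uses what it receives as parameters.
    counter['seen'] += 1
    return '{}: {}'.format(counter['seen'], line)


context = NodeContext(number_lines, {'seen': 0})
numbered = [context(line) for line in ['foo', 'bar', 'baz']]
```

The counter lives in the wrapper rather than in the function, so the same function could be reused in another graph with a fresh state.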