Part 1: Let’s get started!

To get started with Bonobo, you need to install it in a working python 3.5+ environment (you should use a virtualenv).

$ pip install bonobo

Check that the installation worked, and that you’re using a version that matches this tutorial (written for bonobo v.0.6.3).

$ bonobo version

See Installation for more options.

Create an ETL job

Since Bonobo 0.6, it’s easy to bootstrap a simple ETL job using just one file.

We’ll start here, and the later stages of the tutorial will guide you toward refactoring this to a python package.

$ bonobo init tutorial.py

This will create a simple job in a tutorial.py file. Let’s run it:

$ python tutorial.py
Hello
World
 - extract in=1 out=2 [done]
 - transform in=2 out=2 [done]
 - load in=2 [done]

Congratulations! You just ran your first Bonobo ETL job.

Inspect your graph

The basic building blocks of Bonobo are transformations and graphs.

Transformations are simple python callables (like functions) that handle a transformation step for a line of data.

Graphs are a set of transformations, with directional links between them to define the data-flow that will happen at runtime.

To inspect the graph of your first transformation:

Note

You must install the graphviz software first. It is _not_ the python’s graphviz package, you must install it using your system’s package manager (apt, brew, …).

For Windows users: you might need to add an entry to the Path environment variable for the dot command to be recognized

$ bonobo inspect --graph tutorial.py | dot -Tpng -o tutorial.png

Open the generated tutorial.png file to have a quick look at the graph.

digraph { rankdir = LR; "BEGIN" [shape="point"]; "BEGIN" -> {0 [label="extract"]}; {0 [label="extract"]} -> {1 [label="transform"]}; {1 [label="transform"]} -> {2 [label="load"]}; }

You can easily understand here the structure of your graph. For such a simple graph, it’s pretty much useless, but as you’ll write more complex transformations, it will be helpful.

Read the Code

Before we write our own job, let’s look at the code we have in tutorial.py.

Import

import bonobo

The highest level APIs of Bonobo are all contained within the top level bonobo namespace.

If you’re a beginner with the library, stick to using only those APIs (they also are the most stable APIs).

If you’re an advanced user (and you’ll be one quite soon), you can safely use second level APIs.

The third level APIs are considered private, and you should not use them unless you’re hacking on Bonobo directly.

Extract

def extract():
    yield 'hello'
    yield 'world'

This is a first transformation, written as a python generator, that will send some strings, one after the other, to its output.

Transformations that take no input and yields a variable number of outputs are usually called extractors. You’ll encounter a few different types, either purely generating the data (like here), using an external service (a database, for example) or using some filesystem (which is considered an external service too).

Extractors do not need to have its input connected to anything, and will be called exactly once when the graph is executed.

Transform

def transform(*args):
    yield tuple(
        map(str.title, args)
    )

This is a second transformation. It will get called a bunch of times, once for each input row it gets, and apply some logic on the input to generate the output.

This is the most generic case. For each input row, you can generate zero, one or many lines of output for each line of input.

Load

def load(*args):
    print(*args)

This is the third and last transformation in our “hello world” example. It will apply some logic to each row, and have absolutely no output.

Transformations that take input and yields nothing are also called loaders. Like extractors, you’ll encounter different types, to work with various external systems.

Please note that as a convenience mean and because the cost is marginal, most builtin loaders will send their inputs to their output unmodified, so you can easily chain more than one loader, or apply more transformations after a given loader.

Graph Factory

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(extract, transform, load)
    return graph

All our transformations were defined above, but nothing ties them together, for now.

This “graph factory” function is in charge of the creation and configuration of a bonobo.Graph instance, that will be executed later.

By no mean is Bonobo limited to simple graphs like this one. You can add as many chains as you want, and each chain can contain as many nodes as you want.

Services Factory

def get_services(**options):
    return {}

This is the “services factory”, that we’ll use later to connect to external systems. Let’s skip this one, for now.

(we’ll dive into this topic in Part 4: Services)

Main Block

if __name__ == '__main__':
    parser = bonobo.get_argument_parser()
    with bonobo.parse_args(parser) as options:
        bonobo.run(
            get_graph(**options),
            services=get_services(**options)
        )

Here, the real thing happens.

Without diving into too much details for now, using the bonobo.parse_args() context manager will allow our job to be configurable, later, and although we don’t really need it right now, it does not harm neither.

Note

This is intended to run in a console terminal. If you’re working in a jupyter notebook, you need to adapt the thing to avoid trying to parse arguments, or you’ll get into trouble.

Reading the output

Let’s run this job once again:

$ python tutorial.py
Hello
World
 - extract in=1 out=2 [done]
 - transform in=2 out=2 [done]
 - load in=2 [done]

The console output contains two things.

  • First, it contains the real output of your job (what was print()-ed to sys.stdout).

  • Second, it displays the execution status (on sys.stderr). Each line contains a “status” character, the node name, numbers and a human readable status. This status will evolve in real time, and allows to understand a job’s progress while it’s running.

    • Status character:

      • “ ” means that the node was not yet started.

      • -” means that the node finished its execution.

      • +” means that the node is currently running.

      • !” means that the node had problems running.

    • Numerical statistics:

      • in=…” shows the input lines count, also known as the amount of calls to your transformation.

      • out=…” shows the output lines count.

      • read=…” shows the count of reads applied to an external system, if the transformation supports it.

      • write=…” shows the count of writes applied to an external system, if the transformation supports it.

      • err=…” shows the count of exceptions that happened while running the transformation. Note that exception will abort a call, but the execution will move to the next row.

However, if you run the tutorial.py it happens too fast and you can’t see the status change. Let’s add some delays to your code.

At the top of tutorial.py add a new import and add some delays to the 3 stages:

import time

def extract():
    """Placeholder, change, rename, remove... """
    time.sleep(5)
    yield 'hello'
    time.sleep(5)
    yield 'world'


def transform(*args):
    """Placeholder, change, rename, remove... """
    time.sleep(5)
    yield tuple(
        map(str.title, args)
    )


def load(*args):
    """Placeholder, change, rename, remove... """
    time.sleep(5)
    print(*args)

Now run tutorial.py again, and you can see the status change during the process.

Wrap up

That’s all for this first step.

You now know:

  • How to create a new job (using a single file).

  • How to inspect the content of a job.

  • What should go in a job file.

  • How to execute a job file.

  • How to read the console output.

It’s now time to jump to Part 2: Writing ETL Jobs.