8/15/13

Introduction to high-level multithreading in C++ via Intel TBB library


"The free lunch is over!"
"Multithreading is hard!"
"Amdahl's law will find and punish you!.."

You've probably already heard these maxims. Moms use them to scare kids who are throwing spaghetti all over the table and using global variables all day long. 

And you know it's fair since by now you've learned (the hard way, presumably) that writing parallel software is indeed a complex endeavor. Or is it? 

There is a practical observation that a software problem's complexity is at least twofold:
  • "Inherent" or unavoidable complexity that is at the essence of the problem.
  • "Accidental" complexity which covers the work that does not really have much to do with the problem at hand but needs doing anyway.
People have been developing new programming techniques, higher level languages, design patterns, frameworks and methodologies; all in order to mitigate the accidental complexity and to be able to focus only on the essential parts, thus improving productivity and reducing development risks. 

Would it be possible to sweep under the rug some of the work that makes multithreaded programming so complex? 

It turns out that this is exactly what the Intel Threading Building Blocks (Intel® TBB) library can help with and so, to illustrate where this is leading, let us first consider a simple problem.

 

The Game of Life

Say we are implementing a classic Artificial Intelligence simulation, Conway's Game of Life, which is defined by simple rules. 

The "world" is a big rectangular array of cells, each cell being in one of two possible states: alive or dead. The initial state of the board is random and at each discrete "epoch" we recompute the state of every cell such that:
  • A live cell will die if it has less than two (under-population) or more than three (overcrowding) neighbors.
  • A dead cell with exactly three neighbors will come alive (reproduction).
  • Otherwise the state will remain unchanged.
The evolution rules for a single cell may be expressed by a C++ function:

static const int DEAD = 0, ALIVE = 1;
//  returns cell's new state base on the evolution rules
int evolve_cell(int cur_state, int num_neighbors) {
    if (cur_state == ALIVE) {
        if (num_neighbors < 2 ||  // under-population
            num_neighbors > 3)    // overcrowding
            return DEAD;
    } else {
        if (num_neighbors == 3) return ALIVE; //  reproduction
    }
    return cur_state;
}

The rules are applied to each cell on the board, i.e. we have a big for-loop where, for each cell, we compute its state for the next epoch based on the number of live neighbors:

//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    for (int i = 0; i < num_cells(); i++) {
        cell[i] = evolve_cell(prev_board.cell[i], prev_board.num_neighbors(i));
    }
}
 
Here is how a typical board might look after some iterations:



Although the algorithm is trivial, if there are many cells to simulate the processing time will add up and we might want to start optimizing.

 

Trying to optimize the code

The right thing to do here would be to try to improve the algorithm first (after having made sure that we are solving the right problem). If that does not help, then start to pay attention to things like memory access patterns, branch prediction, loop unrolling etc. Lets say we also hit a wall at this point. What do we do next?

When the program runs on an Intel i7 laptop, the CPU usage looks like this: 

There are 4 CPU cores but our computation only runs on a single core at time. In practice the operating system makes it "jump" between different cores due to the presence of other processes on the system and the specifics of the OS process scheduler. If we ask the system not to do that (by setting the process CPU affinity), the picture changes to: 

 
As you can see, the 4 CPU cores are largely under-utilized: we are effectively using only one core. This probably means that we could potentially enlist the three other cores and improve the speed. 

So, how do we approach this? 

One thing to notice about this for-loop is that the computation on a cell does not modify any state that the other cells' computations might be affected by. Such algorithms are sometimes called "embarrassingly parallel"; that is, splitting the problem into parallel tasks is relatively effortless. 

Let's try to leverage this fact.

 

An attempt at multithreading

The traditional and the most basic approach would be to:
  1. Somehow partition the whole computation into several "jobs" that can run in parallel.
  2. Create a separate thread for each job and run the processing in each thread.
  3. Wait until all of the threads have finished and destroy the threads.
So we might end up with code looking like this: 

//  structure describing a chunk of work running on a single thread
struct JobDesc
{
    int first_cell;
    int num_cells;
    const game_of_life_board* prev_board;
    game_of_life_board* new_board;
};

//  the thread (callback) function 
extern "C" {
    void* RunJob(void* p) {
        JobDesc* job = reinterpret_cast<JobDesc*>(p);
        //  for each cell in the job, compute its state for the next epoch, 
        //  based on the number of the alive neighbours
        int last_cell = job->first_cell + job->num_cells;
        for (int i = 0; i < last_cell; i++) {
            job->new_board.cell[i] = evolve_cell(job->prev_board.cell[i], 
                job->prev_board.num_neighbors(cur_cell));
        }
        return 0;
    }
}

//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    const int NUM_THREADS = 4;
    pthread_t threads[NUM_THREADS];
    JobDesc jobs[NUM_THREADS];

    //  create the threads and start processing
    for (int i = 0; i < NUM_THREADS; i++) {
        //  fill in the i-th job details
        JobDesc* job = jobs[i];
        int cells_per_job = (num_cells() + NUM_THREADS - 1)/NUM_THREADS;
        job.first_cell = cells_per_job*i;
        job.num_cells  = std::min(cells_per_job, num_cells() - cells_per_job*i);
        job.prev_board = &prev_board;
        job.new_board  = this;
        pthread_create(threads[i], NULL, RunJob, job);
    }

    //  wait for all the threads to finish processing
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
}

Which indeed seems to help:

 

We use all of the 4 cores to the brim and the program runs considerably faster. Mission accomplished? 

Let's just take a step back and see what we did here. There was a for-loop and we said: "Hey, this loop should be run in parallel, on different cores". That's all we wanted to happen but then we wrote half a hundred lines of (rather fragile, bug-prone and mostly boilerplate) threading code just to implement this simple notion. 

This reeks of accidental complexity. To make matters worse, there are other not-quite-so-clear details:
  • We create and destroy the threads every time: Depending on the operating system there might be a noticeable additional overhead doing so.
  • What if we have a different number of cores - will our program scale easily to use all of the available CPU power? How many threads do we need to create, in general?
  • The implementation uses POSIX threads which makes it platform-dependent. What if we want to run it on platforms that use a different threading API?
Even with such a basic example we have turned a simple piece of code into something that is vulnerable to many potential problems and that will be hard to maintain.

 

A Game Engine example

Let us consider a more elaborate example. Say, we are developing a game engine, which inside the game loop runs a "world update" function, which in turn does simulation on different things, such as updating skeletal animations on the visible models, particle systems, physics simulation etc. 

The code might look like this:
 
void GameWorld::Update() {
    StepAnimations();  
    StepParticles(); 
    StepPhysics(); 
}

void GameWorld::StepAnimations() {
    //  update skeletal animations for every currently visible model
    for (int i = 0; i < m_Animations.size(); i++) {
        //  ...
    }
}

void GameWorld::StepParticles() {
    //  update particle systems
    for (int i = 0; i < m_Particles.size(); i++) {
        //  ...
    }
}

void GameWorld::StepPhysics() {
    //  perform physics integration for every disjoint physics "island"
    for (int i = 0; i < m_PhysicsIslands.size(); i++) {
        //  ...
    }
}
 

This problem, at least in our simplified formulation, is also "embarrassingly parallel", except that now the parallelism manifests itself on multiple levels. The high-level intent might sound like: "Hey, there are these different tasks (StepAnimations/Particles/Physics/AI) that could be run in parallel... and each of them has a loop that can be run in parallel as well. Oh, and we might also want to run the whole thing in parallel with something else." 

If we approach the problem as for the Game of Life we instantly notice that most of the required boilerplate code is very similar to what we already did. 

The partitioning of the computations into different threads becomes more complicated: how do we split the work into N (the number of threads) equally-sized parallel chunks? Given that there may be several levels of these parallelizable tasks and that at each level the amount of work can be data-dependent, the task rapidly becomes intractable in most practical cases. 

A common design pattern used to tackle this problem is to use a global task queue and a thread pool. Every worker thread from the pool takes a task from the top of the task queue, completes it and then takes the next one available until the queue is empty. The tasks themselves are relatively fine-grained. Which means that they are small enough to distribute the work evenly, but not too small. 

As simple as this may sound conceptually, implementing it from scratch, correctly and efficiently, is even more complicated. Just a few things worth mentioning:
  • Operations on the task queue should be thread-safe (e.g. the classical producer/consumer problem might require two conditional variables).
  • From a performance point of view, thread contention should be avoided as much as possible: that is, threads should not spend too much time blocked on the thread queue access.
  • Cache coherence should be taken into account; subsequent tasks running on the same CPU should operate on proximate memory locations when possible.
In other words, the CPUs should spend most of their time doing useful computations and not waste cycles on scheduling, blocking or waiting for other threads to finish.

 

Enter Intel TBB

Let's slow down on piling up this article's accidental complexity and finally get to the point. 

The Intel Threading Building Blocks library tries to address the aforementioned concerns and help implement parallel algorithms on a higher level of abstraction, factoring away the boilerplate threading code, thread pools, task queuing/scheduling and other patterns that prevail in multithreaded code. Furthermore, the goal is to do so in a platform-independent and efficient way. 

It is a pure C++ library, meaning that it uses whatever capabilities the existing platforms and compilers already provide (in contrast to the OpenMP standard, amongst others). 

Getting back to our original Game of Life main loop, using the TBB library we can make it parallel like so: 

#include <tbb/tbb.h>
// ...
//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    //  for each cell, compute in parallel its state for the next epoch, 
    //  based on the number of the alive neighbours
    tbb::parallel_for(0, num_cells(), [&](int i) {
        cell[i] = evolve_cell(prev_board.cell[i], 
            prev_board.num_neighbors(i));
    });
}
 
As you can see, this time it looks very similar to the serial (non-parallel) version, which is, of course, a good thing. 

It uses the parallel for primitive; in practice it's a workhorse of parallel computations, especially the "embarrassingly parallel" ones. This code example also takes advantage of C++11 lambda expressions syntax, which allows to represent the callback function for the loop body in a concise (albeit somewhat unusual) way. 

A side note regarding the lambda expressions: if your C++ compiler does not support them (even though nowadays most of them do), then it's possible to use C++ functors. However, the syntax gets a bit more verbose: 

struct EvolveFn {
    //  captured data
    const game_of_life_board&       prev_board;
    game_of_life_board&             _this;

    //  constructor
    EvolveFn(const game_of_life_board& prev_board_, game_of_life_board& _this_) 
        : prev_board(prev_board_), 
        _this(_this_) {}

    //  the loop body
    void operator ()(int i) const {
        _this.cell[i] = _this.evolve(prev_board.cell[i], 
            prev_board.num_neighbors(i));
    }
};

//  evolves all cells one epoch, given the previous epoch state in prev_board
void step_epoch(const game_of_life_board& prev_board) {
    //  for each cell, compute in parallel its state for the next epoch, 
    //  based on the number of the alive neighbours
    tbb::parallel_for(0, num_cells(), EvolveFn(prev_board, *this));
}
 

If that looks too bad, it's possible to come up with some macro hacks and instead of directly using a C++ functor have something like this: 

//  evolves all cells one epoch, given the previous epoch state in prev_board
void step_epoch(const game_of_life_board& prev_board)
{
    //  for each cell, compute in parallel its state for the next epoch, 
    //  based on the number of the alive neighbours
    game_of_life_board* _this = this;
    LAMBDA_OPEN1(lambda, int,i)
    {
        _this->cell[i] = _this->evolve(prev_board.cell[i], 
            prev_board.num_neighbors(i));
    }
    LAMBDA_CLOSE2(lambda, 
        const game_of_life_board,prev_board, game_of_life_board*, _this);

    tbb::parallel_for(0, num_cells(), lambda);
}
 

Moving on, there is another basic high-level pattern that is useful: "run these different computations in parallel and wait for all of them to finish before continuing". This is what task groups in TBB are used for. The earlier Game Engine example code now becomes: 

#include <tbb/tbb.h>

void GameWorld::Update() {
    tbb::task_group tg;    

    tg.run([&]() { StepAnimations(); } 
    tg.run([&]() { StepParticles();  }
    tg.run([&]() { StepPhysics();    }

    tg.wait(); 
}

void GameWorld::StepAnimations() {
    //  update skeletal animations for every currently visible model
    tbb::parallel_for(0, m_Animations.size(), [&](int i) {
        //  ...
    });
}

void GameWorld::StepParticles() {
    //  update particle systems
    tbb::parallel_for(0, m_Particles.size(), [&](int i) {
        //  ...
    });
}

void GameWorld::StepPhysics() {
    //  perform physics integration for every disjoint physics "island"
    tbb::parallel_for(0, m_PhysicsIslands.size(), [&](int i) {
        //  ...
    });
}

Once again we have code that is similar to the original serial version yet this time it is parallel. TBB is taking care of the gritty scheduling details behind the scenes and trying to do so in an efficient way.

 

The magic that happens behind the scenes

We tell TBB about the boundaries of the "tasks" and the dependencies between them. In the previous example, we did it explicitly with the tasks group and implicitly with the parallel_for loops (leaving TBB to decide how to partition the iteration space into different chunks of work). 

These tasks are then arranged in correct order between different threads from a thread pool. The "task queue" concept is also implemented, with one notable difference: each thread has its own task queue instead of using a single global one. 

This solves the problem with access contention to a global task queue but adds an additional danger that some threads might empty their queues early and become idle. To mitigate this, TBB implements so-called task stealing so when a thread runs out of tasks it tries to "steal" them from the back of some other thread's queue. There is a bunch of heuristics and fine-tuning involved in this process to minimize contention and improve memory cache coherence. 

For the Game of Life example, with the help of TBB we obtain equally good utilization of all available CPUs (and good computation speed) as the original manual multithreading attempt. Except we use considerably less code and the result is more scalable. 

For the Game Engine example the average performance of the TBB solution in most cases will be actually better than the manually crafted one because of TBB's adaptive load balancing capabilities.

 

A bird's-eye view of the library

Speaking generally about the whole library, there are several components at different levels of abstraction with higher levels being based on the lower levels' functionality:

As may already be clear, the task scheduler is at the heart of TBB, taking care of scheduling the chunks of work ("tasks"). These tasks are implicitly organized into a DAG (directed acyclic graph). If there is an edge from task A to task B in this graph, it means that "task A depends on finishing task B before it can proceed".

Another group of low-level facilities is synchronization primitives, which have mutexes and atomics. There are several flavors of mutexes, starting from lightweight spinlocks, that can be extremely efficient in low contention scenarios, and ending with the platform-independent wrappers of OS-level mutexes (the "heaviest" ones). Atomics is an abstraction of lightweight atomic operations, usually supported by hardware. 

Containers is a set of classes that implement thread-safe counterparts of the containers in C++ standard library (with somewhat different API, since the original one was not designed with multithreading in mind and is not particularly suited for that).  

Algorithms include the patterns that we've already seen ("parallel for", "task groups"), and a bunch of other ones in addition, such as "parallel reduce", "pipeline", "parallel sort".

Flow graph is the highest level concept (and a relatively new addition), which allows to explicitly build data processing graphs from different building blocks on the fly.

 Additionally, there are various low-level utilities, such as:
  • Efficient memory allocator, fine-tuned for heavily multithreaded scenarios.
  • System threads and thread local storage wrappers.
  • Robust and thread-safe timing facilities.
  • Cross-thread exception handling (the C++ standard, at least before C++11 does not provide any means for handling exceptions that are thrown in different threads).
Of course, much more information can be found in the user documentation.

 

Licensing and portability

The library is dual-licensed: there is a commercial license (which provides support), and there is an open-source one (GPLv2). There is no difference in features between two; the source code of the whole library is available to the curious reader. 

 It's worth mentioning that the open source version has so-called runtime exception, which means that linking dynamically with the library does not automatically force the resulting executable into GPLv2 licensing, essentially meaning that open source version of TBB can be used in closed-software applications. 

The officially supported platforms are Linux (and most of the distributions have the library included), Windows, MacOS. The supported compilers are the Intel compiler (naturally), GCC and Microsoft Visual Studio. Additionally, since the library is open source, there are several community patches to bring support to other platforms, such as FreeBSD, QNX, Xbox360 etc. The Solaris OS and Solaris Studio compilers are also (unofficially) supported but with some limitations.

Final words

TBB is not the only library out there that provides useful abstractions for multithreaded programming in C++, and it's not necessarily the one you need. There are quite a few others on the landscape and it helps to be aware of them.


As often happens with silver bullets, they don't exist. TBB is designed to be used in processing-heavy applications; it's not the library to use when, for example, most of your tasks are spending time blocked on IO (for example, waiting on a network socket). It does not help in any way with distributed processing, when computations are supposed to be spread between different nodes. Since threads still work with shared memory, it does not guarantee that your program will be race condition and deadlock free (however, it can reduce the possibility of these common errors, since the amount of shared state can be minimized thanks to the task-level parallelization). 

So, as always, it's a matter of common sense and having at least a general understanding of the library's internals. Armed with that, Intel Threading Building Blocks can be a very powerful tool for implementing computation-heavy parallel algorithms.  

11 comments:

Brian Herman said...

Awesome!

Brian Herman said...
This comment has been removed by the author.
slimshader said...

Nice post but why are you using POSIX/C threads instead of std::async and std::futures? They are portable and make the code much cleaner and expressive.

Mathias Gaunard said...

You can have much better macros to create lambdas. See Boost.Local

Ilya Loshchinin said...

Great tutorial, Ruslan! I especially liked the middleware clouds at the end.

ruslans said...

slimshader,

Good point, and I'd do that if I did not want to demonstrate how verbose a low-level implementation can be :)
(well, it was based on real-world examples I worked with, anyway)

Another thing is, std::async et al. are a higher-level abstraction on their own. Also they are a part of the C++11 standard, which is still relatively new in regards to compiler implementations. Even though of course it did not stop me from using lambdas in the examples :)

ruslans said...

Mathias,

I heard that boost has a lambdas library, but confess to never have tried using it. I only took a quick look some time ago, and it seemed to be not exactly what is needed here.

I may be wrong, of course; could not find much about Boost.Local in particular, however - could you please give some more details where do I look?

ruslans said...

Ilya,

Thanks, it's a pleasant surprise to see you here! :)

Chris P. said...

A very cool post! Reading this inspired me to check out TBB, and it's exciting stuff.

I implemented a full (textual) Game Of Life in C++ to help cement the ideas: http://github.com/cpowell/game_of_life

ruslans said...

Chris, that's very neat!

Thanks for getting inspired! :)

Ryan Zhou said...

Very good tutorial.I'm learning TBB by myself, and I feel much more clear. But How would you do with TBB if the state of cell is not only determined by the number of surrounding cells, but also depend on the state of them?