## Aug 26, 2013

### Growing prime factorization plants

Inspired by the wonderful Factorization Diagrams and Dancing Factors, I've come up with a little experiment in visualizing the prime factorization of integers.

A la "math is beautiful" and all that.

The idea is that prime factors of a number are drawn as "plants". Every prime factor contributes to a branching point in the "plant", the branches have "fruits" at the tips, and the total amount of fruits is thus equal to the number itself.

For example, here's what the factorizations of 20 (2x2x5) and 54 (2x3x3x3) look like:

I've hacked together a little program in C++; it's available on GitHub. Tested on Ubuntu Linux and Windows 7 (there is a project file for Visual Studio 2010, but you'll need to download and compile freeglut yourself in order for it to build).

The code there is as inefficient as it can get (so please don't take it as an example of how to write proper OpenGL programs), but for the sake of experiment it worked just fine.

Going a bit more into depth, at the heart of the whole thing is, of course, the integer factorization itself.

To keep things simple, it was done here with a very basic Trial Division method:

```cpp
template <typename T>
void prime_factors(T val, std::vector<T>& factors, std::vector<T>* primes_cache = NULL)
{
    std::vector<T> _primes;
    std::vector<T>& primes = primes_cache ? *primes_cache : _primes;

    factors.clear();

    T max_to_check = (T)sqrt((double)val) + 1;
    find_primes(max_to_check, primes);
    const int np = (int)primes.size();

    //  do trial division on the primes one by one
    T cval = val;
    int pidx = 0;
    while (pidx < np)
    {
        const T cp = primes[pidx];
        if (cp > max_to_check) break;
        while (cval % cp == 0)
        {
            factors.push_back(cp);
            cval /= cp;
        }
        pidx++;
    }

    //  whatever is left is itself a prime factor (bigger than sqrt(val))
    if (cval > 1) factors.push_back(cval);
}
```
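As a quick sanity check of the idea, a stripped-down trial division (my own throwaway here, dividing by every candidate rather than only by primes, and skipping the prime cache) has to produce the same factor lists:

```cpp
#include <vector>

//  simplified trial division: same idea, but without the primes list
template <typename T>
std::vector<T> trial_division(T val)
{
    std::vector<T> factors;
    for (T d = 2; d*d <= val; d++)
    {
        while (val % d == 0)
        {
            factors.push_back(d);  // d is guaranteed to be prime here:
            val /= d;              // all smaller factors were divided out
        }
    }
    if (val > 1) factors.push_back(val); // the leftover is a prime > sqrt(val)
    return factors;
}
```

For 20 this gives {2, 2, 5} and for 54 it gives {2, 3, 3, 3}, matching the plants above.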

The list of primes (find_primes) is obtained with a Sieve of Eratosthenes. With the latter I went a little bit overboard (alright... it was just kind of fun) and added a few optimizations, even though they were not needed for the particular problem at hand:
• the primes less than 1000 are precomputed
• only odd numbers get sieved out
• a bit array is used for the "sieved out" status
• sieving is done in "segments", in order to use less memory and improve cache coherence
• for the next prime cp, all the numbers less than the square of cp are not checked, since they have already been sieved out by smaller primes
So here's the function itself:

```cpp
template <typename T>
void find_primes(T up_bound, std::vector<T>& out_primes)
{
    if (out_primes.size() > 0 && out_primes.back() >= up_bound)
    {
        //  already have the needed prime range
        return;
    }

    const T BAKED_PRIMES[] = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97,
        101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199,
        211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293,
        307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397,
        401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499,
        503, 509, 521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599,
        601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701,
        709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787, 797,
        809, 811, 821, 823, 827, 829, 839, 853, 857, 859, 863, 877, 881, 883, 887,
        907, 911, 919, 929, 937, 941, 947, 953, 967, 971, 977, 983, 991, 997 };
    const int NUM_BAKED_PRIMES = sizeof(BAKED_PRIMES)/sizeof(BAKED_PRIMES[0]);

    //  search for the upper bound inside the baked range
    const T* ptop = std::lower_bound(BAKED_PRIMES, BAKED_PRIMES + NUM_BAKED_PRIMES, up_bound + 1);

    if (ptop < BAKED_PRIMES + NUM_BAKED_PRIMES)
    {
        //  got all the needed primes from the baked range
        out_primes = std::vector<T>(BAKED_PRIMES, ptop);
        return;
    }

    if ((int)out_primes.size() < NUM_BAKED_PRIMES)
    {
        //  take the whole baked primes range before starting to sieve
        out_primes = std::vector<T>(BAKED_PRIMES, BAKED_PRIMES + NUM_BAKED_PRIMES);
    }

    //  do the Sieve of Eratosthenes on the rest of the range, segment by segment
    const T SEGMENT_SIZE = 512;
    const T SIEVED_ARR_SIZE = (SEGMENT_SIZE + 7)/8;
    unsigned char sieved[SIEVED_ARR_SIZE]; // bit array of the "sieved" flags

    T last_sieved = out_primes.back();
    while (last_sieved < up_bound - 1)
    {
        const T num_sieved   = std::min(SEGMENT_SIZE, (up_bound - last_sieved)/2);
        const T first_sieved = last_sieved + 2;
        last_sieved = (num_sieved - 1)*2 + first_sieved;

        //  reset the "sieved" flags
        const T sieved_bytes = (num_sieved + 7)/8;
        std::fill(sieved, sieved + sieved_bytes, 0);

        //  sieve out the composites from the given range using the previously found primes
        //  (only the odd numbers are considered)
        const T max_to_check = (T)sqrt((double)last_sieved) + 1;
        for (int i = 1, e = (int)out_primes.size(); i < e; i++)
        {
            const T cp = out_primes[i];
            if (cp > max_to_check) break;
            const T fs  = std::max(first_sieved, cp*cp); // number to start sieving from
            const T fd  = ((fs + cp - 1)/cp)*cp;         // first multiple of cp in the range
            const T fdo = fd + cp*(1 - (fd&1));          // first odd multiple of cp in the range
            const T di  = (fdo - first_sieved)/2;        // index of the first odd multiple
            for (T j = di; j < num_sieved; j += cp)
            {
                //  set the corresponding "sieved" bit
                sieved[j/8] |= (1 << (j%8));
            }
        }

        //  collect all the primes (the numbers with the "sieved" bit not set)
        for (T i = 0; i < sieved_bytes; i++)
        {
            if (sieved[i] == 0xFF) continue; // no primes in the whole byte
            for (int j = 0; j < 8; j++)
            {
                const T n = (i*8 + j);
                if (!(sieved[i]&(1 << j)) && n < num_sieved)
                {
                    out_primes.push_back(n*2 + first_sieved);
                }
            }
        }
    }
}
```

It's a good example of how a simple piece of logic can eventually turn into an incomprehensible abomination. With all the evolution, history, bizarre optimizations and personality factors.

Anyway, getting to the visualization part, here's what the numbers from 1 to 16 look like:

From the prime_factors function we get the list of factors, already sorted from the smallest primes to the biggest ones (which naturally helps the visualization by avoiding too much branching early on).

Then we go recursively from the "leaves" (that is, from the last number in the factor list towards the first one), starting with a simple branch and a "fruit" at its tip:

With every prime number P from the list we repeat the same operation:
• all the branches we've got so far are cloned P times
• a new common root branch is added
• the parameters for the P cloned branches are computed, such that their bases are neatly arranged along the semi-circle at the top of the newly added root branch
• each one of the P cloned branches is transformed by a composition of:
  • a non-uniform scale: the scale along the X axis is such that the base can fit together with the other bases on the root's top semi-circle; the scale along the Y axis is fixed
  • a shift upwards, so that the base lies on the root's top semi-circle
  • a rotation by the corresponding angle, so that all the P cloned branches are laid out evenly
  • another shift upwards, to give space for the new root branch
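To make the layout step concrete, here is a rough sketch of how the clone bases could be spread along the root's top semi-circle (this is my reconstruction of the geometry, not the actual project code; the half-step angular offset is an assumption):

```cpp
#include <cmath>
#include <vector>

struct BranchBase { double x, y, angle; };

//  spread the bases of P cloned branches evenly along the top
//  semi-circle (angles 0..pi) of a root branch with the given radius
std::vector<BranchBase> layout_clones(int P, double root_radius)
{
    const double PI = 3.14159265358979323846;
    std::vector<BranchBase> bases;
    for (int k = 0; k < P; k++)
    {
        //  half-step offset keeps the outermost branches off the horizontal
        const double a = PI*(k + 0.5)/P;
        BranchBase b;
        b.x     = root_radius*std::cos(a);
        b.y     = root_radius*std::sin(a);
        b.angle = a - PI/2;  // tilt relative to "straight up"
        bases.push_back(b);
    }
    return bases;
}
```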

It's worth noticing that going leaves-to-root gives the tree-building algorithm a complexity of O(N*logN), while going in the opposite direction could probably be done in O(N), since we'd only need to transform every branch once. But that would only matter if we needed to optimize the algorithm, which at the moment we don't.

Because the scale that we apply on every step is non-uniform, the branches get "skewed", and this effect gets more pronounced towards the leaves. This may give the whole thing a rather bizarre look (which, let's say, is fine from the point of view of "aesthetics"):

When a prime factor is too big, its "branches" are represented just with "fruits", arranged into a "bud" (or "inflorescence") using a Fermat's spiral layout. Here's how 347 (a prime number) looks:
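The Fermat's spiral layout is pleasantly simple: fruit number k sits at a radius proportional to sqrt(k), rotated by a fixed angular step per fruit (the golden angle here is my choice, not necessarily the program's; it gives the familiar sunflower-like packing):

```cpp
#include <cmath>
#include <vector>

struct Fruit { double x, y; };

//  Fermat's spiral: r = c*sqrt(k), theta = k*step
std::vector<Fruit> fermat_spiral(int n, double c)
{
    const double GOLDEN_ANGLE = 2.39996322972865332;  // radians, ~137.5 deg
    std::vector<Fruit> fruits;
    for (int k = 0; k < n; k++)
    {
        const double r = c*std::sqrt((double)k);
        const double t = k*GOLDEN_ANGLE;
        const Fruit f = { r*std::cos(t), r*std::sin(t) };
        fruits.push_back(f);
    }
    return fruits;
}
```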

The buds are also drawn at the higher levels, although the non-uniform scale makes them look somewhat weird (and it's not obvious how exactly to scale the fruit radii, so I am just using some kind of average scale). That's what 1908 (2x2x3x3x53) turns into:

The fruits are colored randomly, using a "bipolar" green-purple color map... just because it looked funky this way (and also because I think that Kenneth Moreland's work is cool).
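A "bipolar" map in this sense just interpolates from one pole through a neutral midpoint to the other pole; here's a minimal sketch (the exact endpoint colors in the program are my guess):

```cpp
struct RGB { float r, g, b; };

//  piecewise-linear green-to-purple map: t in [0, 1], neutral at t = 0.5
RGB bipolar(float t)
{
    const RGB GREEN   = { 0.0f, 0.7f, 0.2f };
    const RGB NEUTRAL = { 0.9f, 0.9f, 0.9f };
    const RGB PURPLE  = { 0.5f, 0.1f, 0.6f };
    const RGB& a = (t < 0.5f) ? GREEN   : NEUTRAL;
    const RGB& b = (t < 0.5f) ? NEUTRAL : PURPLE;
    const float u = (t < 0.5f) ? t*2.0f : (t - 0.5f)*2.0f;
    const RGB res = { a.r + (b.r - a.r)*u,
                      a.g + (b.g - a.g)*u,
                      a.b + (b.b - a.b)*u };
    return res;
}
```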

In the test program, the color map can be toggled on/off with the 'c' key:

So that's about it, we've got our "factorization plants".

Turns out, prime factorization is a fascinating problem, and an important one in the computing world: it sits, for example, at the core of widely used cryptography algorithms such as RSA. It's a computationally hard problem as well.

Getting a little bit of a visual insight into it feels like a little tiny step into understanding the problem. Even though in reality it's just some amusing pictures.

But hey, it looks nice, so who cares.

## Aug 15, 2013

### Introduction to high-level multithreading in C++ via Intel TBB library

"The free lunch is over!"
"Amdahl's law will find and punish you!.."

You've probably already heard these maxims. Moms use them to scare kids who are throwing spaghetti all over the table and using global variables all day long.

And you know it's fair since by now you've learned (the hard way, presumably) that writing parallel software is indeed a complex endeavor. Or is it?

There is a practical observation that a software problem's complexity is at least twofold:
• "Inherent" or unavoidable complexity that is at the essence of the problem.
• "Accidental" complexity which covers the work that does not really have much to do with the problem at hand but needs doing anyway.
People have been developing new programming techniques, higher level languages, design patterns, frameworks and methodologies; all in order to mitigate the accidental complexity and to be able to focus only on the essential parts, thus improving productivity and reducing development risks.

Would it be possible to sweep under the rug some of the work that makes multithreaded programming so complex?

It turns out that this is exactly what the Intel Threading Building Blocks (Intel® TBB) library can help with and so, to illustrate where this is leading, let us first consider a simple problem.

### The Game of Life

Say we are implementing a classic simulation, Conway's Game of Life, a cellular automaton defined by simple rules.

The "world" is a big rectangular array of cells, each cell being in one of two possible states: alive or dead. The initial state of the board is random and at each discrete "epoch" we recompute the state of every cell such that:
• A live cell will die if it has less than two (under-population) or more than three (overcrowding) neighbors.
• A dead cell with exactly three neighbors will come alive (reproduction).
• Otherwise the state will remain unchanged.
The evolution rules for a single cell may be expressed by a C++ function:

```cpp
static const int DEAD = 0, ALIVE = 1;
//  returns the cell's new state based on the evolution rules
int evolve_cell(int cur_state, int num_neighbors) {
    if (cur_state == ALIVE) {
        if (num_neighbors < 2 ||  // under-population
            num_neighbors > 3)    // overcrowding
            return DEAD;
    } else {
        if (num_neighbors == 3) return ALIVE; //  reproduction
    }
    return cur_state;
}
```

The rules are applied to each cell on the board, i.e. we have a big for-loop where, for each cell, we compute its state for the next epoch based on the number of live neighbors:

```cpp
//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    for (int i = 0; i < num_cells(); i++) {
        cell[i] = evolve_cell(prev_board.cell[i], prev_board.num_neighbors(i));
    }
}
```
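The num_neighbors helper isn't shown in the snippets; a plausible version for a row-major board with wrap-around edges (both are assumptions of mine, a real board might clamp at the borders instead) could look like:

```cpp
#include <vector>

static const int DEAD = 0, ALIVE = 1;

//  a toy board to illustrate neighbor counting (the real
//  game_of_life_board class is more elaborate)
struct toy_board {
    int w, h;
    std::vector<int> cell;
    toy_board(int w_, int h_) : w(w_), h(h_), cell(w_*h_, DEAD) {}

    //  counts the alive cells among the 8 neighbors, wrapping around edges
    int num_neighbors(int i) const {
        int x = i % w, y = i / w, count = 0;
        for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + w) % w;
            int ny = (y + dy + h) % h;
            count += cell[ny*w + nx];
        }
        return count;
    }
};
```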
Here is how a typical board might look after some iterations:

Although the algorithm is trivial, if there are many cells to simulate the processing time will add up and we might want to start optimizing.

### Trying to optimize the code

The right thing to do here would be to try to improve the algorithm first (after having made sure that we are solving the right problem). If that does not help, then start to pay attention to things like memory access patterns, branch prediction, loop unrolling etc. Let's say we also hit a wall at this point. What do we do next?

When the program runs on an Intel i7 laptop, the CPU usage looks like this:

There are 4 CPU cores but our computation only runs on a single core at a time. In practice the operating system makes it "jump" between different cores due to the presence of other processes on the system and the specifics of the OS process scheduler. If we ask the system not to do that (by setting the process CPU affinity), the picture changes to:

As you can see, the 4 CPU cores are largely under-utilized: we are effectively using only one core. This probably means that we could potentially enlist the three other cores and improve the speed.

So, how do we approach this?

One thing to notice about this for-loop is that the computation on a cell does not modify any state that the other cells' computations might be affected by. Such algorithms are sometimes called "embarrassingly parallel"; that is, splitting the problem into parallel tasks is relatively effortless.

Let's try to leverage this fact.

The traditional and the most basic approach would be to:
1. Somehow partition the whole computation into several "jobs" that can run in parallel.
2. Create a separate thread for each job and run the processing in each thread.
3. Wait until all of the threads have finished and destroy the threads.
So we might end up with code looking like this:

```cpp
//  structure describing a chunk of work running on a single thread
struct JobDesc
{
    int first_cell;
    int num_cells;
    const game_of_life_board* prev_board;
    game_of_life_board*       new_board;
};

extern "C" {
void* RunJob(void* p) {
    JobDesc* job = reinterpret_cast<JobDesc*>(p);
    //  for each cell in the job, compute its state for the next epoch,
    //  based on the number of the alive neighbours
    int last_cell = job->first_cell + job->num_cells;
    for (int i = job->first_cell; i < last_cell; i++) {
        job->new_board->cell[i] = evolve_cell(job->prev_board->cell[i],
                                              job->prev_board->num_neighbors(i));
    }
    return 0;
}
}

//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    pthread_t threads[NUM_THREADS];
    JobDesc   jobs[NUM_THREADS];
    const int cells_per_job = (num_cells() + NUM_THREADS - 1)/NUM_THREADS;

    //  create the threads and start processing
    for (int i = 0; i < NUM_THREADS; i++) {
        //  fill in the i-th job details
        JobDesc& job   = jobs[i];
        job.first_cell = cells_per_job*i;
        job.num_cells  = std::min(cells_per_job, num_cells() - cells_per_job*i);
        job.prev_board = &prev_board;
        job.new_board  = this;
        pthread_create(&threads[i], NULL, RunJob, &jobs[i]);
    }

    //  wait for all the threads to finish processing
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
}
```

Which indeed seems to help:

We use all of the 4 cores to the brim and the program runs considerably faster. Mission accomplished?

Let's just take a step back and see what we did here. There was a for-loop and we said: "Hey, this loop should be run in parallel, on different cores". That's all we wanted to happen but then we wrote half a hundred lines of (rather fragile, bug-prone and mostly boilerplate) threading code just to implement this simple notion.

This reeks of accidental complexity. To make matters worse, there are other not-quite-so-clear details:
• We create and destroy the threads every time: depending on the operating system, there might be noticeable overhead in doing so.
• What if we have a different number of cores - will our program scale easily to use all of the available CPU power? How many threads do we need to create, in general?
• The implementation uses POSIX threads which makes it platform-dependent. What if we want to run it on platforms that use a different threading API?
Even with such a basic example we have turned a simple piece of code into something that is vulnerable to many potential problems and that will be hard to maintain.
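On the "how many threads" question at least, C++11 gives a portable way to query the logical core count (it's a hint, not a guarantee, and it may even return 0 when the count is unknown):

```cpp
#include <algorithm>
#include <thread>

//  one worker per logical core, with a fallback when the count is unknown
unsigned pick_num_threads()
{
    const unsigned n = std::thread::hardware_concurrency();
    return std::max(1u, n);  // hardware_concurrency() may return 0
}
```

Hardcoding NUM_THREADS, as in the snippet above, is exactly the kind of detail that stops the code from scaling.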

### A Game Engine example

Let us consider a more elaborate example. Say, we are developing a game engine, which inside the game loop runs a "world update" function, which in turn does simulation on different things, such as updating skeletal animations on the visible models, particle systems, physics simulation etc.

The code might look like this:
```cpp
void GameWorld::Update() {
    StepAnimations();
    StepParticles();
    StepPhysics();
}

void GameWorld::StepAnimations() {
    //  update skeletal animations for every currently visible model
    for (int i = 0; i < (int)m_Animations.size(); i++) {
        //  ...
    }
}

void GameWorld::StepParticles() {
    //  update particle systems
    for (int i = 0; i < (int)m_Particles.size(); i++) {
        //  ...
    }
}

void GameWorld::StepPhysics() {
    //  perform physics integration for every disjoint physics "island"
    for (int i = 0; i < (int)m_PhysicsIslands.size(); i++) {
        //  ...
    }
}
```

This problem, at least in our simplified formulation, is also "embarrassingly parallel", except that now the parallelism manifests itself on multiple levels. The high-level intent might sound like: "Hey, there are these different tasks (StepAnimations/Particles/Physics/AI) that could be run in parallel... and each of them has a loop that can be run in parallel as well. Oh, and we might also want to run the whole thing in parallel with something else."

If we approach this problem the same way as the Game of Life one, we instantly notice that most of the required boilerplate code is very similar to what we already wrote.

The partitioning of the computations into different threads becomes more complicated: how do we split the work into N (the number of threads) equally-sized parallel chunks? Given that there may be several levels of these parallelizable tasks and that at each level the amount of work can be data-dependent, the task rapidly becomes intractable in most practical cases.

A common design pattern used to tackle this problem is to use a global task queue and a thread pool. Every worker thread from the pool takes a task from the top of the task queue, completes it and then takes the next one available until the queue is empty. The tasks themselves are relatively fine-grained: small enough to distribute the work evenly, but not so small that scheduling overhead starts to dominate.

As simple as this may sound conceptually, implementing it from scratch, correctly and efficiently, is even more complicated. Just a few things worth mentioning:
• Operations on the task queue should be thread-safe (e.g. the classical producer/consumer problem might require two condition variables).
• From a performance point of view, thread contention should be avoided as much as possible: that is, threads should not spend too much time blocked on the thread queue access.
• Cache behavior should be taken into account: subsequent tasks running on the same CPU should operate on nearby memory locations when possible.
In other words, the CPUs should spend most of their time doing useful computations and not waste cycles on scheduling, blocking or waiting for other threads to finish.
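To get a feel for what "thread-safe operations on the task queue" entail, here is a deliberately naive C++11 sketch of the pattern (one global queue, one lock, one condition variable; a serious implementation would work hard to avoid exactly this kind of contention):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

//  a deliberately minimal thread pool: one global queue, one lock
class TinyPool {
    std::queue<std::function<void()> > tasks;
    std::mutex               mtx;
    std::condition_variable  cv;
    std::vector<std::thread> workers;
    bool done = false;
public:
    explicit TinyPool(unsigned n) {
        for (unsigned i = 0; i < n; i++)
            workers.emplace_back([this]() {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this]() { return done || !tasks.empty(); });
                        if (tasks.empty()) return;  // done and fully drained
                        task = tasks.front();
                        tasks.pop();
                    }
                    task();  // run outside the lock
                }
            });
    }
    void run(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mtx); tasks.push(task); }
        cv.notify_one();
    }
    ~TinyPool() {  // "wait" equivalent: drain the queue, then join
        { std::lock_guard<std::mutex> lock(mtx); done = true; }
        cv.notify_all();
        for (size_t i = 0; i < workers.size(); i++) workers[i].join();
    }
};
```

Every detail here (lock scope, the wait predicate, draining on shutdown) is a potential bug waiting to happen, which is rather the point.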

### Enter Intel TBB

Let's slow down on piling up this article's accidental complexity and finally get to the point.

The Intel Threading Building Blocks library tries to address the aforementioned concerns and help implement parallel algorithms on a higher level of abstraction, factoring away the boilerplate threading code, thread pools, task queuing/scheduling and other patterns that prevail in multithreaded code. Furthermore, the goal is to do so in a platform-independent and efficient way.

It is a pure C++ library, meaning that it uses whatever capabilities the existing platforms and compilers already provide (in contrast to the OpenMP standard, amongst others).

Getting back to our original Game of Life main loop, using the TBB library we can make it parallel like so:

```cpp
#include <tbb/tbb.h>
// ...
//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    //  for each cell, compute in parallel its state for the next epoch,
    //  based on the number of the alive neighbours
    tbb::parallel_for(0, num_cells(), [&](int i) {
        cell[i] = evolve_cell(prev_board.cell[i],
                              prev_board.num_neighbors(i));
    });
}
```
As you can see, this time it looks very similar to the serial (non-parallel) version, which is, of course, a good thing.

It uses the parallel_for primitive; in practice it's the workhorse of parallel computations, especially the "embarrassingly parallel" ones. This code example also takes advantage of the C++11 lambda expression syntax, which allows the loop-body callback to be expressed in a concise (albeit somewhat unusual) way.

A side note regarding the lambda expressions: if your C++ compiler does not support them (even though nowadays most of them do), then it's possible to use C++ functors. However, the syntax gets a bit more verbose:

```cpp
struct EvolveFn {
    //  captured data
    const game_of_life_board& prev_board;
    game_of_life_board&       _this;

    //  constructor
    EvolveFn(const game_of_life_board& prev_board_, game_of_life_board& _this_)
        : prev_board(prev_board_),
          _this(_this_) {}

    //  the loop body
    void operator ()(int i) const {
        _this.cell[i] = evolve_cell(prev_board.cell[i],
                                    prev_board.num_neighbors(i));
    }
};

//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    //  for each cell, compute in parallel its state for the next epoch,
    //  based on the number of the alive neighbours
    tbb::parallel_for(0, num_cells(), EvolveFn(prev_board, *this));
}
```

If that looks too unwieldy, it's possible to come up with some macro hacks and, instead of directly writing out a C++ functor, have something like this:

```cpp
//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board)
{
    //  for each cell, compute in parallel its state for the next epoch,
    //  based on the number of the alive neighbours
    game_of_life_board* _this = this;
    LAMBDA_OPEN1(lambda, int, i)
    {
        _this->cell[i] = evolve_cell(prev_board.cell[i],
                                     prev_board.num_neighbors(i));
    }
    LAMBDA_CLOSE2(lambda,
        const game_of_life_board, prev_board, game_of_life_board*, _this);

    tbb::parallel_for(0, num_cells(), lambda);
}
```

Moving on, there is another basic high-level pattern that is useful: "run these different computations in parallel and wait for all of them to finish before continuing". This is what task groups in TBB are used for. The earlier Game Engine example code now becomes:

```cpp
#include <tbb/tbb.h>

void GameWorld::Update() {
    tbb::task_group tg;

    tg.run([&]() { StepAnimations(); });
    tg.run([&]() { StepParticles();  });
    tg.run([&]() { StepPhysics();    });

    tg.wait();
}

void GameWorld::StepAnimations() {
    //  update skeletal animations for every currently visible model
    tbb::parallel_for(0, (int)m_Animations.size(), [&](int i) {
        //  ...
    });
}

void GameWorld::StepParticles() {
    //  update particle systems
    tbb::parallel_for(0, (int)m_Particles.size(), [&](int i) {
        //  ...
    });
}

void GameWorld::StepPhysics() {
    //  perform physics integration for every disjoint physics "island"
    tbb::parallel_for(0, (int)m_PhysicsIslands.size(), [&](int i) {
        //  ...
    });
}
```

Once again we have code that is similar to the original serial version yet this time it is parallel. TBB is taking care of the gritty scheduling details behind the scenes and trying to do so in an efficient way.

### The magic that happens behind the scenes

We tell TBB about the boundaries of the "tasks" and the dependencies between them. In the previous example, we did it explicitly with the task group and implicitly with the parallel_for loops (leaving TBB to decide how to partition the iteration space into different chunks of work).

These tasks are then arranged in correct order between different threads from a thread pool. The "task queue" concept is also implemented, with one notable difference: each thread has its own task queue instead of using a single global one.

This solves the problem with access contention on a global task queue but adds a new danger: some threads might empty their queues early and become idle. To mitigate this, TBB implements so-called task stealing: when a thread runs out of tasks, it tries to "steal" some from the back of another thread's queue. There is a bunch of heuristics and fine-tuning involved in this process to minimize contention and improve memory cache coherence.

For the Game of Life example, with the help of TBB we obtain equally good utilization of all the available CPUs (and equally good computation speed) as with the original manual multithreading attempt, except we use considerably less code and the result is more scalable.

For the Game Engine example the average performance of the TBB solution in most cases will be actually better than the manually crafted one because of TBB's adaptive load balancing capabilities.

### A bird's-eye view of the library

Speaking generally about the whole library, there are several components at different levels of abstraction with higher levels being based on the lower levels' functionality:

As may already be clear, the task scheduler is at the heart of TBB, taking care of scheduling the chunks of work ("tasks"). These tasks are implicitly organized into a DAG (directed acyclic graph). If there is an edge from task A to task B in this graph, it means that "task A depends on finishing task B before it can proceed".

Another group of low-level facilities is the synchronization primitives: mutexes and atomics. There are several flavors of mutexes, from lightweight spinlocks, which can be extremely efficient in low-contention scenarios, to platform-independent wrappers of OS-level mutexes (the "heaviest" ones). Atomics are an abstraction of lightweight atomic operations, usually supported directly by hardware.
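For illustration, here is what the lightweight end of that spectrum looks like when rebuilt from C++11 atomics (an analogous sketch, not TBB's actual tbb::spin_mutex implementation):

```cpp
#include <atomic>
#include <thread>
#include <vector>

//  a TBB-style spinlock: busy-wait on an atomic flag
//  (cheap under low contention, wasteful under high contention)
class SpinLock {
    std::atomic_flag flag;
public:
    SpinLock() { flag.clear(); }
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

//  increment a shared counter from several threads under the spinlock
long spin_count(int num_threads, int iters)
{
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; t++)
        threads.emplace_back([&]() {
            for (int i = 0; i < iters; i++) {
                lock.lock();
                counter++;
                lock.unlock();
            }
        });
    for (size_t t = 0; t < threads.size(); t++) threads[t].join();
    return counter;
}
```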

Containers are a set of classes that implement thread-safe counterparts of the C++ standard library containers (with a somewhat different API, since the original one was not designed with multithreading in mind and is not particularly suited for it).

Algorithms include the patterns we've already seen ("parallel for", "task groups"), plus a bunch of others, such as "parallel reduce", "pipeline" and "parallel sort".
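"Parallel reduce", for instance, computes an associative reduction by splitting the range, reducing the halves concurrently and combining the results. A toy version built on std::async conveys the shape of it (the real tbb::parallel_reduce takes a range, an identity value, a body and a combiner, and does its own partitioning):

```cpp
#include <future>
#include <numeric>
#include <vector>

//  toy recursive parallel reduction: split, reduce the halves
//  concurrently, then combine
long parallel_sum(const std::vector<int>& v, size_t lo, size_t hi)
{
    if (hi - lo < 1000)  // "grain size": below this, reduce serially
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
    const size_t mid = lo + (hi - lo)/2;
    auto right = std::async(std::launch::async,
                            parallel_sum, std::cref(v), mid, hi);
    const long left = parallel_sum(v, lo, mid);
    return left + right.get();
}
```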

Flow graph is the highest-level concept (and a relatively new addition), which allows explicitly building data-processing graphs from different building blocks on the fly.

Additionally, there are various low-level utilities, such as:
• Efficient memory allocator, fine-tuned for heavily multithreaded scenarios.
• Robust and thread-safe timing facilities.
• Cross-thread exception handling (the C++ standard, at least before C++11, does not provide any means for handling exceptions thrown in different threads).
Of course, much more information can be found in the user documentation.

### Licensing and portability

The library is dual-licensed: there is a commercial license (which provides support), and there is an open-source one (GPLv2). There is no difference in features between the two, and the source code of the whole library is available to the curious reader.

It's worth mentioning that the open source version comes with the so-called runtime exception, which means that linking dynamically against the library does not force the resulting executable under GPLv2, essentially meaning that the open source version of TBB can be used in closed-source applications.

The officially supported platforms are Linux (most distributions have the library packaged), Windows and Mac OS X. The supported compilers are the Intel compiler (naturally), GCC and Microsoft Visual C++. Additionally, since the library is open source, there are several community patches bringing support to other platforms, such as FreeBSD, QNX, Xbox 360 etc. The Solaris OS and Solaris Studio compilers are also (unofficially) supported, but with some limitations.

### Final words

TBB is not the only library out there that provides useful abstractions for multithreaded programming in C++, and it's not necessarily the one you need. There are quite a few others on the landscape and it helps to be aware of them.

As often happens with silver bullets, they don't exist. TBB is designed to be used in processing-heavy applications; it's not the library to use when, for example, most of your tasks spend their time blocked on IO (say, waiting on a network socket). It does not help in any way with distributed processing, where computations are supposed to be spread between different nodes. And since threads still work with shared memory, it does not guarantee that your program will be free of race conditions and deadlocks (however, it can reduce the likelihood of these common errors, since task-level parallelization lets you minimize the amount of shared state).

So, as always, it's a matter of common sense and having at least a general understanding of the library's internals. Armed with that, Intel Threading Building Blocks can be a very powerful tool for implementing computation-heavy parallel algorithms.