8/26/13

Growing prime factorization plants

Inspired by the wonderful Factorization Diagrams and Dancing Factors, I've come up with a little experiment in visualizing the prime factorization of integers.

A la "math is beautiful" and all that.

The idea is that the prime factors of a number are drawn as "plants". Every prime factor contributes a branching point in the "plant", the branches have "fruits" at their tips, and the total number of fruits thus equals the number itself.

For example, here's what the factorizations of 20 (2x2x5) and 54 (2x3x3x3) look like:


I've hacked together a little program in C++; it's available on github. It has been tested on Ubuntu Linux and Windows 7 (there is a project file for Visual Studio 2010, but you'll need to download and compile freeglut yourself in order for it to build).

The code there is as inefficient as it can get (so please don't take it as an example of how to write proper OpenGL programs), but for the sake of the experiment it worked just fine.

Going into a bit more depth: at the heart of the whole thing is, of course, integer factorization itself.

To keep things simple, it is done here with the very basic trial division method:

#include <cmath>
#include <vector>

//  defined below: fills out_primes with all primes up to up_bound
template <typename T>
void find_primes(T up_bound, std::vector<T>& out_primes);

template <typename T>
void prime_factors(T val, std::vector<T>& factors, std::vector<T>* primes_cache = NULL)
{
    std::vector<T> _primes;
    std::vector<T>& primes = primes_cache ? *primes_cache : _primes;

    factors.clear();

    T max_to_check = (T)sqrt((double)val) + 1;
    find_primes(max_to_check, primes);
    const int np = (int)primes.size();

    //  do trial division on the primes one by one
    T cval = val;
    int pidx = 0;
    while (pidx < np)
    {
        const T cp = primes[pidx];
        if (cp > max_to_check) break;
        while (cval % cp == 0) 
        {
            factors.push_back(cp);
            cval /= cp;
        }
        pidx++;
    }

    if (cval > 1) factors.push_back(cval);
}
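
For example, calling it on the two numbers from the picture above gives (a tiny usage sketch, assuming the function above is in scope):

std::vector<int> factors;
prime_factors(20, factors);   // factors = {2, 2, 5}
prime_factors(54, factors);   // factors = {2, 3, 3, 3}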


The list of primes (find_primes) is obtained with a Sieve of Eratosthenes. With the latter I went a little bit overboard (alright... it was just kind of fun) and added a few optimizations, even though they were not needed for the particular problem at hand:
  • the primes less than 1000 are precomputed
  • only odd numbers get sieved out
  • a bit array is used for the "sieved out" status
  • sieving is done in "segments", in order to use less memory and improve cache locality
  • for the next prime cp, the numbers less than the square of cp are not checked, since they have already been sieved out by smaller primes
So here's the function itself:

#include <algorithm>
#include <cmath>
#include <vector>

template <typename T>
void find_primes(T up_bound, std::vector<T>& out_primes)
{
    if (out_primes.size() > 0 && out_primes.back() >= up_bound) 
    {
        //  already have the needed prime range
        return;
    }

    const T BAKED_PRIMES[] = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 
        101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 
        211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 
        307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397, 
        401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 
        503, 509, 521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599, 
        601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701, 
        709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787, 797, 
        809, 811, 821, 823, 827, 829, 839, 853, 857, 859, 863, 877, 881, 883, 887, 
        907, 911, 919, 929, 937, 941, 947, 953, 967, 971, 977, 983, 991, 997 };
    const int NUM_BAKED_PRIMES = sizeof(BAKED_PRIMES)/sizeof(BAKED_PRIMES[0]);

    //  search for the upper bound inside the baked range
    const T* ptop = std::lower_bound(BAKED_PRIMES, BAKED_PRIMES + NUM_BAKED_PRIMES, up_bound + 1);
        
    if (ptop < BAKED_PRIMES + NUM_BAKED_PRIMES)
    {
        //  got all the needed primes from the baked range
        out_primes = std::vector<T>(BAKED_PRIMES, ptop);
        return;
    }

    if ((int)out_primes.size() < NUM_BAKED_PRIMES)
    {
        //  take the whole baked primes range before starting to sieve
        out_primes = std::vector<T>(BAKED_PRIMES, BAKED_PRIMES + NUM_BAKED_PRIMES);
    }

    //  do the Sieve of Eratosthenes on the rest of the range, segment by segment
    const T SEGMENT_SIZE = 512;
    const T SIEVED_ARR_SIZE = (SEGMENT_SIZE + 7)/8;
    unsigned char sieved[SIEVED_ARR_SIZE]; // bit array of the "sieved" flags

    T last_sieved = out_primes.back();
    while (last_sieved < up_bound - 1)
    {
        const T num_sieved   = std::min(SEGMENT_SIZE, (up_bound - last_sieved)/2);
        const T first_sieved = last_sieved + 2;
        last_sieved = (num_sieved - 1)*2 + first_sieved;
    
        //  reset the "sieved" flags
        const T sieved_bytes = (num_sieved + 7)/8;
        std::fill(sieved, sieved + sieved_bytes, 0);

        //  sieve out the composites from the given range using the previously found primes 
        //  (only the odd numbers are considered)
        const T max_to_check = (T)sqrt((double)last_sieved) + 1;
        for (int i = 1, e = out_primes.size(); i < e; i++)
        {
            const T cp = out_primes[i];
            if (cp > max_to_check) break;
            const T fs   = std::max(first_sieved, cp*cp); // number to start sieving from
            const T fd   = ((fs + cp - 1)/cp)*cp;         // first dividend in the range
            const T fdo  = fd + cp*(1 - (fd&1));          // first odd dividend in the range
            const T di   = (fdo - first_sieved)/2;        // index of the first odd dividend
            for (T j = di; j < num_sieved; j += cp) 
            {
                //  set the corresponding "sieved" bit
                sieved[j/8] |= (1 << (j%8));
            }
        }

        //  collect all the primes (the numbers with the "sieved" bit not set)
        for (T i = 0; i < sieved_bytes; i++) 
        {
            if (sieved[i] == 0xFF) continue; // no primes in the whole byte
            for (int j = 0; j < 8; j++)
            {
                const T n = (i*8 + j);
                if (!(sieved[i]&(1 << j)) && n < num_sieved)
                {
                    out_primes.push_back(n*2 + first_sieved);
                }
            }
        }
    }
}
      
It's a good example of how a simple piece of logic can eventually turn into an incomprehensible abomination, what with all the evolution, history, bizarre optimizations and personality factors.

Anyway, getting to the visualization part, here's what the numbers from 1 to 16 look like:

 

From the prime_factors function we get the list of factors, already sorted from the smallest primes to the biggest ones (which naturally helps our visualization by avoiding too much branching early on).

Then we go recursively from the "leaves" (that is, from the last number in the factor list towards the first one), starting with a simple branch and a "fruit" at its tip:


With every prime number P from the list we repeat the same operation (a rough sketch of this step in code follows the list): 
  • all the branches we've got so far are cloned P times
  • a new common root branch is added
  • the parameters for the P cloned branches are computed, such that their bases are neatly arranged along the semi-circle at the top of the newly added root branch
  • each one from the P cloned branches is transformed by a composition of:
    • a non-uniform scale: the scale along the X axis is such that the base fits together with the other bases on the root's top semi-circle; the scale along the Y axis is fixed
    • a shift upwards, so that the base lies on the root's top semi-circle
    • a rotation by the corresponding angle, so that all the P cloned branches are laid out evenly
    • another shift upwards, to give space for the new root branch
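
Here's a rough sketch of that cloning step in code. It is not the actual code from the repository; the Xform type, the Branch struct and the particular scale/offset constants are simplified placeholders, just to show the order in which the transforms are composed:

#include <cmath>
#include <vector>

//  minimal 2D affine transform, just enough for the composition described above
struct Xform {
    float m[6]; // linear part {m0 m2; m1 m3}, translation (m4, m5)
    static Xform scale(float sx, float sy)     { Xform r = {{sx, 0, 0, sy, 0, 0}}; return r; }
    static Xform translate(float tx, float ty) { Xform r = {{1, 0, 0, 1, tx, ty}}; return r; }
    static Xform rotate(float a) { float c = std::cos(a), s = std::sin(a); Xform r = {{c, s, -s, c, 0, 0}}; return r; }
};

//  composition: (a*b) applies b first, then a
Xform operator*(const Xform& a, const Xform& b) {
    Xform r;
    r.m[0] = a.m[0]*b.m[0] + a.m[2]*b.m[1];  r.m[1] = a.m[1]*b.m[0] + a.m[3]*b.m[1];
    r.m[2] = a.m[0]*b.m[2] + a.m[2]*b.m[3];  r.m[3] = a.m[1]*b.m[2] + a.m[3]*b.m[3];
    r.m[4] = a.m[0]*b.m[4] + a.m[2]*b.m[5] + a.m[4];
    r.m[5] = a.m[1]*b.m[4] + a.m[3]*b.m[5] + a.m[5];
    return r;
}

struct Branch { Xform xf; };

//  for prime P: clone the branches built so far P times and arrange the clones
//  along the top semi-circle of a new root branch of length root_len
std::vector<Branch> add_prime_level(const std::vector<Branch>& branches, int P,
                                    float root_len, float scale_y)
{
    const float PI = 3.14159265f;
    std::vector<Branch> result;
    const float scale_x = 1.0f/(float)P;  // so that P bases fit along the semi-circle
    for (int k = 0; k < P; k++) {
        //  angle that lays the k-th clone out evenly on the semi-circle
        const float angle = PI*((float)k + 0.5f)/(float)P - 0.5f*PI;
        const Xform xf = Xform::translate(0, root_len)    // make room for the new root branch
                       * Xform::rotate(angle)             // spread the clones evenly
                       * Xform::translate(0, 1.0f)        // lift the base onto the semi-circle
                       * Xform::scale(scale_x, scale_y);  // the non-uniform scale
        for (size_t i = 0; i < branches.size(); i++) {
            Branch b = branches[i];
            b.xf = xf*b.xf;          // every existing branch is transformed once per level
            result.push_back(b);
        }
    }
    return result;
}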

It's worth noticing that going leaves-to-root gives the tree-building algorithm a complexity of O(N*logN), while going in the opposite direction could probably be done in O(N), since we'd only need to transform every branch once. But that would only matter if we needed to optimize the algorithm, which at the moment we don't.

Because the scale that we apply at every step is non-uniform, the branches get "skewed", and this effect gets more pronounced towards the leaves. This may give the whole thing a rather bizarre look (which, let's say, is fine from the point of view of "aesthetics"):


When a prime factor is too big, the "branches" are represented just by "fruits", arranged into a "bud" (or "inflorescence") using a Fermat's spiral layout. Here's what 347 (a prime number) looks like:


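For reference, the Fermat's spiral ("sunflower") layout itself is just a few lines; this is a generic sketch with an arbitrary radius scale c, not necessarily the exact constants used in the program:

#include <cmath>
#include <utility>
#include <vector>

//  returns n points laid out on a Fermat's spiral: the radius grows as sqrt(k),
//  while the angle advances by the golden angle on every step
std::vector<std::pair<float, float> > fermat_spiral(int n, float c)
{
    const float GOLDEN_ANGLE = 2.399963f; // radians, pi*(3 - sqrt(5))
    std::vector<std::pair<float, float> > pts;
    for (int k = 0; k < n; k++) {
        const float r = c*std::sqrt((float)k);
        const float a = GOLDEN_ANGLE*(float)k;
        pts.push_back(std::make_pair(r*std::cos(a), r*std::sin(a)));
    }
    return pts;
}
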
The buds are also drawn at the higher levels, although the non-uniform scale makes them look somewhat weird (and it's not obvious how exactly to scale the fruit radii, so I am just using some kind of average scale). Here's what 1908 (2x2x3x3x53) turns into:


The fruits are colored randomly, using a "bipolar" green-purple color map... just because it looked funky this way (and also because I think that Kenneth Moreland's work is cool).

In the test program, it can be toggled on/off with the 'c' key:


So that's about it, we've got our "factorization plants".

It turns out that prime factorization is a fascinating problem, important in the computing world (it is, for example, at the core of widely used cryptography algorithms), and a genuinely hard one as well.

Getting a little bit of visual insight into it feels like a tiny step towards understanding the problem, even though in reality it's just some amusing pictures.

But hey, it looks nice, so who cares.

8/15/13

Introduction to high-level multithreading in C++ via Intel TBB library


"The free lunch is over!"
"Multithreading is hard!"
"Amdahl's law will find and punish you!.."

You've probably already heard these maxims. Moms use them to scare kids who are throwing spaghetti all over the table and using global variables all day long. 

And you know it's fair since by now you've learned (the hard way, presumably) that writing parallel software is indeed a complex endeavor. Or is it? 

There is a practical observation that a software problem's complexity is at least twofold:
  • "Inherent" or unavoidable complexity that is at the essence of the problem.
  • "Accidental" complexity which covers the work that does not really have much to do with the problem at hand but needs doing anyway.
People have been developing new programming techniques, higher level languages, design patterns, frameworks and methodologies; all in order to mitigate the accidental complexity and to be able to focus only on the essential parts, thus improving productivity and reducing development risks. 

Would it be possible to sweep under the rug some of the work that makes multithreaded programming so complex? 

It turns out that this is exactly what the Intel Threading Building Blocks (Intel® TBB) library can help with and so, to illustrate where this is leading, let us first consider a simple problem.

 

The Game of Life

Say we are implementing a classic cellular automaton, Conway's Game of Life, which is defined by simple rules. 

The "world" is a big rectangular array of cells, each cell being in one of two possible states: alive or dead. The initial state of the board is random and at each discrete "epoch" we recompute the state of every cell such that:
  • A live cell will die if it has fewer than two (under-population) or more than three (overcrowding) live neighbors.
  • A dead cell with exactly three neighbors will come alive (reproduction).
  • Otherwise the state will remain unchanged.
The evolution rules for a single cell may be expressed by a C++ function:

static const int DEAD = 0, ALIVE = 1;
//  returns the cell's new state based on the evolution rules
int evolve_cell(int cur_state, int num_neighbors) {
    if (cur_state == ALIVE) {
        if (num_neighbors < 2 ||  // under-population
            num_neighbors > 3)    // overcrowding
            return DEAD;
    } else {
        if (num_neighbors == 3) return ALIVE; //  reproduction
    }
    return cur_state;
}

The rules are applied to each cell on the board, i.e. we have a big for-loop where, for each cell, we compute its state for the next epoch based on the number of live neighbors:

//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    for (int i = 0; i < num_cells(); i++) {
        cell[i] = evolve_cell(prev_board.cell[i], prev_board.num_neighbors(i));
    }
}
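
For completeness, here is a minimal sketch of a board class that these snippets would work with; the actual class is not listed here, so the storage layout and the wrap-around neighbor counting below are just an illustration, not the original code:

#include <vector>

struct game_of_life_board {
    int width, height;
    std::vector<int> cell;  // DEAD (0) or ALIVE (1) for every cell, row-major

    game_of_life_board(int w, int h) : width(w), height(h), cell(w*h, DEAD) {}

    int num_cells() const { return width*height; }

    //  counts the alive neighbors of cell i (wrapping around at the board edges)
    int num_neighbors(int i) const {
        const int x = i % width, y = i / width;
        int n = 0;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                if (dx == 0 && dy == 0) continue;
                const int nx = (x + dx + width) % width;
                const int ny = (y + dy + height) % height;
                n += cell[ny*width + nx];
            }
        return n;
    }

    void evolve(const game_of_life_board& prev_board); // defined above
};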
 
Here is how a typical board might look after some iterations:



Although the algorithm is trivial, if there are many cells to simulate the processing time will add up and we might want to start optimizing.

 

Trying to optimize the code

The right thing to do here would be to try to improve the algorithm first (after having made sure that we are solving the right problem). If that does not help, then start paying attention to things like memory access patterns, branch prediction, loop unrolling etc. Let's say we also hit a wall at this point. What do we do next?

When the program runs on an Intel i7 laptop, the CPU usage looks like this: 

There are 4 CPU cores but our computation only runs on a single core at a time. In practice the operating system makes it "jump" between different cores due to the presence of other processes on the system and the specifics of the OS process scheduler. If we ask the system not to do that (by setting the process CPU affinity), the picture changes to: 

 
As you can see, the 4 CPU cores are largely under-utilized: we are effectively using only one core. This probably means that we could potentially enlist the three other cores and improve the speed. 

So, how do we approach this? 

One thing to notice about this for-loop is that the computation on a cell does not modify any state that the other cells' computations might be affected by. Such algorithms are sometimes called "embarrassingly parallel"; that is, splitting the problem into parallel tasks is relatively effortless. 

Let's try to leverage this fact.

 

An attempt at multithreading

The traditional and the most basic approach would be to:
  1. Somehow partition the whole computation into several "jobs" that can run in parallel.
  2. Create a separate thread for each job and run the processing in each thread.
  3. Wait until all of the threads have finished and destroy the threads.
So we might end up with code looking like this: 

#include <pthread.h>
#include <algorithm>  // for std::min

//  structure describing a chunk of work running on a single thread
struct JobDesc
{
    int first_cell;
    int num_cells;
    const game_of_life_board* prev_board;
    game_of_life_board* new_board;
};

//  the thread (callback) function 
extern "C" {
    void* RunJob(void* p) {
        JobDesc* job = reinterpret_cast<JobDesc*>(p);
        //  for each cell in the job, compute its state for the next epoch, 
        //  based on the number of the alive neighbours
        const int last_cell = job->first_cell + job->num_cells;
        for (int i = job->first_cell; i < last_cell; i++) {
            job->new_board->cell[i] = evolve_cell(job->prev_board->cell[i], 
                job->prev_board->num_neighbors(i));
        }
        return 0;
    }
}

//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    const int NUM_THREADS = 4;
    pthread_t threads[NUM_THREADS];
    JobDesc jobs[NUM_THREADS];

    //  create the threads and start processing
    for (int i = 0; i < NUM_THREADS; i++) {
        //  fill in the i-th job details
        JobDesc& job = jobs[i];
        int cells_per_job = (num_cells() + NUM_THREADS - 1)/NUM_THREADS;
        job.first_cell = cells_per_job*i;
        job.num_cells  = std::min(cells_per_job, num_cells() - cells_per_job*i);
        job.prev_board = &prev_board;
        job.new_board  = this;
        pthread_create(&threads[i], NULL, RunJob, &job);
    }

    //  wait for all the threads to finish processing
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
}

Which indeed seems to help:

 

We use all of the 4 cores to the brim and the program runs considerably faster. Mission accomplished? 

Let's just take a step back and see what we did here. There was a for-loop and we said: "Hey, this loop should be run in parallel, on different cores". That's all we wanted to happen but then we wrote half a hundred lines of (rather fragile, bug-prone and mostly boilerplate) threading code just to implement this simple notion. 

This reeks of accidental complexity. To make matters worse, there are other not-quite-so-clear details:
  • We create and destroy the threads every time: Depending on the operating system there might be a noticeable additional overhead doing so.
  • What if we have a different number of cores - will our program scale easily to use all of the available CPU power? How many threads do we need to create, in general?
  • The implementation uses POSIX threads which makes it platform-dependent. What if we want to run it on platforms that use a different threading API?
Even with such a basic example we have turned a simple piece of code into something that is vulnerable to many potential problems and that will be hard to maintain.

 

A Game Engine example

Let us consider a more elaborate example. Say we are developing a game engine which, inside the game loop, runs a "world update" function, which in turn simulates different things, such as skeletal animations on the visible models, particle systems, physics, etc. 

The code might look like this:
 
void GameWorld::Update() {
    StepAnimations();  
    StepParticles(); 
    StepPhysics(); 
}

void GameWorld::StepAnimations() {
    //  update skeletal animations for every currently visible model
    for (int i = 0; i < m_Animations.size(); i++) {
        //  ...
    }
}

void GameWorld::StepParticles() {
    //  update particle systems
    for (int i = 0; i < m_Particles.size(); i++) {
        //  ...
    }
}

void GameWorld::StepPhysics() {
    //  perform physics integration for every disjoint physics "island"
    for (int i = 0; i < m_PhysicsIslands.size(); i++) {
        //  ...
    }
}
 

This problem, at least in our simplified formulation, is also "embarrassingly parallel", except that now the parallelism manifests itself on multiple levels. The high-level intent might sound like: "Hey, there are these different tasks (StepAnimations/Particles/Physics/AI) that could be run in parallel... and each of them has a loop that can be run in parallel as well. Oh, and we might also want to run the whole thing in parallel with something else." 

If we approach the problem in the same way as the Game of Life, we instantly notice that most of the required boilerplate code is very similar to what we already wrote. 

The partitioning of the computations into different threads becomes more complicated: how do we split the work into N (the number of threads) equally-sized parallel chunks? Given that there may be several levels of these parallelizable tasks and that at each level the amount of work can be data-dependent, the task rapidly becomes intractable in most practical cases. 

A common design pattern used to tackle this problem is a global task queue combined with a thread pool. Every worker thread from the pool takes a task from the top of the task queue, completes it and then takes the next one available, until the queue is empty. The tasks themselves are relatively fine-grained, which means that they are small enough to distribute the work evenly, but not so small that scheduling overhead starts to dominate. 

As simple as this may sound conceptually, implementing it from scratch, correctly and efficiently, is even more complicated. Just a few things worth mentioning:
  • Operations on the task queue should be thread-safe (e.g. the classical producer/consumer problem might require two condition variables).
  • From a performance point of view, thread contention should be avoided as much as possible: that is, threads should not spend too much time blocked on access to the task queue.
  • Cache locality should be taken into account: subsequent tasks running on the same CPU should operate on nearby memory locations when possible.
In other words, the CPUs should spend most of their time doing useful computations and not waste cycles on scheduling, blocking or waiting for other threads to finish.
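
To make this a bit more concrete, here is a deliberately naive sketch of such a global task queue using C++11 primitives; a real implementation would need, at the very least, a shutdown mechanism and much more care about contention and locality:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

//  a bare-bones global task queue: worker threads call pop() in a loop
class TaskQueue {
    std::queue<std::function<void()> > tasks;
    std::mutex mtx;
    std::condition_variable cv;
public:
    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push(std::move(task));
        }
        cv.notify_one();  // wake up one waiting worker
    }

    std::function<void()> pop() {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return !tasks.empty(); });  // block while the queue is empty
        std::function<void()> task = std::move(tasks.front());
        tasks.pop();
        return task;
    }
};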

 

Enter Intel TBB

Let's slow down on piling up this article's accidental complexity and finally get to the point. 

The Intel Threading Building Blocks library tries to address the aforementioned concerns and help implement parallel algorithms on a higher level of abstraction, factoring away the boilerplate threading code, thread pools, task queuing/scheduling and other patterns that prevail in multithreaded code. Furthermore, the goal is to do so in a platform-independent and efficient way. 

It is a pure C++ library, meaning that it uses whatever capabilities the existing platforms and compilers already provide (in contrast to the OpenMP standard, amongst others). 

Getting back to our original Game of Life main loop, using the TBB library we can make it parallel like so: 

#include <tbb/tbb.h>
// ...
//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    //  for each cell, compute in parallel its state for the next epoch, 
    //  based on the number of the alive neighbours
    tbb::parallel_for(0, num_cells(), [&](int i) {
        cell[i] = evolve_cell(prev_board.cell[i], 
            prev_board.num_neighbors(i));
    });
}
 
As you can see, this time it looks very similar to the serial (non-parallel) version, which is, of course, a good thing. 

It uses the parallel_for primitive; in practice it's the workhorse of parallel computations, especially the "embarrassingly parallel" ones. This code example also takes advantage of the C++11 lambda expression syntax, which allows us to represent the callback function for the loop body in a concise (albeit somewhat unusual) way. 

A side note regarding the lambda expressions: if your C++ compiler does not support them (even though nowadays most of them do), then it's possible to use C++ functors. However, the syntax gets a bit more verbose: 

struct EvolveFn {
    //  captured data
    const game_of_life_board&       prev_board;
    game_of_life_board&             _this;

    //  constructor
    EvolveFn(const game_of_life_board& prev_board_, game_of_life_board& _this_) 
        : prev_board(prev_board_), 
        _this(_this_) {}

    //  the loop body
    void operator ()(int i) const {
        _this.cell[i] = evolve_cell(prev_board.cell[i], 
            prev_board.num_neighbors(i));
    }
};

//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board) {
    //  for each cell, compute in parallel its state for the next epoch, 
    //  based on the number of the alive neighbours
    tbb::parallel_for(0, num_cells(), EvolveFn(prev_board, *this));
}
 

If that looks too ugly, it's possible to come up with some macro hacks so that, instead of directly writing a C++ functor, we have something like this: 

//  evolves all cells one epoch, given the previous epoch state in prev_board
void game_of_life_board::evolve(const game_of_life_board& prev_board)
{
    //  for each cell, compute in parallel its state for the next epoch, 
    //  based on the number of the alive neighbours
    game_of_life_board* _this = this;
    LAMBDA_OPEN1(lambda, int,i)
    {
        _this->cell[i] = _this->evolve(prev_board.cell[i], 
            prev_board.num_neighbors(i));
    }
    LAMBDA_CLOSE2(lambda, 
        const game_of_life_board,prev_board, game_of_life_board*, _this);

    tbb::parallel_for(0, num_cells(), lambda);
}
 

Moving on, there is another basic high-level pattern that is useful: "run these different computations in parallel and wait for all of them to finish before continuing". This is what task groups in TBB are used for. The earlier Game Engine example code now becomes: 

#include <tbb/tbb.h>

void GameWorld::Update() {
    tbb::task_group tg;    

    tg.run([&]() { StepAnimations(); });
    tg.run([&]() { StepParticles();  });
    tg.run([&]() { StepPhysics();    });

    tg.wait(); 
}

void GameWorld::StepAnimations() {
    //  update skeletal animations for every currently visible model
    tbb::parallel_for(0, (int)m_Animations.size(), [&](int i) {
        //  ...
    });
}

void GameWorld::StepParticles() {
    //  update particle systems
    tbb::parallel_for(0, (int)m_Particles.size(), [&](int i) {
        //  ...
    });
}

void GameWorld::StepPhysics() {
    //  perform physics integration for every disjoint physics "island"
    tbb::parallel_for(0, (int)m_PhysicsIslands.size(), [&](int i) {
        //  ...
    });
}

Once again we have code that is similar to the original serial version yet this time it is parallel. TBB is taking care of the gritty scheduling details behind the scenes and trying to do so in an efficient way.

 

The magic that happens behind the scenes

We tell TBB about the boundaries of the "tasks" and the dependencies between them. In the previous example, we did it explicitly with the task group and implicitly with the parallel_for loops (leaving TBB to decide how to partition the iteration space into different chunks of work). 

These tasks are then arranged in correct order between different threads from a thread pool. The "task queue" concept is also implemented, with one notable difference: each thread has its own task queue instead of using a single global one. 

This solves the problem of access contention on a global task queue but adds an additional danger: some threads might empty their queues early and become idle. To mitigate this, TBB implements so-called task stealing: when a thread runs out of tasks, it tries to "steal" them from the back of some other thread's queue. There is a bunch of heuristics and fine-tuning involved in this process to minimize contention and improve cache locality. 

For the Game of Life example, with the help of TBB we obtain CPU utilization (and computation speed) just as good as with the original manual multithreading attempt, except that we use considerably less code and the result is more scalable. 

For the Game Engine example, the average performance of the TBB solution will in most cases actually be better than that of a manually crafted one, because of TBB's adaptive load balancing.

 

A bird's-eye view of the library

Speaking generally about the whole library, there are several components at different levels of abstraction with higher levels being based on the lower levels' functionality:

As may already be clear, the task scheduler is at the heart of TBB, taking care of scheduling the chunks of work ("tasks"). These tasks are implicitly organized into a DAG (directed acyclic graph). If there is an edge from task A to task B in this graph, it means that "task A depends on finishing task B before it can proceed".

Another group of low-level facilities is the synchronization primitives: mutexes and atomics. There are several flavors of mutexes, starting with lightweight spinlocks, which can be extremely efficient in low-contention scenarios, and ending with platform-independent wrappers of OS-level mutexes (the "heaviest" ones). Atomics are an abstraction of lightweight atomic operations, usually supported by hardware. 
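
A small illustration (generic usage, not taken from the examples above): a spin mutex protecting a tiny critical section, and an atomic counter that needs no lock at all:

#include <tbb/tbb.h>

tbb::spin_mutex   g_stats_mutex;
tbb::atomic<long> g_alive_cells;   // lock-free counter

void on_cell_alive() {
    ++g_alive_cells;               // atomic increment, safe from any thread
}

void add_to_shared_stats(int delta, int& shared_total) {
    //  the lock is released automatically when scoped_lock goes out of scope
    tbb::spin_mutex::scoped_lock lock(g_stats_mutex);
    shared_total += delta;
}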

Containers are a set of classes that implement thread-safe counterparts of the containers in the C++ standard library (with a somewhat different API, since the original one was not designed with multithreading in mind and is not particularly suited for it).  
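
For instance (a hypothetical snippet reusing the flat cell array from the Game of Life example), tbb::concurrent_vector allows push_back from many threads at once, which std::vector does not:

#include <tbb/tbb.h>
#include <vector>

//  collect the indices of all alive cells, appending from parallel loop iterations
tbb::concurrent_vector<int> collect_alive(const std::vector<int>& cells) {
    tbb::concurrent_vector<int> alive;
    tbb::parallel_for(0, (int)cells.size(), [&](int i) {
        if (cells[i] == 1) alive.push_back(i);  // thread-safe push_back
    });
    return alive;
}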

Algorithms include the patterns that we've already seen ("parallel for", "task groups"), and a bunch of other ones in addition, such as "parallel reduce", "pipeline", "parallel sort".
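
As an example of the first of those, counting the alive cells could look like this with parallel_reduce (again a hypothetical snippet, assuming the cells live in a plain std::vector<int>):

#include <tbb/tbb.h>
#include <functional>
#include <vector>

int count_alive(const std::vector<int>& cells) {
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, cells.size()),
        0,                                                   // identity value
        [&](const tbb::blocked_range<size_t>& r, int sum) {  // sum over one chunk
            for (size_t i = r.begin(); i != r.end(); ++i) sum += cells[i];
            return sum;
        },
        std::plus<int>());                                   // combine partial sums
}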

Flow graph is the highest-level concept (and a relatively new addition), which makes it possible to explicitly build data processing graphs from different building blocks on the fly.
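
A minimal, purely illustrative example: one node squares integers, another prints them, and an edge connects the two:

#include <tbb/flow_graph.h>
#include <cstdio>

int main() {
    tbb::flow::graph g;

    //  a node that squares its input, running its body concurrently if needed
    tbb::flow::function_node<int, int> square(g, tbb::flow::unlimited,
        [](int v) { return v*v; });

    //  a serial sink node that prints whatever arrives
    tbb::flow::function_node<int, tbb::flow::continue_msg> print(g, 1,
        [](int v) { std::printf("%d\n", v); return tbb::flow::continue_msg(); });

    tbb::flow::make_edge(square, print);

    for (int i = 1; i <= 4; i++) square.try_put(i);
    g.wait_for_all();  // block until every message has been processed
    return 0;
}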

 Additionally, there are various low-level utilities, such as:
  • Efficient memory allocator, fine-tuned for heavily multithreaded scenarios.
  • System threads and thread local storage wrappers.
  • Robust and thread-safe timing facilities.
  • Cross-thread exception handling (the C++ standard, at least before C++11, does not provide any means of handling exceptions thrown in other threads).
Of course, much more information can be found in the user documentation.

 

Licensing and portability

The library is dual-licensed: there is a commercial license (which provides support), and there is an open-source one (GPLv2). There is no difference in features between the two; the source code of the whole library is available to the curious reader. 

It's worth mentioning that the open source version comes with the so-called runtime exception, which means that linking dynamically with the library does not automatically force the resulting executable into GPLv2 licensing; essentially, the open source version of TBB can be used in closed-source applications. 

The officially supported platforms are Linux (and most of the distributions have the library included), Windows, MacOS. The supported compilers are the Intel compiler (naturally), GCC and Microsoft Visual Studio. Additionally, since the library is open source, there are several community patches to bring support to other platforms, such as FreeBSD, QNX, Xbox360 etc. The Solaris OS and Solaris Studio compilers are also (unofficially) supported but with some limitations.

Final words

TBB is not the only library out there that provides useful abstractions for multithreaded programming in C++, and it's not necessarily the one you need. There are quite a few others on the landscape and it helps to be aware of them.


As often happens with silver bullets, they don't exist. TBB is designed for processing-heavy applications; it's not the library to use when, for example, most of your tasks spend their time blocked on IO (say, waiting on a network socket). It does not help in any way with distributed processing, where computations are supposed to be spread between different nodes. And since threads still work with shared memory, it does not guarantee that your program will be free of race conditions and deadlocks (although it can reduce the likelihood of these common errors, since the amount of shared state can be minimized thanks to the task-level parallelization). 

So, as always, it's a matter of common sense and having at least a general understanding of the library's internals. Armed with that, Intel Threading Building Blocks can be a very powerful tool for implementing computation-heavy parallel algorithms.  

11/16/12

Mahjong solitaire puzzle game in F#/WPF

So, a little learning experiment in F#/WPF.

It's a Mahjong solitaire puzzle, with the (already somewhat familiar) "programming languages" twist.

The images on the stones correspond to different programming languages. The player can open a web page with information about the language on the currently selected stone by clicking a URL in the status bar.

In the cheap pun tradition the code name for the game would be "Mahjolon" (Mahjong, Babylon, many languages... see?)

Being a side effect of slowly learning F# in the background, it's neither idiomatic nor elegant, nor a "tutorial" or anything of the sort... not "how to do things", but rather "how I did them on the first attempt".

In fact, it's not even a program, but a script (.fsx) with everything lumped together - the result of an exploratory style of development, where you write small pieces of code, evaluate them on the fly in the REPL (AKA "F# Interactive") and almost instantly see what happens.

But I figured I'd reflect on this anyway, so here we go (btw, the code is available on github).

We have the following files (all in the same folder):

  • mahjong.fsx - the game script itself. It can be run, for example, by opening it in Visual Studio (I used Visual Studio 2010), pressing Ctrl+Alt+F to open F# interactive, Ctrl+A to select all code and then Alt+Enter to evaluate it.
  • mahjong.xaml - the minimal WPF XAML file with the main (and only) window layout
  • languages.txt - the list of programming language URL/name pairs, one per line, separated by "|", e.g.:
http://en.wikipedia.org/wiki/Lua_(programming_language)|Lua
http://en.wikipedia.org/wiki/Erlang_(programming_language)|Erlang
http://en.wikipedia.org/wiki/Clojure|Clojure
http://factorcode.org/|Factor
  • layouts.txt - text file, describing different initial board layouts
  • stones_bg.png - sprite atlas for stone backgrounds (normal stone and a selected one)
  • stones_fg.png - sprite atlas with the stone "faces", which in my case are programming language logos

The script itself, mahjong.fsx, goes like this.

Imports

At the beginning we reference the assemblies we use, the "script way". The ones here are needed in order to use WPF.


#r "WindowsBase"
#r "PresentationCore"
#r "PresentationFramework"
#r "System.Xaml"

open System
open System.Windows
open System.Windows.Controls
open System.Windows.Shapes
open System.Windows.Media
open System.Windows.Media.Imaging
open System.Windows.Markup
open System.Xml
open System.IO

Here we first tell which (WPF) assemblies we will use, and then import a few particular namespaces from those assemblies.

General utilities

F# has an extensive set of library functions for manipulating containers, but sometimes it's not enough and we want to define our own.

The first one tries to apply a function to the argument until it returns a Some value (or until the number of attempts reaches maxDepth, in which case the overall result is None). It's a recursive function, using tail recursion. We'll need it for, e.g., trying to shuffle the stones in such a way that the board is solvable.

module Utils =

  let rec tryApply maxDepth f arg = 
    match maxDepth, (f arg) with
    | 0, _ -> None
    | _, Some s -> Some s
    | _, None -> tryApply (maxDepth - 1) f arg

Now we use F#'s ability to extend existing modules and add some extra array manipulation functions.

Function choosei is the same as Array.choose, but also passes the index into the predicate:

module Array =
  let choosei f (array: _[]) =
    let res = new System.Collections.Generic.List<_>()
    for i = 0 to array.Length - 1 do 
      let x = array.[i] 
      match f (i, x) with
      | Some v -> res.Add(v)
      | None -> ignore()
    res.ToArray()

It is implemented by peeking into the source code of the core F# library and modifying it to our needs (that's why it might seem weird at first glance... the core F# library code tries to do things in a way that performs well enough, which does not necessarily look like idiomatic F#).

Function shuffle returns a shuffled array. Internally, it creates a copy of the input array and then does the good old in-place Fisher-Yates shuffle.

  let shuffle arr =
    let a = arr |> Array.copy
    let rand = new System.Random()
    let flip i _ =
      let j = rand.Next(i, Array.length arr)
      let tmp = a.[i]
      a.[i] <- a.[j]
      a.[j] <- tmp
    a |> Array.iteri flip
    a

The function tryFindLastIndexi searches for the last element in the array satisfying the given predicate, returning None if there is no such element, and Some(index) if it is found:

  let tryFindLastIndexi f (array : _[]) = 
    let rec searchFrom n = 
      if n < 0 then None 
      elif f n array.[n] then Some n else searchFrom (n - 1)
    searchFrom (array.Length - 1)

Function replicate takes a subarray of the given input array, repeats it numTimes times in a row and returns the resulting array.

  let replicate firstElem numElem numTimes arr = 
    let res = Array.zeroCreate (numElem*numTimes) 
    for i in [0..(numTimes - 1)] do Array.blit arr firstElem res (i*numElem) numElem 
    res

The last utility function, fullPath, is rather a hack, applicable only when we are doing this kind of quick-and-dirty REPL development. It returns the full path to a file, assuming that the file is located in the same folder as the current script (it will later be used to load the data files for the game).

let fullPath file = __SOURCE_DIRECTORY__ + "/" + file

Types

Here are the custom types our code will work with.

The first one, SpriteAtlas, is used to bundle together information about a "sprite sheet" (we have two of these - one for the background and one for the foreground stone image components):

type SpriteAtlas = { 
  File: string
  Cols: int
  Rows: int
  FrameWidth: float
  FrameHeight: float
}

The next one, a discriminated union called StoneState, describes the three possible states a stone can be in. Note that a stone becomes "Hidden" after the player "removes" it.

type StoneState =
  | Visible
  | Selected
  | Hidden

The main data structure, Game, is a bundle of all the data describing one game session. Most of it is mutable (it changes during a single game session).

Note that even though some of the fields may not appear to be mutable, in fact they are! This is one of the darker sides of the language: since arrays and .NET objects are implicitly mutable, you basically never know if their elements have been changed or not.

Thus, only StoneCoords is never actually changed after an object of type Game is created.

type Game = { 
  StoneCoords: (int*int*int)[]
  StoneIDs: int[]
  StoneStates: StoneState[]
  StoneControls: (Rectangle*Rectangle)[] 
  mutable Moves: int list
  mutable CurSelected: int option
  mutable NumHiddenLayers: int
}

Global constants

Since the quick-and-dirty approach is rather tolerant of hardcoded values, we have a few of those here as well.

To somewhat silence the feeling of guilt, I've tried making them more explicit by bundling them all together and naming them in caps.

let STONE_EXTENTS = 66., 78.
let STONE_3D_OFFSET = -7., -11.

let MAIN_WINDOW_XAML = "mahjong.xaml"
let LAYOUTS_FILE = "layouts.txt"
let LANGUAGES_FILE = "languages.txt"

let BG_ATLAS = { File = "stones_bg.png"; Cols = 2; Rows = 1; FrameWidth = 75.; FrameHeight = 90.; }
let FG_ATLAS = { File = "stones_fg.png"; Cols = 12; Rows = 10; FrameWidth = 64.; FrameHeight = 60.; }

let MAX_ARRANGE_ATTEMPTS = 50
let ABOUT_URL = @"http://en.wikipedia.org/wiki/Mahjong_solitaire"

Loading data from files

There are two types of game data that we load from files: board layout description and stone face description (the name of the corresponding programming language and URL of what is usually a Wikipedia page).

The following functions read this data from the files and parse it into the data structures used by the game logic.

The file layouts.txt has data about several initial board layouts and looks like this:

Each 2x2 ASCII cell corresponds to one stone, and the number inside the cell equals the highest stone level at the corresponding location. Since there should always be a stone below another stone (except at level 0), such a representation describes the board unambiguously.

Note that the 2x2 cell representation is used in order to be able to describe those stones which have a half-stone position offset (in any or both directions).

So, the following function, parseLayout, converts the text representation of a single board layout into an array of (column, row, layer) tuples, one for every stone on the board. The column and row use the same half-stone coordinate system (i.e. one stone occupies 2x2 cells).

The function internally uses an array (doing destructive updates on it), "peeling" out the stone coordinates layer-by-layer, starting from the top one.

let parseLayout (str:string) =   
  let charToLayer ch = 
    match Int32.TryParse(string ch) with
    | (true, n) -> n
    | _ -> -1
  let strToRow (s:string) =
    s.TrimEnd(' ').ToCharArray()
    |> Array.map charToLayer 
  let layout = 
    str.Split('\n') 
    |> Array.map strToRow
    |> Array.filter (fun r -> r.Length <> 0)
  let maxLayer = 
    layout 
    |> Array.maxBy Array.max 
    |> Array.max
  seq {
    let l = Array.copy layout
    let blockOffs = [|0, 0; 1, 0; 0, 1; 1, 1|]
    for layer in maxLayer .. -1 .. 1 do
      for row in 0 .. layout.Length - 2 do
        for col in 0 .. layout.[row].Length - 2 do
          let isBlock = Array.forall (fun (x,y) -> l.[row + x].[col + y] = layer) blockOffs
          if isBlock then
            yield (col, row, layer)
            blockOffs |> Array.iter (fun (x, y) -> l.[row + x].[col + y] <- layer - 1) 
  } |> Seq.toArray

Then the value layouts contains the list of possible boards. It takes the input file, splits out the board descriptions and parses each of them:

let layouts = 
  let splitSections (res, s) (line:string) = 
    if line.StartsWith("-") then (s::res, "") else (res, s + "\n" + line)
  seq {
      use sr = new StreamReader(fullPath LAYOUTS_FILE)
      while not sr.EndOfStream do yield sr.ReadLine()
  } |> Seq.fold splitSections ([],"") 
    |> fst 
    |> List.rev 
    |> List.filter (fun l -> l.Length > 0) 
    |> List.toArray 
    |> Array.choosei (function | i, x when i%2 = 1 -> Some x | _ -> None)
    |> Array.map parseLayout
    |> Array.map (Array.sortBy (fun (x, y, h) -> x + y + h*1000))

Another value, languages, contains the array of URL/name tuples for all the possible stone faces.

let languages = 
  seq {
        use sr = new StreamReader(fullPath LANGUAGES_FILE)
        while not sr.EndOfStream do yield sr.ReadLine()
    } 
    |> Seq.map (fun s -> s.Trim().Split('|'))
    |> Seq.filter (fun el -> el.Length = 2)
    |> Seq.map (fun el -> (el.[0], el.[1]))
    |> Seq.toArray

Game logic (functional part)

The function isFree is used to find out whether the stone with the given coordinates is "free", i.e. whether it can be removed from the board. By Mahjong solitaire rules, a stone is free when it has no other stones on top of it and is not blocked on at least one of its sides (left or right).

This function is somewhat unusual - instead of immediately returning true/false, it returns another function, and that function can then be used to test whether a stone is free.

The reason is that we employ a technique akin to memoization: internally we create a hash map that, given the cell coordinates (in the half-stone coordinate system), gives us the index of the stone. This hash map is "remembered" via a closure and can be reused for all subsequent lookups.

The function that is returned also takes the array of current stone states, so that we are taking into account already removed stones (they should not block anything).

let isFree coords = 
  //  create hash table with coordinates to quickly find neighbors
  let m = System.Collections.Generic.Dictionary()
  let addStone i (x, y, h) = 
    [|0, 0; 1, 0; 0, 1; 1, 1|]
    |> Array.iter (fun (dx, dy) -> m.Add((x + dx, y + dy, h), i))
  coords |> Array.iteri addStone
  fun (states:StoneState[]) stoneID ->
    let (x, y, h) = coords.[stoneID]
    let isBlockedBy offsets = 
      let isBlocking (dx, dy, dh) =
        let key = (x + dx, y + dy, h + dh)
        if m.ContainsKey key then states.[m.[key]] <> Hidden else false
      offsets |> List.exists isBlocking
    let top = [0, 0, 1; 0, 1, 1; 1, 0, 1; 1, 1, 1]
    let left = [-1, 0, 0; -1, 1, 0]
    let right = [2, 0, 0; 2, 1, 0]
    not (isBlockedBy top || (isBlockedBy left && isBlockedBy right))

The next function, getFree, uses the previous one to get the current list of free stones. Note that the costly isFree coords is evaluated only once, and it's just the resulting (partially applied) function that is then applied to all of the stone indices.

let getFree coords (states:StoneState[]) =
  [0 .. states.Length - 1]
  |> List.filter (fun i -> states.[i] <> Hidden)
  |> List.filter (isFree coords states)

The function getMatches returns the list of stones that can be removed: they are both free and have a matching pair which is also free:

let getMatches (ids:int[]) coords states = 
  let free = getFree coords states 
  free
  |> Set.ofList 
  |> Set.filter (fun i -> 
    (List.sumBy (function 
      | a when ids.[a] = ids.[i] -> 1 
      | _ -> 0) free) > 1)
  |> Seq.toList

The next two functions, arrangeRandom and tryArrangeSolvable, both shuffle the given arrays of stones in a random way.

The difference is that the first one does this purely randomly, while the second one tries to create an order that is both random and solvable.

Most existing Mahjong Solitaire implementations have a "shuffle" button, but many of them shuffle without caring whether the resulting layout is actually solvable at all, which can be very annoying.

So the "shuffle solvable" algorithm does the following:

  1. finds all the face type pairs that are left to be removed on the board and shuffles them
  2. "clears" all the visible stones
  3. takes a random pair of "free" stones (those that can be removed)
  4. assigns the next face type from the array found at step 1 to the chosen random pair of free stones
  5. goes back to step 3, until all of the remaining stones have been assigned faces, or until only one free stone is left (in which case the algorithm fails)

It is possible that the current board layout does not allow creating a solvable position at all (as a simple example, imagine that there are only two stones left, one on top of the other).

That's why the shuffling is attempted several times before giving up. When only one free stone remains at step 5, we don't know for sure whether that's because the layout is unsolvable or because of the way we've been choosing the stone pairs at step 3.

let arrangeRandom (coords:(int*int*int)[]) (stoneTypes:int[]) = 
  Some(stoneTypes |> Array.shuffle)

let tryArrangeSolvable (coords:(int*int*int)[]) (stoneTypes:int[]) = 
  let stonePairTypes =
    stoneTypes
    |> Array.sort
    |> Array.choosei (function | i, x when i%2 = 0 -> Some x | _ -> None)
    |> Array.shuffle
  let isFree' = isFree coords
  let s = seq { 
    let states = [|for x in 1 .. coords.Length do yield Visible|]    
    for c in stonePairTypes do
      let nextFree = 
        [0..(coords.Length - 1)]
        |> List.filter (fun x -> states.[x] = Visible)
        |> List.filter (isFree' states)
        |> List.toArray
        |> Array.shuffle
        |> Seq.truncate 2
        |> Seq.map (fun x -> (x, c))
      nextFree |> Seq.iteri (fun i (x, c) -> states.[x] <- Hidden)
      yield! nextFree
  } 
  let ids = s |> Seq.toArray |> Array.sortBy fst |> Array.map snd 
  let numStones = Seq.length stoneTypes
  if ids.Length <> numStones then None else Some ids

Then, the shuffleVisible function does the actual shuffling via a "shuffle function" (that can be one of the two previous functions) provided as an argument:

let shuffleVisible coords (ids:int[]) (states:StoneState[]) shuffleFn =
  let isVisible = (function | i, id when states.[i] = Visible -> Some id | _ -> None)
  let shuffled = 
    ids
    |> Array.choosei isVisible
    |> Utils.tryApply MAX_ARRANGE_ATTEMPTS (shuffleFn (Array.choosei isVisible coords))
  match shuffled with
  | Some (s:int[]) -> 
    ids 
    |> Array.zip3 states [|0..(states.Length - 1)|] 
    |> Array.filter (fun (st, _, _) -> st = Visible)
    |> Array.iteri (fun i (_, idx, _) -> ids.[idx] <- s.[i])
    true
  | None -> 
    MessageBox.Show "Not possible to create solvable position!" |> ignore
    false

Function getMaxLayer just returns the highest layer number used in the current board layout:

let getMaxLayer coords = 
  let  _, _, m = coords |> Array.maxBy (fun (_, _, h) -> h)
  m

UI/Drawing

We'll have a minimalistic GUI with a main menu and a status bar. The layout is described in a XAML file (mahjong.xaml) that goes like this:

<Window xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        Title="F#/WPF Mahjong Solitaire" Background="#4f3d22" Width="1024" Height="768">
  <Grid>
    <Grid.RowDefinitions>
      <RowDefinition Height="Auto"/>
      <RowDefinition Height="*"/>
      <RowDefinition Height="Auto"/>
    </Grid.RowDefinitions>
    <Menu  Grid.Row="0">
      <MenuItem Header="Game">
        <MenuItem Header="New">
          <MenuItem Name="MenuRandom" Header="Random"/>
          <Separator/>
          <MenuItem Name="MenuTurtle" Header="Turtle"/>
          <MenuItem Name="MenuDragon" Header="Dragon"/>
          <MenuItem Name="MenuCrab" Header="Crab"/>
          <MenuItem Name="MenuSpider" Header="Spider"/>
        </MenuItem>
        <Separator/>
        <MenuItem Name="MenuUndo" Header="Undo"/>
        <MenuItem Name="MenuShuffle" Header="Shuffle"/>
        <MenuItem Name="MenuShuffleSolvable" Header="Shuffle Solvable"/>
        <Separator/>
        <MenuItem Name="MenuExit" Header="Exit"/>
      </MenuItem>
      <MenuItem Header="View">
        <MenuItem Name="MenuShowFree" Header="Show Free"/>
        <MenuItem Name="MenuShowMatches" Header="Show Matches"/>
        <Separator/>
        <MenuItem Name="MenuHideLayer" Header="Hide layer"/>
        <MenuItem Name="MenuUnhideLayer" Header="Unhide layer"/>
      </MenuItem>
      <MenuItem Header="Help">
        <MenuItem Name="MenuAbout" Header="About"/>
      </MenuItem>
    </Menu>
    <Canvas Grid.Row="1" Name="BoardCanvas" Margin="35,60,30,50" />
    <StatusBar Grid.Row="2">
      <StatusBarItem>
        <Button Name="StoneInfoURL">
          <Button.Template>
            <ControlTemplate TargetType="{x:Type Button}">
              <ContentPresenter/>
            </ControlTemplate>
          </Button.Template>
          <TextBlock Name="StoneName" Cursor="Hand" VerticalAlignment="Bottom"
                     Foreground="Blue" TextDecorations="Underline"/>
        </Button>
      </StatusBarItem>
    </StatusBar>
  </Grid>
</Window>

We load the XAML file, creating the corresponding WPF Window object named window, and call Show() on it:

let loadXamlWindow (filename:string) =
  let reader = XmlReader.Create(filename)
  XamlReader.Load(reader) :?> Window

let window = loadXamlWindow(fullPath MAIN_WINDOW_XAML)
window.Show()

The next function, spriteBrush, creates a WPF image brush corresponding to the given frame in the given sprite atlas:

let spriteBrush atlas id =
  let imgSource = new BitmapImage(new Uri(fullPath atlas.File))
  let stoneW, stoneH = 1./(float atlas.Cols), 1./(float atlas.Rows)
  let viewBox = new Rect(stoneW*(float (id % atlas.Cols)), 
                         stoneH*(float (id / atlas.Cols)), stoneW, stoneH)
  new ImageBrush(ImageSource = imgSource, Viewbox = viewBox)

We will represent every stone's visual with a pair of WPF Rectangle objects: one for the background image, and one for the face image on top of it (yes, it's far from being an efficient/elegant way of doing this, but it will do for now).

The next function, updateStoneControl, takes this pair of Rectangles and changes their brushes so that the stone visual corresponds to both the stone face type and the stone state:

let updateStoneControl stoneControl id state = 
  let (bg:Rectangle), (fg:Rectangle) = stoneControl
  let spriteID = if state = Selected then 1 else 0
  let selBrush = spriteBrush BG_ATLAS spriteID
  match state with
  | Hidden -> bg.Fill <- null; fg.Fill <- null
  | _ -> bg.Fill <- selBrush
  let imgBrush = 
    match state with
    | Hidden -> null
    | _ -> spriteBrush FG_ATLAS id
  fg.Fill <- imgBrush

Similarly, setStoneOpacity updates only transparency for the given stone visual (we'll need it for the "hide layer" feature):

let setStoneOpacity opacity (fg:Rectangle, bg:Rectangle) =
    fg.Opacity <- opacity; bg.Opacity <- opacity;

Function getStoneLocation computes the position of the stone visuals on the canvas, given the stone coordinates in the half-stone coordinate system. It takes the layer number into account as well, simulating a "3D offset" (the higher the layer, the bigger the offset):

let getStoneLocation (i, j, layer) =
  let lx, ly = STONE_3D_OFFSET
  let sx, sy = STONE_EXTENTS
  let x = (float i)*sx*0.5 + (float layer)*lx
  let y = (float j)*sy*0.5 + (float layer)*ly
  (x, y, sx - lx, sy - ly)

Finally, the function that creates the initial pair of WPF Rectangles for every stone: createStoneControls.

It adds the rectangles as children to the "BoardCanvas", the Canvas control that is described in the XAML file and created when the XAML was loaded:

let createStoneControls stoneDataArr =
  let createControlPair (id, (i, j, layer)) =
    let (x, y, _, _) = getStoneLocation(i, j, layer)
    let bg = new Rectangle(Width = BG_ATLAS.FrameWidth, Height = BG_ATLAS.FrameHeight)
    Canvas.SetLeft(bg, x)
    Canvas.SetTop(bg, y)   
    let fg = new Rectangle(Width = FG_ATLAS.FrameWidth, Height = FG_ATLAS.FrameHeight)
    Canvas.SetLeft(fg, x + 2.)
    Canvas.SetTop(fg, y + 10.)
    let controls = (bg, fg)
    updateStoneControl controls id Visible
    controls

  let canvas = window.FindName("BoardCanvas") :?> Canvas
  canvas.Children.Clear()
  stoneDataArr 
    |> Array.map createControlPair 
    |> Array.map (fun (bg, fg) -> 
                    canvas.Children.Add(bg) |> ignore
                    canvas.Children.Add(fg) |> ignore
                    (bg, fg))

Game logic (imperative part)

What comes next is imperative code that implements player actions via mutating the current game state object (of type Game, described earlier).

The first function, newGame, returns a new pristine game state object with the given board layout. A (mutable) value, game, holding the current game state is then created from it.

There is also a helper function startGame, which wraps the new game object creation.

let newGame layoutID =
  let coords = layouts.[layoutID]
  let states = Array.init coords.Length (fun _ -> Visible)
  let ids = 
    [|0..(Array.length languages - 1)|] 
    |> Array.shuffle 
    |> Array.replicate 0 ((Array.length coords)/4) 4
  shuffleVisible coords ids states tryArrangeSolvable |> ignore
  let controls = 
    coords
    |> Array.zip ids
    |> createStoneControls
  { StoneCoords =  coords
    StoneStates = states
    CurSelected = None
    Moves = []
    StoneIDs = ids
    NumHiddenLayers = 0
    StoneControls = controls }

let mutable game = newGame ((new System.Random()).Next(0, Array.length layouts))

let startGame id = 
  fun _ -> game <- newGame id

The next bunch of functions do not do much per se, wrapping different possible player actions:

  • shuffleStones - shuffling the visible stones via given shuffle method
  • setStoneState - change the state of the given stone
  • undoMove - undo the last move
  • unselectAll - unselects all the currently selected stones
  • showFreeStones - select all the stones that are currently "free"
  • showMatchingStones - select all the stones that are both "free" and have free matching stones

Each of them has an explicit "update the stone visual" as the last step, because each of these actions affects the stones' appearance in some way.

Now, while this is straightforward, it does not seem elegant at all. It feels like there should be a better way of propagating state changes into the visual appearance. But let's return to that later and stick with what kind of works for now.

let shuffleStones shuffleFn = 
  let shuffled = shuffleVisible game.StoneCoords game.StoneIDs game.StoneStates shuffleFn
  //  update stone controls
  game.StoneStates 
  |> Array.iteri (fun i state -> updateStoneControl game.StoneControls.[i] game.StoneIDs.[i] state)
  shuffled
let setStoneState state idx =
  game.StoneStates.[idx] <- state
  updateStoneControl game.StoneControls.[idx] game.StoneIDs.[idx] state
let undoMove () = 
  match game.Moves with
  | a::b::rest -> game.Moves <- rest; setStoneState Visible a; setStoneState Visible b;
  | _ -> ()
let unselectAll () = 
    game.CurSelected <- None
    [|0 .. game.StoneCoords.Length - 1|] 
    |> Array.filter (fun i -> game.StoneStates.[i] <> Hidden) 
    |> Array.iter (setStoneState Visible)
let showFreeStones () =  
  unselectAll ()
  getFree game.StoneCoords game.StoneStates |> List.iter (setStoneState Selected)
let showMatchingStones () = 
  unselectAll ()
  getMatches game.StoneIDs game.StoneCoords game.StoneStates |> List.iter (setStoneState Selected)

Function hideLayers changes the number of hidden layers by delta (+1 or -1) and updates the stones' transparency accordingly, so that it's possible to see what's beneath the top layers.

This feature tries to make our Mahjong Solitaire puzzle into a game with full information (otherwise it generally would not be possible to deterministically come up with a proper stone removal strategy without seeing what's beneath the top stones).

let hideLayers delta = 
  let newHiddenLayers = game.NumHiddenLayers + delta
  let maxLayer = getMaxLayer game.StoneCoords
  if newHiddenLayers >= 0 && newHiddenLayers < maxLayer then
    game.NumHiddenLayers <- newHiddenLayers
    game.StoneCoords |> Array.iteri (fun i (_, _, h) -> 
      let alpha = (if h <= maxLayer - newHiddenLayers then 1.0 else 0.2)
      setStoneOpacity alpha game.StoneControls.[i])

removeStonePair removes the given (matching) pair of stones and then analyzes the resulting board, reacting accordingly:

let removeStonePair s1 s2 = 
  unselectAll () 
  setStoneState Hidden s1 
  setStoneState Hidden s2 
  game.Moves <- s1::s2::game.Moves;
  if (game.StoneStates |> Array.forall (fun st -> st = Hidden)) then 
    MessageBox.Show "Amazing, you've won in this impossible game!" |> ignore
    startGame 0 ()
  elif ((getMatches game.StoneIDs game.StoneCoords game.StoneStates).Length = 0) then 
    if (shuffleStones tryArrangeSolvable) then
      MessageBox.Show "No more possible moves. Shuffling the remaining stones." |> ignore
    else
      MessageBox.Show "No more possible moves and not possible to shuffle." |> ignore
      startGame 0 ()

Function clickStone handles the event of a stone with a given index having been clicked. Depending on the context, that can result in selecting the stone, removing a stone pair, unselecting all the stones, etc.

let clickStone stone = 
  let selectStone s =
    setStoneState Selected s
    game.CurSelected <- Some(s)
    let url, lang = languages.[game.StoneIDs.[s]]
    let status = window.FindName("StoneName") :?> TextBlock
    status.Text <- lang
  let s = match stone with
              | Some stoneIdx when (isFree game.StoneCoords game.StoneStates stoneIdx) -> Some stoneIdx
              | _ -> None
  match s, game.CurSelected with
  | Some c, Some s when c = s -> unselectAll ()
  | Some c, Some s when game.StoneIDs.[c] = game.StoneIDs.[s] -> removeStonePair c s
  | Some c, None -> selectStone c
  | _, _ -> unselectAll ()

UI/Input

The final part is responsible for the input: mouse clicks on the board and menu item clicks.

Mouse clicks are handled by attaching a custom function (handleMouseClick) to the filtered stream of "mouse down" events sent to the board canvas control:

let handleMouseClick (mx, my) = 
  let stone = 
    game.StoneCoords 
    |> Array.map getStoneLocation
    |> Array.tryFindLastIndexi (fun n (x, y, w, h) -> 
      game.StoneStates.[n] <> Hidden && mx > x && my > y && mx < x + w && my < y + h)
  clickStone stone

let events = 
  let canvas = window.FindName("BoardCanvas") :?> Canvas
  window.MouseDown
  |> Event.filter (fun mi -> (mi.ChangedButton = Input.MouseButton.Left && 
                              mi.ButtonState = Input.MouseButtonState.Pressed))
  |> Event.map (fun mi -> (mi.GetPosition(canvas).X, mi.GetPosition(canvas).Y))
  |> Event.add handleMouseClick

Menu item clicks are bound to the corresponding functions (the binding itself happens via a helper function bindMenuItem):

let bindMenuItem (name, fn) =
  let menuItem = window.FindName(name) :?> MenuItem
  menuItem.Click.Add(fun _ -> fn ())

[ "MenuUndo", undoMove
  "MenuShuffleSolvable", fun _ -> shuffleStones tryArrangeSolvable |> ignore
  "MenuShuffle", fun _ -> shuffleStones arrangeRandom |> ignore
  "MenuShowFree", showFreeStones
  "MenuShowMatches", showMatchingStones
  "MenuHideLayer", fun _ -> hideLayers 1
  "MenuUnhideLayer", fun _ -> hideLayers -1
  "MenuRandom", startGame ((new System.Random()).Next(0, Array.length layouts))
  "MenuTurtle", startGame 0
  "MenuDragon", startGame 1
  "MenuCrab", startGame 2
  "MenuSpider", startGame 3
  "MenuExit", window.Close
  "MenuAbout", fun _ -> Diagnostics.Process.Start ABOUT_URL |> ignore
] |> List.iter bindMenuItem

The final few lines add a handler for a click on the currently selected stone's URL "button" in the status bar (which opens a web browser with the linked page):

let stoneURL = window.FindName("StoneInfoURL") :?> Button
stoneURL.Click.Add(fun _ -> 
                    match game.CurSelected with
                    | Some sel -> Diagnostics.Process.Start (fst languages.[game.StoneIDs.[sel]]) |> ignore
                    | None -> ())

That's all. There are many questions unanswered and things to improve here. I hope to address some of them later.