Dec 7, 2014

A story about an angry carrot and a floating point fairy

Prologue

Once upon a time, there was a kind and happy farmer, named Bob.

Every year he planted a row of ten thousand carrots, and that would yield a few truckloads of the crunchy vegetables.

Thing is, the carrots needed to be planted very carefully.

He was a firm believer that in order to get a great crop, one should plant the carrots at exactly 1.25 inches distance from each other, so that every carrot gets enough room for growth, but at the same time does not stand too far away from the neighbors.

Like this:

(also, the carrot row was offset from the start of the field by exactly 1254.8 inches).

Year after year, he was doing it manually, until once he discovered that he could learn to program computers and automate the carrot planting process with their help.

Bob's program to plant the carrots

Armed with this new knowledge and with a C++ compiler, he jumped right into writing the code (it did look like quite a simple task, mind you):

So, there was a piece of code to set up the constants:

const int NUM_CARROTS = 10000;      // total number of carrots to plant
const float OFFSET_WIDTH = 1254.8f; // distance to the start of the carrot patch (inches)
const float PLANT_INTERVAL = 1.25f; // the optimal planting interval (inches)

And the actual planting code:

float carrot_pos[NUM_CARROTS];   // carrot positions go here

// plants a single carrot
void plant_carrot(int idx) {
  carrot_pos[idx] = OFFSET_WIDTH + idx*PLANT_INTERVAL;
}

// plants all carrots
void plant_carrots() {
   for (int i = 0; i < NUM_CARROTS; i++) plant_carrot(i);
}

And the code to print the resulting carrot positions:

void print_carrots() {
  for (int i = 0; i < NUM_CARROTS; i++)
    printf("Carrot %d goes to %.2f inches\n", i, carrot_pos[i]);
}

Easy! When running, it would print:

Carrot 0 goes to 1254.80 inches
Carrot 1 goes to 1256.05 inches
Carrot 2 goes to 1257.30 inches
Carrot 3 goes to 1258.55 inches
...

Up until carrot number 9999. Precisely as Bob expected, totally matching his previous manual planting experience.

Inspired by the great success, Bob decided to pursue the software development best practices, and write a test for this code.

The test would take every carrot and check the distance to the previous one (this distance must be exactly 1.25 inches, as already mentioned):

// finds if a single carrot is planted correctly
bool is_planted_ok(int idx) {
  return (carrot_pos[idx] - carrot_pos[idx - 1]) == PLANT_INTERVAL;
}

// checks that all of the carrots are planted correctly
void test_carrots() {
  int num_bad = 0;
   for (int i = 1; i < NUM_CARROTS; i++) {
    if (!is_planted_ok(i)) {
      printf("The carrot %d is planted incorrectly!!!\n", i);
      num_bad++;
    }
  }
  if (num_bad == 0) printf("All carrots placed well!!!\n");
  else printf("%d of %d carrots placed incorrectly... :(\n", num_bad, NUM_CARROTS);
}

Meet the Angry Carrot

Confident, Bob run the test.

And imagine his dismay! Instead of the expected "All carrots placed well", to see this:

The carrot 2273 is planted incorrectly!!!
1 of 10000 carrots placed incorrectly... :(

The carrot number 2273! A single, an only one from 10000! Why, in the whole world?..

Not believing his own eyes, he tried and double checked the printout:

...
Carrot 2271 goes to 4093.55 inches
Carrot 2272 goes to 4094.80 inches
Carrot 2273 goes to 4096.05 inches
Carrot 2274 goes to 4097.30 inches
Carrot 2275 goes to 4098.55 inches
...

The numbers look perfectly fine on the screen! So why does the test fail???

Despaired, Bob turned to the online forums.

There were a lot of very helpful people there. They immediately suggested him to read a fundamental work called "What every carrot planter should know about carrot planting", where he'd presumably find answers to all of his questions.

Except he didn't; the fundamental work was undeniably full of wisdom (judging by the amount of Greek symbols in there, for one thing) but it never mentioned a single carrot, neither the possible repercussions of their planting thereof.

Suffice to say, that was a real disaster.

Meet the Floating Point Fairy

Indeed, that's when the Floating Point Fairy appears:

"Hello, Bob. I am the Floating Point Fairy, and I popped out from the thin air in order to help you. Or at least, try to. You are not the first one experiencing these troubles, you know."

Decimal in binary and inexact representation

Let's start from a revelation:

Number 1254.8 is not what what you think it is. Neither is 0.8, nor a lot of other normally-looking, nice decimal fractional numbers.

You know, 1254.8 is a decimal value, but inside computers all numbers are always binary (well, except maybe for Soviet Russia). And they use a special binary floating point representation: there is a bit for the sign, a few bits for the "exponent", and the rest is used for the so-called "mantissa"... but first let's see how do we get there.

There are certain rules of how this decimal to binary number conversion happens. We can look at the whole part (before the dot) and the fractional part (after the dot) separately: those two get simply combined afterwards, to get the resulting number in the binary representation.

Decimal to binary (whole numbers)

For the whole number the decimal->binary conversion algorithm is to iteratively divide the number by 2, until we get to 0. At every step, the modulo of division contributes to the final binary number (added from the left side).

For example, to convert decimal number 135 to binary we would perform a sequence of steps:

Current Decimal Current Binary
135
671
3311
16111
8 0111
4 00111
2 000111
1 1000111
10000111

Thus, 135 (Decimal) == 10000111 (Binary).

Note that the resulting binary representation is always finite (since the original decimal number is finite, and since we are dividing by 2, we'll eventually get to 0).

Decimal to binary (fraction, sum of powers of 1/2)

For the fractional part, the algorithm is a kind of reverse: we iteratively multiply by 2 (until we get to 1), and the leftmost digit of the multiplication result (the one that happens to be before the dot) contributes to the rightmost side of the resulting binary number.

For example, for the decimal 0.6875 the steps are:

Current Decimal Current Binary
0.6875
[1]0.3751
[0]0.75 10
[1]0.5101
[1]1011

Thus, 0.6875 (Decimal) == 0.1011 (Binary). Which is also finite.

Decimal to binary (fraction, inexact)

Now, let's look at another number, 0.8, and try to apply the same algorithm to it:

Current Decimal Current Binary
0.8
[1]0.61
[1]0.2 11
[0]0.4110
[0]0.81100
[1]0.611001
......

See what happens?.. While multiplying by 2, we keep cycling through 0.8, 0.6, 0.2, 0.4... it goes into an infinite loop!

Thus, 0.8 (Decimal) == 0.11001100(1100)...

Rounding errors

What happens to those infinite binary tails?.. They get chopped!

So, as you now see, what we store in the computer for those numbers, is not an exact representation, because the exact one in this case would take an infinite number of digits, and we can't afford that many. We only have 23 bits for the blue part (which is called mantissa), so we need to chop and round.

Different rounding rules may be used for this chopping. And that's how we get those red digits, which are, in a sense, garbage, since they don't exactly correspond to the number that we want.

Error cancellation (and the catastrophic one)

What may happen if we add a chopped number with the "red garbage" (OFFSET_WIDTH, which is 1254.8) to a number without garbage (multiple of PLANT_INTERVAL, which is 1.25)? It will always be a number with red garbage (i.e. non-exact number, affected by the rounding error).

What happens afterwards, when we subtract two such numbers (with the "red garbage")? In our particular case two things happen:

  • Either the red garbage tails from these two numbers miraculously "cancel" each other, so that we again incidentally get the number without garbage (and it's equal to 1.25). This happens in 9999 out of 10000 cases (again, for our particular example)
  • Or, in a single case, for the carrot number 2273 the bits in the subtracted garbage tails happen to be totally different (due to the rounding specifics) and what happens instead is so-called catastrophic cancellation, when garbage suddenly gets promoted to what is supposed to be significant digits.

The interesting thing is that carrots 2272 and 2273 are corresponding to the offsets 4094.80 and 4096.05. Note how 4096, which is 2^12, is right between them.

Anyway, the conclusion is: whenever there is an inexact decimal number representation involved (together with arithmetic operations), there is a high chance that you'll be badly affected by the rounding errors.

Let's fix it: Exact arithmetics

"I know!!!" - exclaimed Bob - "If the problem is that we get rounding errors because of the inexact number representation, just let's make sure that all the numbers are always exact, and we avoid rounding errors altogether!":

const float OFFSET_WIDTH = 1254.5f;  // <== NOTE: slightly changed, and now is exact in binary

And here we go, all carrots placed well!

"Sure," - replied the Fairy - "it worked for you now, but don't you think it's a fragile approach? You have to keep a close eye on every number and operation (granted, your arithmetic is very simple here)... besides, you changed the offset of the carrot patch, haven't you? Don't you think it might upset the carrots?"

Let's fix it: Use an epsilon

"Alright," - said Bob - "I know what I'll do. I won't compare the floating numbers exactly, but rather that they are close enough... that's what many people on the forums suggested, anyway".

"Will you, now?.. And how close do you think is close enough?" - mused the Fairy.

"Well... very close. I'll pick some small number with enough zeros after the dot":

const float EPSILON = 0.000001f;  // kind of small

bool is_planted_ok(int idx) {
  float  d = carrot_pos[idx] - carrot_pos[idx - 1];
  // check that numbers are not further away than EPSILON
  return (fabs(d - PLANT_INTERVAL) <= EPSILON); 
}

However, that didn't work. The angry carrot was still angry.

"Hmm... maybe that was too many zeros, after all. Let's remove one":

const float EPSILON = 0.00001f;   // should be big enough?..

...unfortunately, that did not work either.

"Alright, one more zero, then":

const float EPSILON = 0.0001f;    // maybe now it's big enough?..

Still failing! And that's already quite a big epsilon, mind you!

We can't keep removing zeros after the dot just like that!

Let's go more fine-grained:

const float EPSILON = 0.0002f;    // this MUST be enough

Guess what? It still did not help. Alright, a final attempt:

const float EPSILON = 0.0004f;    // WHAT ABOUT NOW?.. Q_Q 

Whew. That finally worked!

"But how reliable is that?" - asked the Fairy - "You sure it will work with different numbers? Are you going to pick this number every time you tweak your constants?"

Let's fix it: Relative epsilon

"Uhm. Nope, that does not feel right" - said Bob - "But I found a better way on the Internet! Let's use a relative epsilon":

const float EPSILON = 0.0002f;  // the "relative" epsilon 

inline bool float_close_relative(float x, float y, float eps) {
  return (fabs(x - y) <= std::max(fabs(x), fabs(y))*eps);
}

bool is_planted_ok( int idx) {
  float d = carrot_pos[idx] - carrot_pos[idx - 1];
  return float_close_relative(d, PLANT_INTERVAL, EPSILON);
}

"Alright, that worked." - said the Fairy - "But what if you bump your epsilon back down a notch?"

const float EPSILON = 0.0001f;    // should work as well, shouldn't it?..

"Oh no! What's going on? It's failing again!" - exclaimed Bob.

"You see," - answered the Fairy - "relative epsilon comparison only fixes one thing: that you'd need different EPSILON values when comparing big numbers and when comparing the numbers close to 1. In your case PLANT_INTERVAL is 1.25, which is already quite close to one, so while this relative epsilon formula is a generally good idea, it does not fix your problem... oh, and besides yours does not handle corner cases very well, but let's not go there."

Let's fix it: Bump the precision

"Alright, I know what else I could try." - said Bob after some contemplation - "I will just bump the precision of all numbers from single (32 bit) to double (64 bit) and see what happens":

const double  OFFSET_WIDTH = 1254.8;
const double PLANT_INTERVAL = 1.25;
double carrot_pos [NUM_CARROTS];

void plant_carrot(int idx) {
  carrot_pos[idx] = OFFSET_WIDTH + idx*PLANT_INTERVAL;
}

const double  EPSILON = 0.000001;
bool is_planted_ok(int idx) {
  double d = carrot_pos[idx] - carrot_pos[idx - 1];
  return (abs(d - PLANT_INTERVAL) <= EPSILON);
}

What happened is that it worked: the test passed. "Is this the solution, then?" - cautiously asked Bob.

"It may be." - answered the Fairy - "Sometimes. You have to realize, though, that by simply increasing the precision of your computations, you might as well just postpone your problem until later. Which is rarely a good thing. Besides, keep in mind that this way you use twice the memory for the data... which can affect the performance due to the increased cache misses, decreased SSE register utilization and what have you"

Let's fix it: Modify the formulas

"OK then, I can do it a smart way. Someone on the forums mentioned that one could take an algorithmic approach and modify the formulas, so that to avoid catastrophic cancellation and/or compensate for the rounding error":

const float EPSILON = FLT_EPSILON;
bool is_planted_ok(int idx) {
  float cp = carrot_pos[idx - 1] + PLANT_INTERVAL;
  return float_close_relative(carrot_pos[idx], cp, EPSILON);
}

"We have a subtraction with catastrophic cancellation inside the testing code itself, so instead of subtracting two neighboring carrot positions and comparing the result with the PLANT_INTERVAL, we will add the PLANT_INTERVAL to the previous carrot's position, and that should be close to the current carrot position!"

"Nice!" - smiled the Fairy - "And I've noticed that your epsilon is now not an arbitrary hand-picked number, but..."

"...but the smallest number such that 1.0 + EPSILON is not equal to 1.0 !" - proudly declared Bob.

"You are making a good progress. But could you find a way to avoid using epsilons altogether?.."

Let's fix it: An "arbitrary precision" library

"I did more research," - proclaimed Bob - "and found out that it's possible to do floating arithmetics directly in decimal format, avoiding the hardware binary representation. People do that for financial calculations (because that's the area where rounding errors translate directly in the lost money), for example there is BigDecimal class in Java. And there are libraries in other languages. So we could do something like:"

#include <boost/multiprecision/cpp_dec_float.hpp>

typedef boost::multiprecision::cpp_dec_float_100 decimal;

const decimal OFFSET_WIDTH("1254.8");
const decimal PLANT_INTERVAL("1.25");
decimal carrot_pos[NUM_CARROTS];

void plant_carrot(int idx) {
  carrot_pos[idx] = OFFSET_WIDTH + idx*PLANT_INTERVAL;
}
bool is_planted_ok(int idx) {
  return  (carrot_pos[idx] - carrot_pos[idx - 1]) == PLANT_INTERVAL;
}

"And it works! See, no epsilons involved!"

"Well, that's an option, certainly." - said the Fairy - "Keep in mind, though, that there is a computation cost to pay, so using this can be an overkill. Besides, I see that you've included a Boost header..."

"Is there anything wrong with Boost?"

"Nothing, it's a very good, production-quality library." - quickly answered the Fairy - "Let's move on. What else could you do?"

Let's fix it: Fixed point computations

Soon after Bob came back: "I know what else we have not tried yet! Fixed-point arithmetic! Let's just sweep this pesky floating point under the rag and pretend it never existed!"

const long OFFSET_WIDTH = 125480;    // distance to the start of the carrot patch (0.01 inches) 
const long PLANT_INTERVAL = 125;     // the planting interval (0.01 inches) 
long carrot_pos[NUM_CARROTS];        // carrot positions go here 

void plant_carrot( int idx) {
  carrot_pos[idx] = OFFSET_WIDTH + idx*PLANT_INTERVAL;
}

bool is_planted_ok( int idx) {
  return (carrot_pos[idx] - carrot_pos[idx - 1]) == PLANT_INTERVAL;
}

"See, it works! I am using only integers here!"

"Of course it does." - smiled the Fairy - "And it's a viable option, sometimes. Some embedded systems still may not have floating point implemented in hardware. But keep in mind, that your code may become much harder to comprehend, especially with complex computations. Any other ideas?"

Let's fix it: Rational numbers

"Gee," - Bob came back after a while - "I figured I could do the fraction computations myself, via rational numbers. I even wrote a little class for that":

class rational {
public:
  rational(int nom = 0, int den = 1) {
    int n = gcd10(nom, den);
    _n = nom/n;
    _d = den/n;
  }

  rational operator +(const rational& r) const { return rational(_n*r._d + r._n*_d, _d*r._d); }
  rational operator -(const rational& r) const { return rational(_n*r._d - r._n*_d, _d*r._d); }
  friend rational operator  *(int  n,  const rational& rhs) {  return  rational(n*rhs._n, rhs._d); }
  bool operator ==( const rational& r)  const  { return  (_n == r._n) && (_d == r._d); }

private:
  int _n, _d; // nominator/denominator
  static int gcd10(int a, int b) { return (a%10 | b%10) ? 1 : 10*gcd10(a/10, b/10); }
};

"Now, I can plant with rational numbers, with none of angry carrots:"

const rational OFFSET_WIDTH = rational(1254) + rational(8, 10);
const rational PLANT_INTERVAL = rational(1) + rational(25, 100);

rational carrot_pos[NUM_CARROTS];

void plant_carrot( int idx) {
  carrot_pos[idx] = OFFSET_WIDTH + idx*PLANT_INTERVAL;
}

bool is_planted_ok( int idx) {
  return (carrot_pos[idx] - carrot_pos[idx - 1]) == PLANT_INTERVAL;
}

"Sure, why not." - said the Fairy - "Don't forget, though, that irrational numbers also exist. One fine day you may need a square root, or something... oh, and by the way, did you know that some programming languages have the native support for the rational numbers? Check out this Clojure REPL session:"

Clojure 1.3.0
  user=> (def a 0.1)
  user=> (def b 0.2)
  user=> (def c 0.3)
  user=> [a b c]
  [0.1 0.2 0.3]
  user=> (= c (+ a b))
  false
  user=> (def ar (/ 1 10))
  user=> (def br (/ 2 10))
  user=> (def cr (/ 3 10))
  user=> [ar br cr] 
  [1/10 1/5 3/10]
  user=> (= cr (+ ar br))
  true

"So next time you might even want to use a different programming language. Right tool for the job, and all that. Now, as I mentioned, it's still not an ideal solution..."

For every answer, two more questions

At this point, the reader might have already started wondering if there is a happy end to this story.

Did Bob eventually find the solution?

Well, yes and no. He realized that floating-point is hard, and there is no an one-size-fits-all recipe to make it any easier. However, something that he did instead, was understanding and embracing this complexity.

After all, it could have been much worse without the IEEE 754 standard.

And that, in a sense, was a happy end.

Disclaimer

All characters appearing in this work are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.

Except for the carrots. Those were real, goddamit.

Dec 1, 2014

Seven languages in seven hours

One day colleagues posted on the intranets a "programming challenge" (which was, in fact, borrowed from reddit):

Find the (last) longest sub-string of the given string that contains at most two unique characters.

For example:

"abcbcbcdeffgggfgfabdcbgfddffgggfgfcdab"

...in which case the result would be the last (green) "ffgggfgf" substring.

Since almost immediately people came up with all kinds of nice solutions in Java/C++/C# and other human programming languages, there was barely anything one could do to contribute more than that. That's what made me try doing something useless, and in the end this turned out to be a surprisingly fun and fulfilling experience which I'd like to try sharing.

In a sense, it was a "whirlwind tour into functional programming (syntax)". But first things first.

Regarding the algorithm itself - it is quite simple, and here's a variation of it that I came up with:

  • Look at every character ch in turn
  • Maintain the indices of the start and end of the current (latest) 2-unique-characters substring, cb and ce
  • Also have a "middle pointer", cs, which points to the last single-character substring inside the current cb/ce substring
  • Depending on whether ch is equal to str[cb] or str[cs] or something new, update the pointers cb and cs correspondingly.
  • On top of all that, maintain the start/end of the current longest 2-unique-characters substring, mb and me, and update those at every step correspondingly

Following is a little interactive thingie to illustrate the algorithm (beware inline JavaScript).

Let's say, we've got an example input string (you can try experimenting, entering different strings):

The current variable value table:

ch current character
ch0 current "older" character (cached, same as str[cb])
ch1 current "newer" character (cached, same as str[cs])
cb 0 current begin index
cs 0 current start of the latest ch1-only sunstring
ce 0 current end index
mb 0 current max substring begin index
me 0 current max substring end index

Use the controls to run/step/pause the algorithm and observe the current variable values:



Based on that, the reference implementation of the algorithm in C is quite straightforward:

char* longest2UniqueCharSubstring(char* str, char** end) {
    int mb = 0, me = 0, cb = 0, ce = 0, cb1 = 0;
    char ch0 = 0, ch1 = 0, ch;
    while (ch = str[ce]) {
        if (ch != ch1) {
            if (ch != ch0) {
                cb = cb1; cb1 = ce; ch0 = ch1; ch1 = ch;
            } else {
                cb1 = ce; ch0 = ch1; ch1 = ch;
            }            
        }
        ce++;
        if (me - mb <= ce - cb) {
            me = ce;
            mb = cb;
        }
    }
    *end = &str[me];
    return &str[mb];
}

(note that here we are already making some fishy assumptions, not directly stated in the problem - such as the input being an ASCII string)

Now, what's about all this "functional programming" business?.. Well, the main idea is that with functional programming all your program is like one big mathematical formula. If you look again at that "variable table", there are things being changed ("mutated") at every step, which is something that they don't really have in math formulas.

But then the question arises, is it possible to represent an algorithm, such as this, as a formula?.. Or any other program, for that matter?

Apparently it is. For example, we could express our algorithm like:

It may appear intimidating, but in fact it's quite simple and represents exactly the same algorithm as we looked into a few moments ago. Just this time without changing the "variable table" - every time we want to introduce a new value, we give it a new name (which is most often done with a subscript/overscript index). And that is essentially one of the cornerstones of functional programming. It's called immutability.

And then, in some sense, a "functional program" may be understood as this kind of formula translated directly into code.

So that was the idea - to come up with a fixed (functional programming) semantics and try to map it onto seven programming languages' syntax. Giving that I only have had a minimal exposure to three of them (Clojure, Scala, F#), that felt like an interesting research exercise. Thus is the name of the article - it took roughly an hour for each language to do the basic syntax research. Obviously, one can not learn proper idioms, tools, libraries in this time, but that was not the goal in this particular case.

Speaking about semantics, in addition to immutability there are few other "functional programming features" that we are going to use here, such as tuples, destructuring, higher-order functions (in particular, fold), closures/lambdas and to some extent pattern matching.

Then, the "mapping" from our math formula to a program, taking into account all the mentioned things, could be schematically represented like:

And that's almost all of it - we have already written our functional implementation of the algorithm... the only thing to do now is to add some language-specific syntax meat :)

So let's begin!

Designed by Simon Peyton Jones
When 1990
Platform Compiled or interpreted
Inspired by Clean, Miranda, ML, Scheme
Paradigm Functional
Typing Static. Like, really.
Nesting Significant whitespace
TIOBE rank 36
Fun fact Originally created as a research language
module Challenge06 where
-- |Finds the (last) longest substring that contains at most 2 unique chars
longest2UniqueCharSubstring :: String -> String
longest2UniqueCharSubstring s = take (mb - ma) . drop ma $ s
    where ((ma, mb, _, _), _, _, _) = foldl f init s
          -- initial accumulator value
          init = ((0, 0, 0, 0), 0, '\0', '\0')
          -- the fold lambda
          f = \ ((mb, me, cb, ce), cb1, ch0, ch1) ch ->
            let ce1 = ce + 1 
                getMax = \b e -> if (me - mb) > (e - b)
                    then (mb, me, b, e) else (b, e, b, e)
            in case () of _ | ch1 == ch -> (getMax cb  ce1, cb1, ch0, ch1)
                            | ch0 == ch -> (getMax cb  ce1, ce,  ch1, ch )
                            | otherwise -> (getMax cb1 ce1, ce,  ch1, ch )

I perceived Haskell as the most hardcore out of seven, mostly due to its purity and typing. Granted, it took me more time to do the input/output/benchmarking part (because of the peculiar way Haskell handles IO... it just did not make sense to me at first) than to write the actual algorithm.

The syntax itself is very terse - just look at the lambda syntax ("\ param1 param2 ... -> body"), or the way one can compose functions. E.g. the expression "take (mb - ma) . drop ma $ s" composes two (curried) functions with "." and then applies result to an argument via "$". No braces or anything.

The "let bindings in expression" and "expression where bindings" constructs both have to do with local variables (uhm... sorry, value bindings), just having things in opposite order (with the "where" one being more unusual).

An interesting thing is the "type signature" for the function, i.e. types of function input parameters and the output, which goes in a separate line right before the function itself. "longest2UniqueCharSubstring :: String -> String" means that "function longest2UniqueCharSubstring takes parameter of type String and returns parameter of type String". It's not obligatory (at least in this case), but is considered a good style to have it.

The "case () of _" construction is rather a hack to emulate Lisp's COND function (which seemed to be the most straightforward fit for that big curly "choice" brace in our math formula), using a powerful pattern matching facility in a primitive way.

Tuples syntax and destructuring in assignment and function parameters works pretty much as one would expect.

I imagine that after writing some working Haskell code one should start feeling like a secret ninja alien robot from space, and it's probably true. But in this particular piece of code there is not much magic.



Designed by Martin Odersky
When 2003
Platform JVM
Inspired by Eiffel, Erlang, Haskell, Java, Lisp, OCaml et al
Paradigm Both Functional and Object-Oriented
Typing Statically typed, with type inference
Nesting Curly braces, significant whitespace
TIOBE rank 41
Fun fact Scala's Actor framework (Akka) was backported to Java
object Challenge06 {
  /** Finds the (last) longest substring that contains 
    * at most two unique characters
    */
  def longest2UniqueCharSubstring(s:String) = {
    //  initial accumulator value
    val init = ((0, 0, 0, 0), 0, '\0', '\0')
    val ((mb, me, _, _), _, _, _) = s.foldLeft (init) {
      //  the fold lambda
      case (((mb, me, cb, ce), cb1, ch0, ch1), ch) => {
        val ce1 = ce + 1
        val getMax = {(b:Int, e:Int) =>
          if (me - mb > e - b) (mb, me, b, e) else (b, e, b, e)}
        ch match {
          case `ch1` => (getMax(cb,  ce1), cb1, ch0, ch1)
          case `ch0` => (getMax(cb,  ce1), ce,  ch1, ch )
          case _     => (getMax(cb1, ce1), ce,  ch1, ch )
        }
      }
    }
    s substring (mb, me)
  }
}

Scala is such a nice language!

But it's also a big language, with a lot of features. Which may or may not be a good thing.

Just look at the funny pattern matching syntax we have there. What we wanted there was to match value "ch" vs. the lambda parameter values "ch1" or "ch0". We use backticks there (`ch1`, `ch0`) in order to actually match with the existing value instead of doing a local binding (again, Lisp's COND style).

What we could also do is declaring intermediate values "val Ch1 = ch1" and "val Ch2 = ch2" and then doing "case Ch1 => ..." and "case Ch0 => ..." See, if its name starts with uppercase, then it also won't try do the local binding, but match the actual values instead... oh, and I won't be surprised if there are about five other ways to do the same thing, which I am not aware about.

Weird.

The destructuring in the function parameters is not supported directly, but one can use a "case" hack to emulate that (I did it for the "fold lambda").

Note that explicit argument types are required for function (and lambda) parameters, unlike in some other statically typed languages with type inference.

The top-level "object" in the code snippet is actually considered a "module" in Scala lingo.



Designed by Don Syme
When 2005
Platform .NET
Inspired by ML-family (OCaml)
Paradigm Functional/OOP hybrid
Typing Static, type inferred
Nesting Significant whitespace
TIOBE rank 15
Fun fact Most OCaml code is valid F#
?module Challenge06 = 
  /// Finds the (last) longest substring that contains 
  /// at most two unique characters
  let longest2UniqueCharSubstring (s:string) = 
    let ((mb, me, _, _), _, _, _) = 
      Seq.fold 
        //  the fold lambda
        (fun ((mb, me, cb, ce), cb1, ch0, ch1) ch -> 
          let ce1 = ce + 1
          let getMax b e = (if (me - mb > e - b) then 
                               (mb, me, b, e) else (b, e, b, e))
          match () with
            | _ when (ch = ch1) -> (getMax cb  ce1, cb1, ch0, ch1)
            | _ when (ch = ch0) -> (getMax cb  ce1, ce,  ch1, ch )
            | _                 -> (getMax cb1 ce1, ce,  ch1, ch ))
        //  initial accumulator value
        ((0, 0, 0, 0), 0, '\000', '\000')
        //  fold sequence argument
        s
    s.Substring(mb, me - mb)

For those people who might want to discard it with the often-heard "bah, it's Microsoft" - I assure you, it's a really cool language.

The syntax is terse thanks to the type inference and indentation rules (for example, unlike in OCaml, there is no need for "in" in the "let bindings in expression" blocks, however things must be properly indented).

I am also using a "COND" hack here, similarly to how I did that with Haskell.



Designed by Rich Hickey
When 2007
Platform JVM, .NET, JavaScript
Inspired by Lisp family, Erlang, Haskell
Paradigm Functional
Typing Dynamic
Nesting Parentheses!!!
TIOBE rank 50
Fun fact You can get contacted by headhunters just by doing 4clojure challenges
(ns challenge06 (:gen-class))
(defn longest2UniqueCharSubstring 
  "Finds the (last) longest substring that contains 
   at most two unique characters"
  [s]
  (let [ 
    [[mb me _ _] _ _]
    (reduce 
        ;; the fold (reduce) lambda
        (fn [[[mb me cb ce] b1 ch0 ch1], ch]
              (let [ ce+    (inc ce)
                     getMax (fn [b e] (if (> (- me mb) (- e b))
                                          [mb me b e] [b e b e]))]
                (cond
                  (= ch ch1) [(getMax cb ce+) b1 ch0 ch1]
                  (= ch ch0) [(getMax cb ce+) ce ch1 ch ]
                  :else      [(getMax b1 ce+) ce ch1 ch ])))
        ;; initial accumulator value           
        [[0 0 0 0] 0 \o0 \o0] 
        ;; fold sequence argument
        s)]
    (subs s mb me)))

Clojure is a modern Lisp.

With smart and well thought-out syntactic modifications which make it more readable than vanilla Lisp (for example, you get square brackets into the mix (so it's not just bazillions of parens anymore)).

Oh, and it has a proper COND :) I know, I could have just used "if/elseif/else" blocks in all the rest of the cases, anyway... but the goal was to explore if we can come up with an equivalent construct in all the languages.

By the way, the ":gen-class" in the namespace declaration was needed in order to force the Clojure compiler to dump the JVM IL (which otherwise is generated on the fly). Normally it's not needed.



Designed by Joe Armstrong
When 1986
Platform Erlang Virtual Machine (BEAM)
Inspired by Prolog, Lisp
Paradigm Functional
Typing Dynamic
Nesting comma-dot, semicolon-"end" blocks
TIOBE rank 47
Fun fact Initially was created for Telecom
-module(challenge06).
-export([longest2UniqueCharSubstring/1]).

%% @doc Finds the (last) longest substring that contains
%% at most two unique characters.
longest2UniqueCharSubstring(S) ->
  {{Mb, Me, _, _}, _, _, _} = lists:foldl(
    % the fold (reduce) lambda
    fun (Ch, {{Mb, Me, Cb, Ce}, Cb1, Ch0, Ch1}) ->
      Ce1 = Ce + 1,
      GetMax = fun
        (B, E) when (Me - Mb > E - B) -> {Mb, Me, B, E};
        (B, E) -> {B, E, B, E}
      end,
      if Ch =:= Ch1 -> {GetMax(Cb,  Ce1), Cb1, Ch0, Ch1};
         Ch =:= Ch0 -> {GetMax(Cb,  Ce1), Ce,  Ch1, Ch };
         true       -> {GetMax(Cb1, Ce1), Ce,  Ch1, Ch }
      end
    end,
    % initial accumulator value
    {{1, 1, 1, 1}, 1, $0, $0}, 
    % fold sequence argument
    S),
  string:sub_string(S, Mb, Me - 1).

This is the oldest and the most production-proven language of all seven (at least at the current moment). It is well-known for its capabilities when creating fault-tolerant, highly distributed systems. Something that we don't really care about right now.

It is also a "functional" programming language at its core. The syntax was inspired by Prolog, which explains those weird commas, semicolons and dots at strange places.

Some observations:

  • Variable names must always start from the uppercase
  • Module exports must be explicitly listed in the "-export" section (specifying the number of function arguments)
  • Array indexing starts from 1
  • Strings are represented as linked(!) lists of integers (which take 8 bytes on 64-bit platforms)

The last one is interesting.

In the first-pass algorithm implementation I was not keeping ch0 and ch1 characters explicitly, but only their indices inside the string (cb and cs correspondingly), looking up the character by indices when required. Imagine my surprise when the Erlang version would just appear to hang instead of producing the result.

What happened is that string being linked lists, the lookup by index is O(N) and the whole algorithm was O(N^2) instead of O(N). Which made it just too slow on my test data.

I actually had an opportunity to ask Joe Armstrong himself about why he made such a design decision. The answer was that "strings don't exist" (by the way, the whole talk on that link was pretty amusing). I kind of understand the sentiment, but still...

Designed by James Strachan
When 2003
Platform JVM
Inspired by Python, Ruby, Perl, Smalltalk
Paradigm OOP, with functional features
Typing Dynamic, partially static
Nesting Curly braces
TIOBE rank 44
Fun fact Most Java code is valid Groovy
class Challenge06 {
    /**
     * Finds the (last) longest substring that contains
     * at most two unique characters
    */
    static def longest2UniqueCharSubstring(String s) {
        // initial accumulator value
        def init = [[0, 0, 0, 0], 0, '\0', '\0']
        def (mpos) = s.inject (init) {acc, ch ->
            // the fold (inject) lambda body
            def (pos, cb1, ch0, ch1) = acc
            def (mb, me, cb, ce) = pos
            def ce1 = ce + 1
            def getMax = {b, e -> (me - mb > e - b) ?
                    [mb, me, b, e] : [b, e, b, e]}
            switch (ch) {
                case ch1: [getMax(cb,  ce1), cb1, ch0, ch1]; break
                case ch0: [getMax(cb,  ce1), ce,  ch1, ch ]; break
                default:  [getMax(cb1, ce1), ce,  ch1, ch ]
            }
        }
        s.substring(mpos[0], mpos[1])
    }
}

This language was created as a "nicer Java", and at times it does feel exactly like that. Fun fact: James Strachan himself once said, that if he knew that Scala will appear, he would not create Groovy, to start with.

Nevertheless, the language is still neat and practical, and I believe is worth looking into. For one thing, there is already a bunch of code written in it for different projects.

There are some quirks, for example nested destructuring in assignment is not supported (and none supported at all in the function parameters), so I had to "flatten" it with multiple reassignments.

The switch/case statement is Java-like, which is kind of lame.



Designed by Jeremy Ashkenas
When 2009
Platform JavaScript
Inspired by Ruby, Python, Haskell
Paradigm Functional, prototype-based
Typing Dynamic
Nesting Significant whitespace
TIOBE rank 170
Fun fact First CoffeeScript compiler was written in Ruby
###*
 * Finds the (last) longest substring that contains
 * at most two unique characters
 ###
longest2UniqueCharSubstring = (s) ->
    [[mb, me]] = s.split('').reduce(
        # the fold (reduce) lambda
        ([[mb, me, cb, ce], cb1, ch0, ch1], ch) ->
            ce1 = ce + 1
            getMax = (b, e) ->
                if (me - mb > e - b) then [mb, me, b, e] else [b, e, b, e]
            switch ch
                when ch1 then [getMax(cb,  ce1), cb1, ch0, ch1]
                when ch0 then [getMax(cb,  ce1), ce,  ch1, ch ]
                else          [getMax(cb1, ce1), ce,  ch1, ch ]
        # initial accumulator value        
        ,[[0, 0, 0, 0], 0, '', ''])
    s.substring(mb, me);

Many people would argue that CoffeeScript is not even a language, but rather a syntax sugar on top of JavaScript. Which is fair. But it's pretty sweet sugar, mind you... just look at that lambda syntax!

Therefore, one could say that it's as a functional programming language as JavaScript is, just with the syntax that makes this "functional programming" nature more prominent.

Some of the features are still lacking, though. For example, there is no tuples per se (it uses arrays instead). Which is also explained by it being just a JavaScript in disguise.

Oh, by the way there is no modules. Well, you could hack them in, but that defeats the purpose, in my opinion.

Fun fact: that interactive JavaScript algorithm demonstration above is actually made from this piece of CoffeeScript code.


UPDATE Jan 13, 2017: Here's a nice, concise introduction to CoffeeScript if you are interested to know more.


So, we're done with our seven languages. A next logical thing to do was to try comparing performance of the different language implementations. Here are the system specifics:

  • The text is Lewis Carrol's "Alice's Adventures in Wonderland" downloaded from Project Gutenberg
  • All the whitespace was removed, and everything was converted to lowercase. After that, it was replicated several times. The resulting file is about ~4Mb
  • The platform was Win7 64-bit, Intel i7 (4 cores), 6Gb RAM
  • Recent (ish) software versions:
    • Scala v2.11, jvm 1.8
    • Haskell GHC v7.6.3 (-O2 flag)
    • F# v2.0, .NET 4, VS 2010
    • Groovy v2.2.2, jvm 1.8
    • Clojure v1.5.1, jvm 1.8
    • Erlang OTP v17
    • CoffeeScript v1.7.1, Node.js v0.10.28

And here's what I've got when running the mentioned code on the mentioned data:


Now, this particular problem is not a very good language benchmark.

Still, it's kind of interesting. Some observations:

  • Somehow Scala beats everyone (24ms)
  • ...but nothing can beat C (15ms). Still, can get pretty close!
  • The generated JVM IL size: Scala is the smallest, Groovy is big (which may explain its sub-par speed), Clojure is even bigger (probably due to the way it gets compiled on the fly)
  • F# translates neatly into C#-esque IL, no fluff
  • JIT performance matters. JVM seems to be better than Erlang's BEAM
  • Erlang strings are inefficient. Cache misses matter.
  • JVM 1.8 vs 1.6: Groovy for some reason is worse in 1.8, Scala better, Clojure is the same
  • "CoffeeScript performance" actually means "JavaScript performance", since all the code translates into JavaScript in a straightforward way before execution

Few more charts, out of the pure amusement.

The languages' age (Erlang being the oldest, CoffeeScript the youngest):

TIOBE popularity rank (the smaller, the more popular):

Github open source rank (the smaller, the more projects are there for this language):

You can check yourself some more interesting graphs on Open HUB. Also, try throwing C++ into the mix, just for kicks ;)

The overall feeling of experimenting with new programming languages turned out to be very exciting and appeared worth sharing. I even presented this to the colleagues at some point, which seemed to have got a very positive feedback.

So I encourage you to try doing something like this, if you have not yet:

  • Come up with a simple (but interesting) problem
  • Pick a programming language you have not used before
  • Do the research
  • Write the program (and possibly share)
  • Reflect
  • Enjoy
Good luck and gospeed!