Dec 1, 2014

Seven languages in seven hours

One day colleagues posted on the intranets a "programming challenge" (which was, in fact, borrowed from reddit):

Find the (last) longest sub-string of the given string that contains at most two unique characters.

For example:

"abcbcbcdeffgggfgfabdcbgfddffgggfgfcdab"

...in which case the result would be the last (green) "ffgggfgf" substring.

Since almost immediately people came up with all kinds of nice solutions in Java/C++/C# and other human programming languages, there was barely anything one could do to contribute more than that. That's what made me try doing something useless, and in the end this turned out to be a surprisingly fun and fulfilling experience which I'd like to try sharing.

In a sense, it was a "whirlwind tour into functional programming (syntax)". But first things first.

Regarding the algorithm itself - it is quite simple, and here's a variation of it that I came up with:

  • Look at every character ch in turn
  • Maintain the indices of the start and end of the current (latest) 2-unique-characters substring, cb and ce
  • Also have a "middle pointer", cs, which points to the last single-character substring inside the current cb/ce substring
  • Depending on whether ch is equal to str[cb] or str[cs] or something new, update the pointers cb and cs correspondingly.
  • On top of all that, maintain the start/end of the current longest 2-unique-characters substring, mb and me, and update those at every step correspondingly

Following is a little interactive thingie to illustrate the algorithm (beware inline JavaScript).

Let's say, we've got an example input string (you can try experimenting, entering different strings):

The current variable value table:

ch current character
ch0 current "older" character (cached, same as str[cb])
ch1 current "newer" character (cached, same as str[cs])
cb 0 current begin index
cs 0 current start of the latest ch1-only sunstring
ce 0 current end index
mb 0 current max substring begin index
me 0 current max substring end index

Use the controls to run/step/pause the algorithm and observe the current variable values:



Based on that, the reference implementation of the algorithm in C is quite straightforward:

char* longest2UniqueCharSubstring(char* str, char** end) {
    int mb = 0, me = 0, cb = 0, ce = 0, cb1 = 0;
    char ch0 = 0, ch1 = 0, ch;
    while (ch = str[ce]) {
        if (ch != ch1) {
            if (ch != ch0) {
                cb = cb1; cb1 = ce; ch0 = ch1; ch1 = ch;
            } else {
                cb1 = ce; ch0 = ch1; ch1 = ch;
            }            
        }
        ce++;
        if (me - mb <= ce - cb) {
            me = ce;
            mb = cb;
        }
    }
    *end = &str[me];
    return &str[mb];
}

(note that here we are already making some fishy assumptions, not directly stated in the problem - such as the input being an ASCII string)

Now, what's about all this "functional programming" business?.. Well, the main idea is that with functional programming all your program is like one big mathematical formula. If you look again at that "variable table", there are things being changed ("mutated") at every step, which is something that they don't really have in math formulas.

But then the question arises, is it possible to represent an algorithm, such as this, as a formula?.. Or any other program, for that matter?

Apparently it is. For example, we could express our algorithm like:

It may appear intimidating, but in fact it's quite simple and represents exactly the same algorithm as we looked into a few moments ago. Just this time without changing the "variable table" - every time we want to introduce a new value, we give it a new name (which is most often done with a subscript/overscript index). And that is essentially one of the cornerstones of functional programming. It's called immutability.

And then, in some sense, a "functional program" may be understood as this kind of formula translated directly into code.

So that was the idea - to come up with a fixed (functional programming) semantics and try to map it onto seven programming languages' syntax. Giving that I only have had a minimal exposure to three of them (Clojure, Scala, F#), that felt like an interesting research exercise. Thus is the name of the article - it took roughly an hour for each language to do the basic syntax research. Obviously, one can not learn proper idioms, tools, libraries in this time, but that was not the goal in this particular case.

Speaking about semantics, in addition to immutability there are few other "functional programming features" that we are going to use here, such as tuples, destructuring, higher-order functions (in particular, fold), closures/lambdas and to some extent pattern matching.

Then, the "mapping" from our math formula to a program, taking into account all the mentioned things, could be schematically represented like:

And that's almost all of it - we have already written our functional implementation of the algorithm... the only thing to do now is to add some language-specific syntax meat :)

So let's begin!

Designed by Simon Peyton Jones
When 1990
Platform Compiled or interpreted
Inspired by Clean, Miranda, ML, Scheme
Paradigm Functional
Typing Static. Like, really.
Nesting Significant whitespace
TIOBE rank 36
Fun fact Originally created as a research language
module Challenge06 where
-- |Finds the (last) longest substring that contains at most 2 unique chars
longest2UniqueCharSubstring :: String -> String
longest2UniqueCharSubstring s = take (mb - ma) . drop ma $ s
    where ((ma, mb, _, _), _, _, _) = foldl f init s
          -- initial accumulator value
          init = ((0, 0, 0, 0), 0, '\0', '\0')
          -- the fold lambda
          f = \ ((mb, me, cb, ce), cb1, ch0, ch1) ch ->
            let ce1 = ce + 1 
                getMax = \b e -> if (me - mb) > (e - b)
                    then (mb, me, b, e) else (b, e, b, e)
            in case () of _ | ch1 == ch -> (getMax cb  ce1, cb1, ch0, ch1)
                            | ch0 == ch -> (getMax cb  ce1, ce,  ch1, ch )
                            | otherwise -> (getMax cb1 ce1, ce,  ch1, ch )

I perceived Haskell as the most hardcore out of seven, mostly due to its purity and typing. Granted, it took me more time to do the input/output/benchmarking part (because of the peculiar way Haskell handles IO... it just did not make sense to me at first) than to write the actual algorithm.

The syntax itself is very terse - just look at the lambda syntax ("\ param1 param2 ... -> body"), or the way one can compose functions. E.g. the expression "take (mb - ma) . drop ma $ s" composes two (curried) functions with "." and then applies result to an argument via "$". No braces or anything.

The "let bindings in expression" and "expression where bindings" constructs both have to do with local variables (uhm... sorry, value bindings), just having things in opposite order (with the "where" one being more unusual).

An interesting thing is the "type signature" for the function, i.e. types of function input parameters and the output, which goes in a separate line right before the function itself. "longest2UniqueCharSubstring :: String -> String" means that "function longest2UniqueCharSubstring takes parameter of type String and returns parameter of type String". It's not obligatory (at least in this case), but is considered a good style to have it.

The "case () of _" construction is rather a hack to emulate Lisp's COND function (which seemed to be the most straightforward fit for that big curly "choice" brace in our math formula), using a powerful pattern matching facility in a primitive way.

Tuples syntax and destructuring in assignment and function parameters works pretty much as one would expect.

I imagine that after writing some working Haskell code one should start feeling like a secret ninja alien robot from space, and it's probably true. But in this particular piece of code there is not much magic.



Designed by Martin Odersky
When 2003
Platform JVM
Inspired by Eiffel, Erlang, Haskell, Java, Lisp, OCaml et al
Paradigm Both Functional and Object-Oriented
Typing Statically typed, with type inference
Nesting Curly braces, significant whitespace
TIOBE rank 41
Fun fact Scala's Actor framework (Akka) was backported to Java
object Challenge06 {
  /** Finds the (last) longest substring that contains 
    * at most two unique characters
    */
  def longest2UniqueCharSubstring(s:String) = {
    //  initial accumulator value
    val init = ((0, 0, 0, 0), 0, '\0', '\0')
    val ((mb, me, _, _), _, _, _) = s.foldLeft (init) {
      //  the fold lambda
      case (((mb, me, cb, ce), cb1, ch0, ch1), ch) => {
        val ce1 = ce + 1
        val getMax = {(b:Int, e:Int) =>
          if (me - mb > e - b) (mb, me, b, e) else (b, e, b, e)}
        ch match {
          case `ch1` => (getMax(cb,  ce1), cb1, ch0, ch1)
          case `ch0` => (getMax(cb,  ce1), ce,  ch1, ch )
          case _     => (getMax(cb1, ce1), ce,  ch1, ch )
        }
      }
    }
    s substring (mb, me)
  }
}

Scala is such a nice language!

But it's also a big language, with a lot of features. Which may or may not be a good thing.

Just look at the funny pattern matching syntax we have there. What we wanted there was to match value "ch" vs. the lambda parameter values "ch1" or "ch0". We use backticks there (`ch1`, `ch0`) in order to actually match with the existing value instead of doing a local binding (again, Lisp's COND style).

What we could also do is declaring intermediate values "val Ch1 = ch1" and "val Ch2 = ch2" and then doing "case Ch1 => ..." and "case Ch0 => ..." See, if its name starts with uppercase, then it also won't try do the local binding, but match the actual values instead... oh, and I won't be surprised if there are about five other ways to do the same thing, which I am not aware about.

Weird.

The destructuring in the function parameters is not supported directly, but one can use a "case" hack to emulate that (I did it for the "fold lambda").

Note that explicit argument types are required for function (and lambda) parameters, unlike in some other statically typed languages with type inference.

The top-level "object" in the code snippet is actually considered a "module" in Scala lingo.



Designed by Don Syme
When 2005
Platform .NET
Inspired by ML-family (OCaml)
Paradigm Functional/OOP hybrid
Typing Static, type inferred
Nesting Significant whitespace
TIOBE rank 15
Fun fact Most OCaml code is valid F#
?module Challenge06 = 
  /// Finds the (last) longest substring that contains 
  /// at most two unique characters
  let longest2UniqueCharSubstring (s:string) = 
    let ((mb, me, _, _), _, _, _) = 
      Seq.fold 
        //  the fold lambda
        (fun ((mb, me, cb, ce), cb1, ch0, ch1) ch -> 
          let ce1 = ce + 1
          let getMax b e = (if (me - mb > e - b) then 
                               (mb, me, b, e) else (b, e, b, e))
          match () with
            | _ when (ch = ch1) -> (getMax cb  ce1, cb1, ch0, ch1)
            | _ when (ch = ch0) -> (getMax cb  ce1, ce,  ch1, ch )
            | _                 -> (getMax cb1 ce1, ce,  ch1, ch ))
        //  initial accumulator value
        ((0, 0, 0, 0), 0, '\000', '\000')
        //  fold sequence argument
        s
    s.Substring(mb, me - mb)

For those people who might want to discard it with the often-heard "bah, it's Microsoft" - I assure you, it's a really cool language.

The syntax is terse thanks to the type inference and indentation rules (for example, unlike in OCaml, there is no need for "in" in the "let bindings in expression" blocks, however things must be properly indented).

I am also using a "COND" hack here, similarly to how I did that with Haskell.



Designed by Rich Hickey
When 2007
Platform JVM, .NET, JavaScript
Inspired by Lisp family, Erlang, Haskell
Paradigm Functional
Typing Dynamic
Nesting Parentheses!!!
TIOBE rank 50
Fun fact You can get contacted by headhunters just by doing 4clojure challenges
(ns challenge06 (:gen-class))
(defn longest2UniqueCharSubstring 
  "Finds the (last) longest substring that contains 
   at most two unique characters"
  [s]
  (let [ 
    [[mb me _ _] _ _]
    (reduce 
        ;; the fold (reduce) lambda
        (fn [[[mb me cb ce] b1 ch0 ch1], ch]
              (let [ ce+    (inc ce)
                     getMax (fn [b e] (if (> (- me mb) (- e b))
                                          [mb me b e] [b e b e]))]
                (cond
                  (= ch ch1) [(getMax cb ce+) b1 ch0 ch1]
                  (= ch ch0) [(getMax cb ce+) ce ch1 ch ]
                  :else      [(getMax b1 ce+) ce ch1 ch ])))
        ;; initial accumulator value           
        [[0 0 0 0] 0 \o0 \o0] 
        ;; fold sequence argument
        s)]
    (subs s mb me)))

Clojure is a modern Lisp.

With smart and well thought-out syntactic modifications which make it more readable than vanilla Lisp (for example, you get square brackets into the mix (so it's not just bazillions of parens anymore)).

Oh, and it has a proper COND :) I know, I could have just used "if/elseif/else" blocks in all the rest of the cases, anyway... but the goal was to explore if we can come up with an equivalent construct in all the languages.

By the way, the ":gen-class" in the namespace declaration was needed in order to force the Clojure compiler to dump the JVM IL (which otherwise is generated on the fly). Normally it's not needed.



Designed by Joe Armstrong
When 1986
Platform Erlang Virtual Machine (BEAM)
Inspired by Prolog, Lisp
Paradigm Functional
Typing Dynamic
Nesting comma-dot, semicolon-"end" blocks
TIOBE rank 47
Fun fact Initially was created for Telecom
-module(challenge06).
-export([longest2UniqueCharSubstring/1]).

%% @doc Finds the (last) longest substring that contains
%% at most two unique characters.
longest2UniqueCharSubstring(S) ->
  {{Mb, Me, _, _}, _, _, _} = lists:foldl(
    % the fold (reduce) lambda
    fun (Ch, {{Mb, Me, Cb, Ce}, Cb1, Ch0, Ch1}) ->
      Ce1 = Ce + 1,
      GetMax = fun
        (B, E) when (Me - Mb > E - B) -> {Mb, Me, B, E};
        (B, E) -> {B, E, B, E}
      end,
      if Ch =:= Ch1 -> {GetMax(Cb,  Ce1), Cb1, Ch0, Ch1};
         Ch =:= Ch0 -> {GetMax(Cb,  Ce1), Ce,  Ch1, Ch };
         true       -> {GetMax(Cb1, Ce1), Ce,  Ch1, Ch }
      end
    end,
    % initial accumulator value
    {{1, 1, 1, 1}, 1, $0, $0}, 
    % fold sequence argument
    S),
  string:sub_string(S, Mb, Me - 1).

This is the oldest and the most production-proven language of all seven (at least at the current moment). It is well-known for its capabilities when creating fault-tolerant, highly distributed systems. Something that we don't really care about right now.

It is also a "functional" programming language at its core. The syntax was inspired by Prolog, which explains those weird commas, semicolons and dots at strange places.

Some observations:

  • Variable names must always start from the uppercase
  • Module exports must be explicitly listed in the "-export" section (specifying the number of function arguments)
  • Array indexing starts from 1
  • Strings are represented as linked(!) lists of integers (which take 8 bytes on 64-bit platforms)

The last one is interesting.

In the first-pass algorithm implementation I was not keeping ch0 and ch1 characters explicitly, but only their indices inside the string (cb and cs correspondingly), looking up the character by indices when required. Imagine my surprise when the Erlang version would just appear to hang instead of producing the result.

What happened is that string being linked lists, the lookup by index is O(N) and the whole algorithm was O(N^2) instead of O(N). Which made it just too slow on my test data.

I actually had an opportunity to ask Joe Armstrong himself about why he made such a design decision. The answer was that "strings don't exist" (by the way, the whole talk on that link was pretty amusing). I kind of understand the sentiment, but still...

Designed by James Strachan
When 2003
Platform JVM
Inspired by Python, Ruby, Perl, Smalltalk
Paradigm OOP, with functional features
Typing Dynamic, partially static
Nesting Curly braces
TIOBE rank 44
Fun fact Most Java code is valid Groovy
class Challenge06 {
    /**
     * Finds the (last) longest substring that contains
     * at most two unique characters
    */
    static def longest2UniqueCharSubstring(String s) {
        // initial accumulator value
        def init = [[0, 0, 0, 0], 0, '\0', '\0']
        def (mpos) = s.inject (init) {acc, ch ->
            // the fold (inject) lambda body
            def (pos, cb1, ch0, ch1) = acc
            def (mb, me, cb, ce) = pos
            def ce1 = ce + 1
            def getMax = {b, e -> (me - mb > e - b) ?
                    [mb, me, b, e] : [b, e, b, e]}
            switch (ch) {
                case ch1: [getMax(cb,  ce1), cb1, ch0, ch1]; break
                case ch0: [getMax(cb,  ce1), ce,  ch1, ch ]; break
                default:  [getMax(cb1, ce1), ce,  ch1, ch ]
            }
        }
        s.substring(mpos[0], mpos[1])
    }
}

This language was created as a "nicer Java", and at times it does feel exactly like that. Fun fact: James Strachan himself once said, that if he knew that Scala will appear, he would not create Groovy, to start with.

Nevertheless, the language is still neat and practical, and I believe is worth looking into. For one thing, there is already a bunch of code written in it for different projects.

There are some quirks, for example nested destructuring in assignment is not supported (and none supported at all in the function parameters), so I had to "flatten" it with multiple reassignments.

The switch/case statement is Java-like, which is kind of lame.



Designed by Jeremy Ashkenas
When 2009
Platform JavaScript
Inspired by Ruby, Python, Haskell
Paradigm Functional, prototype-based
Typing Dynamic
Nesting Significant whitespace
TIOBE rank 170
Fun fact First CoffeeScript compiler was written in Ruby
###*
 * Finds the (last) longest substring that contains
 * at most two unique characters
 ###
longest2UniqueCharSubstring = (s) ->
    [[mb, me]] = s.split('').reduce(
        # the fold (reduce) lambda
        ([[mb, me, cb, ce], cb1, ch0, ch1], ch) ->
            ce1 = ce + 1
            getMax = (b, e) ->
                if (me - mb > e - b) then [mb, me, b, e] else [b, e, b, e]
            switch ch
                when ch1 then [getMax(cb,  ce1), cb1, ch0, ch1]
                when ch0 then [getMax(cb,  ce1), ce,  ch1, ch ]
                else          [getMax(cb1, ce1), ce,  ch1, ch ]
        # initial accumulator value        
        ,[[0, 0, 0, 0], 0, '', ''])
    s.substring(mb, me);

Many people would argue that CoffeeScript is not even a language, but rather a syntax sugar on top of JavaScript. Which is fair. But it's pretty sweet sugar, mind you... just look at that lambda syntax!

Therefore, one could say that it's as a functional programming language as JavaScript is, just with the syntax that makes this "functional programming" nature more prominent.

Some of the features are still lacking, though. For example, there is no tuples per se (it uses arrays instead). Which is also explained by it being just a JavaScript in disguise.

Oh, by the way there is no modules. Well, you could hack them in, but that defeats the purpose, in my opinion.

Fun fact: that interactive JavaScript algorithm demonstration above is actually made from this piece of CoffeeScript code.


UPDATE Jan 13, 2017: Here's a nice, concise introduction to CoffeeScript if you are interested to know more.


So, we're done with our seven languages. A next logical thing to do was to try comparing performance of the different language implementations. Here are the system specifics:

  • The text is Lewis Carrol's "Alice's Adventures in Wonderland" downloaded from Project Gutenberg
  • All the whitespace was removed, and everything was converted to lowercase. After that, it was replicated several times. The resulting file is about ~4Mb
  • The platform was Win7 64-bit, Intel i7 (4 cores), 6Gb RAM
  • Recent (ish) software versions:
    • Scala v2.11, jvm 1.8
    • Haskell GHC v7.6.3 (-O2 flag)
    • F# v2.0, .NET 4, VS 2010
    • Groovy v2.2.2, jvm 1.8
    • Clojure v1.5.1, jvm 1.8
    • Erlang OTP v17
    • CoffeeScript v1.7.1, Node.js v0.10.28

And here's what I've got when running the mentioned code on the mentioned data:


Now, this particular problem is not a very good language benchmark.

Still, it's kind of interesting. Some observations:

  • Somehow Scala beats everyone (24ms)
  • ...but nothing can beat C (15ms). Still, can get pretty close!
  • The generated JVM IL size: Scala is the smallest, Groovy is big (which may explain its sub-par speed), Clojure is even bigger (probably due to the way it gets compiled on the fly)
  • F# translates neatly into C#-esque IL, no fluff
  • JIT performance matters. JVM seems to be better than Erlang's BEAM
  • Erlang strings are inefficient. Cache misses matter.
  • JVM 1.8 vs 1.6: Groovy for some reason is worse in 1.8, Scala better, Clojure is the same
  • "CoffeeScript performance" actually means "JavaScript performance", since all the code translates into JavaScript in a straightforward way before execution

Few more charts, out of the pure amusement.

The languages' age (Erlang being the oldest, CoffeeScript the youngest):

TIOBE popularity rank (the smaller, the more popular):

Github open source rank (the smaller, the more projects are there for this language):

You can check yourself some more interesting graphs on Open HUB. Also, try throwing C++ into the mix, just for kicks ;)

The overall feeling of experimenting with new programming languages turned out to be very exciting and appeared worth sharing. I even presented this to the colleagues at some point, which seemed to have got a very positive feedback.

So I encourage you to try doing something like this, if you have not yet:

  • Come up with a simple (but interesting) problem
  • Pick a programming language you have not used before
  • Do the research
  • Write the program (and possibly share)
  • Reflect
  • Enjoy
Good luck and gospeed!

3 comments :

WTN said...

Do you know how/why the scala runs so quickly? And how the generated JVM bytecode compares to the other language's bytecode?

ctrl said...

In Haskell Strings are linked lists same as in Erlang. The Text and Bytestring libraries are often recommended instead of Strings.

Patrick Prémont said...

I'd be curious to see how much a foldr, or foldl' in Haskell would improve its performance. The foldl is generally disparaged because with laziness it just builds a big stack of thunks, before any call to f is reduced.