Archive for the ‘Design’ Category

How important are the names we give to things when programming? There are quite a few different approaches to this issue. Some people think that calling data items/variables a, b1, c is OK. Fewer people are seen defending the idea that functions should be called f1, f2, f3. Other people think variables should have names connected with their meaning: if you have a data item whose content is related to a person object, maybe that variable should have “person” in its name, like currentPerson, or whatever.

So far, nothing new. I, for one, am in the group that favors explicit naming of everything.

Sometimes there is some impedance (a mismatch) between what the average person expects from a name and the functionality actually exposed. That can be a source of lots of grief. Let's have a look at a concrete example to make things clear.

Clojure has a function named contains?. How do you expect such a function to operate? Let me give a few examples (and if you know how it really operates, try to forget what you know for now).

What do you think would be the result of:

(contains? [3 5] 3)
(contains? [3 5] 1)
(contains? "bla" 1)
(contains? "bla" "a")
(contains? '(1 2) 1)

Does [3 5] contain 3?
Does [3 5] contain 1?
Does “bla” contain 1?
Does “bla” contain “a”?
Does (1 2) contain 1?

Actually (contains? [3 5] 3) is false. Why? Let's have a look at the documentation:

clojure.core/contains?
([coll key])
Returns true if key is present in the given collection, otherwise
returns false.  Note that for numerically indexed collections like
vectors and Java arrays, this tests if the numeric key is within the
range of indexes. [...]

So contains? is looking at the KEYS of a collection. The vector [3 5] has two keys, 0 and 1 (its occupied indexes), so (contains? [3 5] 3) is false while (contains? [3 5] 1) is true. For the same reason (contains? "bla" 1) is true: the string "bla" has three keys, 0, 1 and 2.

By the way, the list (1 2) is a collection, but not a keyed one (there are no keys to look up), so that false is really caused by a type error and not by a missing element.

Note that contains? actually has some utility ;) : it is used to check if a map has a certain key.
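If what you actually want is the “is this value in there?” test, here is a minimal sketch of the difference, using only clojure.core. The helper name value-in? is my own invention for illustration, not something Clojure ships with.

```clojure
;; contains? asks "is this KEY present?". For a vector the keys are its indexes.
(contains? [3 5] 1)     ; => true  (index 1 exists)
(contains? [3 5] 3)     ; => false (there is no index 3)
(contains? {:a 1} :a)   ; => true  (maps are where it shines)

;; Value membership: `some` returns the first truthy result of applying
;; the predicate to each element, and a set can be called as a predicate.
(some #{3} [3 5])       ; => 3 (truthy)
(some #{1} [3 5])       ; => nil

;; Hypothetical helper (the name is an assumption) that also copes with
;; nil/false elements, which the set trick cannot find.
(defn value-in? [coll x]
  (boolean (some #(= x %) coll)))

(value-in? [3 5] 3)     ; => true
(value-in? '(1 2) 7)    ; => false
(value-in? [nil 2] nil) ; => true
```

Note that this is a linear scan over the elements, while a key lookup on a map is not, so the two questions are different in cost as well as in meaning.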

There is a slight impedance between what most people would expect from “containing something” and contains?. It so happens that slight impedances are much worse than big ones, because big ones are so damn obvious that they are easy to spot and normally end up being corrected (imagine a function drawLine that draws a circle: it is a dead obvious problem that one will easily notice and that the developer will most probably correct). Slight impedance problems are more obnoxious:

  • They will confuse newbies (like myself), whose expectations will be aligned with the general meaning of a certain name or action.
  • They will increase the cognitive load of experts, who will need to be aware of the dissonance between the meaning of a certain word in the programming language and its meaning in natural language.
  • They will make reading and understanding a program more difficult.
  • They will be a source of bugs, as developers, even experienced ones, will sometimes forget the local meaning of a word and unconsciously apply the general meaning.

By the way, my personal solution for contains? Well, quite simply: rename it hasKey? and only apply it to maps. Note that this post is not about contains? in particular, but about this “impedance” problem in general.
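To make that suggestion concrete, here is a rough sketch of what such a function could look like. Everything about it is an assumption on my part: the name has-key? (spelled the way Clojure names usually are), the choice of exception, and the wording of the message.

```clojure
;; Hypothetical helper, not part of clojure.core: a key lookup that
;; refuses anything that is not a map instead of answering false.
(defn has-key? [m k]
  (if (map? m)
    (contains? m k)
    (throw (IllegalArgumentException.
             (str "has-key? expects a map, got: " (pr-str m))))))

(has-key? {:name "Ana"} :name)   ; => true
(has-key? {:name "Ana"} :age)    ; => false

;; A vector is rejected loudly rather than being asked about its indexes.
(try
  (has-key? [3 5] 3)
  (catch IllegalArgumentException e :not-a-map))   ; => :not-a-map
```

With a name like this, nobody reading (has-key? m k) would expect it to look at the values.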

I would imagine that some readers might think of this as irrelevant nitpicking. In my subjective and personal point of view, this issue is a potential source of hard-to-find bugs and a waste of mental energy. In fact this blog is called Cognitive _Consonance_ for some reason…


There is, in my view, an inconsistency in the Clojure core API with regard to type checking.

Consider the two functions contains? and even?

contains? returns true if a certain collection has a certain key:

=> (contains? {'a 1} 'a)
true
=> (contains? {'a 1} 'b)
false

If you pass an object which is not a collection, contains? silently returns false.

=> (contains? 1 'a)
false

I.e., a type error is not distinguishable from a collection which does not contain a certain element.

even?, the function to check if a certain number is, well, even, behaves in a completely different fashion:

=> (even? 'a)
java.lang.ClassCastException: clojure.lang.Symbol cannot be cast to java.lang.Number (NO_SOURCE_FILE:0)

A type error on the parameter raises an exception with even?.

From a design philosophy standpoint, I really do not like this inconsistent behavior: for the same type of error (a typical error pattern), the API behaves in clearly different ways.

The two points that are important (but, alas, are not the fundamental issue of this post) are:

  • Core functions should have a coherent and consistent way of dealing with type errors. This is the most important point.
  • Given a consistent way of dealing with type errors, my preference would be for a behavior like that of even? (i.e., throw) and not like that of contains?.

Yes, I do understand that other reasons might have taken precedence (like performance). I still don’t like it.

But the beauty of Clojure is that one can redefine these “core” functions. For instance, I prefer to have

(defn contains? [coll key]
  ;; Delegate to the core version for collections; throw for anything else.
  (if (coll? coll)
    (clojure.core/contains? coll key)
    (throw (ClassCastException.
             (str (type coll) " cannot be cast as collection")))))

[Newbie alert: there might be better ways to design a function like this (suggestions welcome).]

Now, one has to be careful not to import the original contains? into a namespace. Easily done in Clojure:

(ns myuser (:refer-clojure :exclude [contains?]))

This has to be done before the definition of the new contains? (note that the call to the old contains? inside the new one uses its fully qualified name, clojure.core/contains?).
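Since the order matters, here are the pieces put together in the order they have to appear in a file. This is only a sketch: the function body is the one from this post, and the :type-error keyword in the last form is just something I picked to show that the exception really is raised.

```clojure
;; 1. Declare the namespace first, leaving core's contains? out of it.
(ns myuser
  (:refer-clojure :exclude [contains?]))

;; 2. Only now define the replacement (same body as in the post).
(defn contains? [coll key]
  (if (coll? coll)
    (clojure.core/contains? coll key)   ; the original, by its full name
    (throw (ClassCastException.
             (str (type coll) " cannot be cast as collection")))))

;; 3. Inside myuser, the unqualified name now means the strict version.
(contains? {'a 1} 'a)                       ; => true
(contains? {'a 1} 'b)                       ; => false
(try
  (contains? 1 'a)
  (catch ClassCastException e :type-error)) ; => :type-error
```

Code in other namespaces keeps seeing clojure.core/contains? unless it explicitly refers to myuser/contains?, which limits the size of the mine field a bit.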

Of course, redefining core functions (even inside namespaces not seen from outside) is a bit like laying a mine field and probably has to be done with care. Irrespective of that, it is good to have a language which allows one to express oneself with the syntax and semantics that one desires, and not to be constrained by the whims of the original developer.

So far, no big problems found with Clojure (at least until now).


More than 10 years ago I participated in the development of a University IT system (the front- and backend to maintain grades and that sort of stuff). The system was based on a DB2 backend (a very nice database system), with the business code running on a Prolog interpreter (developed in-house) and the web backend being a Java servlet engine (the old JServ, the pre-Tomcat thingy from Apache). Prolog is famed to be slow, and Java (at that point in time) was very slow. Surprise, surprise… the bottleneck was on the DB2 server. Eventually, as the system grew (and the database hardware was beefed up), the bottleneck moved forward to the business and web tiers, but the problem was sorted by just adding more machines: the contention was among a bunch of parallel, independent processes, which could be run on separate machines.

The example above illustrates why the concurrency problem posed by multiple-core CPUs and GPUs might not be all that important:

  1. Many problems are not CPU bound anyway, and even if they are, the bottleneck might be elsewhere. Another example: I am the proud owner of 3 cheap, slow laptops (one being a netbook). For my use case I really don’t need faster applications. I wonder how many users really need more than they already have?
  2. Even if more CPU/GPU power is needed, a loosely coupled model (without much interprocess communication and contention issues) might be enough. This is typically the case of many web apps, which can scale by just adding more computers which run independent processes.

Concurrency, even with modern abstractions, is hard. It should be avoided if possible and it can be avoided in many applications. If it cannot be avoided, maybe a loosely coupled model is enough… Guido van Rossum has a nice take on this issue.

This is important because concurrency is being touted as an important criterion for evaluating languages. Modern functional languages (think Scala and Clojure) are being touted as a better option precisely because they are better at concurrency (both because of functional, “no changing state” programming and because of the availability of libraries implementing nice concurrency paradigms like actors).

When assessing the importance of this issue, I would propose that people ask themselves: “Am I developing computationally intensive software?” and “If I am developing computationally intensive software, can I live with loosely coupled models of computation, preferably processes with no shared memory?”

This is not to say that there are no cases where tightly coupled computing is a good idea. It is just that this complex solution might be overkill for many problems.

I would just like to add that I am not defending my own cause here; in fact, it is quite the opposite. There is actually some content produced here, in the past, on how to tackle concurrent programming:

  1. LOSITAN – A multicore-aware Jython-based (Python for the JVM) Web Start application to do selection detection.
  2. An introductory tutorial on concurrent computing targeting computational biologists – Part 1, 2 and 3

It is interesting to see how different people tackle the ongoing multicore (and GPU) software “revolution”. There are strong philosophical differences on how to develop for these new concurrent architectures. Let's start with the extremes.

The most interesting extreme comes from Guido van Rossum (aka Python's benevolent dictator for life): he suggests that if you want to use the available processing power of multiple cores, you should have separate processes. Let me quote:

[...] doesn’t mean that multiple processes (with judicious use of IPC) aren’t a much better approach to writing apps for multi-CPU boxes than threads.

Just Say No to the combined evils of locking, deadlocks, lock granularity, livelocks, nondeterminism and race conditions.

Some similar arguments are made by the message passing crowd, which seems to be quite happy with a model based on explicit message passing between separate processes.

The fundamental idea here is that shared memory between parallel computing threads can lead to a lot of grief and sorrow, thus it is better if each piece of data memory is the sole property of a single thread. Communication occurs in an explicit form (e.g., message passing among executing code) between threads that do not share anything (other than messages).
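A tiny sketch of that idea, as I understand it: one worker thread owns a running total, and the only way to influence it is to drop messages into its inbox. The names start-summer, inbox and the :stop message are all made up for this illustration, and I am using a plain Java blocking queue rather than any actor library.

```clojure
(import 'java.util.concurrent.LinkedBlockingQueue)

;; The running total lives only inside the worker's loop. No other
;; thread can read or write it; they can only send messages.
(defn start-summer [^LinkedBlockingQueue inbox]
  (future
    (loop [total 0]
      (let [msg (.take inbox)]        ; blocks until a message arrives
        (if (= msg :stop)
          total                       ; final state becomes the result
          (recur (+ total msg)))))))

(def inbox  (LinkedBlockingQueue.))
(def summer (start-summer inbox))

;; The "sender" side: explicit communication, nothing shared but messages.
(doseq [n [1 2 3 4]]
  (.put inbox n))
(.put inbox :stop)

@summer   ; => 10
```

If you run something like this as a script, calling (shutdown-agents) at the very end lets the JVM exit promptly, since futures run on a thread pool that otherwise lingers for a while.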

The opposite idea can be found in the typical C/C++/Fortran, lower-level crowd: one single process, many threads, a single memory space shared among threads, with concurrent access controlled through a low-level mechanism like semaphores. This also seems to be the underlying idea of the OpenMP system. These folks believe that programmers can tackle parallel complexity easily (well, at least it is not an impossible, daunting task according to this philosophy).

The point of contention comes from the fact that multiple execution flows introduce a completely new class of bugs, coming from the need to coordinate a lot of things going on in parallel. The worst problem introduced is non-determinism: you can execute the same program twice, WITH THE SAME INPUT, and get different results. Why? Because the different threads/processes will be scheduled in unpredictable ways by the operating system (or virtual machine), which can yield different results. This severely increases the difficulty of testing and debugging software. The shared memory crowd (the shared memory model is more efficient and flexible as, well, memory is directly shared) will say that we can deal with this. The message passing crowd suggests that having some restrictions and explicit communication will make life easier (or less complicated).
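Here is a small sketch of that non-determinism, under the assumption that a shared log appended to by two threads is a fair stand-in for “program state”. The function name interleaved-log is invented for the example. An atom keeps every individual append safe, so nothing is lost; what the scheduler decides is the order.

```clojure
(defn interleaved-log
  "Two threads each append their tag n times to a shared log."
  [n]
  (let [log    (atom [])
        worker (fn [tag] (future (dotimes [_ n] (swap! log conj tag))))
        a      (worker :a)
        b      (worker :b)]
    @a @b     ; wait for both threads to finish
    @log))

(def run-1 (interleaved-log 1000))
(def run-2 (interleaved-log 1000))

(count run-1)         ; => 2000
(frequencies run-1)   ; :a and :b each appear exactly 1000 times, always
(= run-1 run-2)       ; may well be false: same input, different interleaving
```

A test that compares the whole log against an expected value can pass on one run and fail on the next, which is exactly why this kind of bug is so hard to pin down.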

The Java crowd is where you can find the greatest variety of opinions, but the core JVM and the Java language itself seem to follow the C/C++ philosophy (though with some candy thrown in, like the Fork/Join framework). But on top of that you can find everything with a vocal support community: tuple spaces, Map/Reduce, message passing, etc. This is not to say that the Python and C/C++ communities are monolithic (they are not! Just check the C implementations of MPI and PVM), but you really can find a lot of alternatives with vibrant communities on top of the JVM.

A sort of middle-ground approach was introduced de facto with the programming language Erlang: Erlang allows for multiple threads (lightweight ones, which Erlang itself calls processes), but the communication is shared-nothing and based on message passing. I.e., while there is one single process with multiple threads, there is no shared memory per se and all inter-thread communication is based on message passing. This actor-model-based language has influenced some recent language libraries in Scala, Groovy and Clojure, among others, where the actor model is the main concurrent programming model.

Many proponents of functional languages (like Erlang, Scala and Clojure) also suggest that mutability (i.e., the concept of a variable stemming from imperative languages like C, Java, C#, Basic, C++, 99% of the languages in use) is not easily amenable to parallel programming, and that immutable data structures make life much easier: if what is shared cannot be changed, then far fewer bugs can be introduced.
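A minimal illustration of that last claim, using Clojure's built-in vectors (the names readings, total and biggest are just placeholders I made up): “changing” a vector produces a new value, so a thread holding the old one can never see it move under its feet.

```clojure
(def readings [1 2 3])

;; conj does not modify readings; it returns a new vector.
(def more-readings (conj readings 4))

readings        ; => [1 2 3]
more-readings   ; => [1 2 3 4]

;; Two threads read the same vector at the same time. Since nobody can
;; change it, there is nothing to lock and nothing to race on.
(def total   (future (reduce + readings)))
(def biggest (future (apply max readings)))

[@total @biggest]   ; => [6 3]
```

Coordination is still needed when threads must agree on which value is the current one, but the values themselves are safe to hand around.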

To sum it up: Some people suggest concurrent programming is difficult and it is better to minimize communication to tackle that difficulty. Others suggest that concurrent programming is workable and tightly-coupled memory-sharing systems are OK. Some also suggest (functional crowd) that immutable data structures help.

Further reading:
Concurrent computing (Wikipedia)
Scala actors – My preferred introduction to Actors (which happens to be based on Scala)
Erlang Concurrency Message passing (Wikipedia)

My opinion: Shared memory models are for real men! I am just a regular bloke, so I stick with message passing models. The complexity of bugs introduced by concurrent programming is much much worse compared to the existing sequential paradigm. In most of the cases that I have encountered, the restrictions imposed by message passing are acceptable compared to the benefits. Even with message passing and immutable data structures, concurrent programming is still very hard and bug prone (non-determinism is still quite possible with message passing). I expect (hope) that new R&D will allow us to tame this complexity. Avoid shared memory/tightly coupled systems like the plague!


First a personal note: I did not write (or do any “things on the Internet”) for the best part of 2008. Although part of it was due to a busy schedule, most of it was due to illness (being obsessive-compulsive has some strange consequences). I finally decided to tackle my health issue (which is solvable, at least in my case).

Anyway, I’m still working in computational biology, still working with malaria, and I am still working with Groovy. So… Let's get back to the usual topics…

Before I start, a caveat: “Over-engineering”, as used below, should not be seen as scornful. We all know that traditional OO languages and libraries try to be as general-purpose as possible and deployable in industrial software processes. In that setting, languages and libraries which present themselves in a typical OO setting are, understandably, “over-engineered”.

The so-called scripting languages (for lack of a better word, let's stick to it) are supposedly more productive than traditional languages (especially in small to medium size projects). Languages like Java are “over-engineered” beasts, seen as general-purpose, “industrial”, heavy-duty. Our beloved scripting languages fit our brain: they are agile, we can be highly productive, write fewer lines of code, accomplish more, be more declarative.

Can we really? Let's consider a subset of those languages, those like Groovy or Scala which were developed for the JVM (or all languages that were ported to the JVM). One of the pluses of these languages is that they can use the whole JVM ecology of libraries. The problem is that most of those libraries are developed with a Java mentality (i.e., they are over-engineered). An example:

The fantastic JFreeChart library produces high-quality 2D charts of all kinds. It has all the flexibility that we expect from the typical Java library: you can do everything. In the Groovy landscape there is also a Builder for it, groovychart, of which I am a minor author. But whenever I want to plot a chart, my first impulse is to use the (also great) matplotlib (CPython based). Why? Because plotting a line chart in matplotlib takes 3 lines, which I remember without going to the documentation:

from pylab import *
plot([1,2,3])
show()

Really, it worked at the first attempt.

In groovychart? I am not even sure of the whole process, but it involves starting Swing, preparing the dataset, choosing the chart, … And again, I am one of the authors, I do groovycharts every day, but I still need to go to a template to do something that takes 3 lines in matplotlib. Matplotlib fits my brain, groovychart doesn’t.

While having all the Java libraries at hand is obviously a good thing, there needs to be a “scriptization” of many of those libraries. There is a need for interfaces that “fit the brain”. Groovy + JVM libraries only tackles half of the problem. Even JVM libraries with a Groovy idiom (like groovychart) don’t address the “over-engineering” problem. What is needed, in my view, are wrappers which are not only Groovy-idiomatic but also Groovy-philosophical: they fit the brain (wrappers which allow one to plot a simple line chart in 3 lines).

This can actually be seen in Groovy itself for IO and many data structures: some core Java libraries are well covered and are already available in a “fit-your-brain” interface. Hopefully we will see more interfaces like this for many existing libraries (and fewer like groovychart, which are only idiomatic wrappers).

PS – Another way to tackle “over-engineering” problems comes from good IDEs and, in fact, the average modern Java IDE goes to great lengths in reducing “over-burden”.
