DSLs: specification and behavior

One of the interesting applications of a DSL is its inherent ability to separate an abstract (domain-level) specification from its possible applications. Let's make this a bit more concrete with an example (taken from my malaria domain).

As is becoming a pattern in my recent posts, I start with a smallish explanation of the biological and pharmacological background and then go deep into the technical DSL/Groovy design and implementation part.

Antimalarial drugs have effects on parasites (the desired effect being the killing of lots of parasites). Roughly speaking, a malaria infection can be seen as a progression in time of parasite loads: parasites are multiplying (growing) and this growth is balanced by both the natural response of the human immune system and the effect of the drugs taken (which goes by the name of pharmacokinetics – PK). Malaria parasite loads in humans can go up to 10^12 (10 to the power of 12, no typo).

PK is modeled by a function (I won’t go into details here) which is parametrized by drug concentration and parasite response (resistant parasites tolerate drugs better). As an example for Chloroquine in Groovy:

formula: {3.8 / (1 + 1/K + CQ)}

This (for now) magic formula, represented as a closure, has 2 parameters: 1/K, which is 68 micrograms/liter for non-resistant parasites, and CQ, which is the concentration of drug in the blood.

This is the specification of the problem. Now, what do we do with this formula? The obvious response is to use it to do calculations (i.e. given a certain drug concentration, what is the value of the PK function?). But in reality we might want to do many other things with it, like generating documentation (say, by creating a Word or LaTeX document) or converting this formula into a faster language (e.g. Fortran) for simulation purposes. I actually do both things.

So, one thing is the formula as a specification. Another thing is what you do with it. And we can do truckloads of different things with this specification.

Let's see how we could do some of the different tasks described above:

Calculating the value of the function

Let's imagine that we want to print the values of the function for concentrations from 1 to 1800 (1800 ng/mL being a reported maximum concentration of Chloroquine in the blood). The solution could be:

//formula is a closure with the formula
formula.K = 1/68.0 //We set the fixed 1/K parameter
(1..1800).each { concentration ->
    formula.CQ = concentration  //Varying CQ concentration
    println formula() //Execute closure
}
//In the example above

So, in this approach we take the closure, set the parameters (setting closure properties in Groovy is very simple as the example above shows), and execute the closure repeatedly.

I actually think that this example is of the worst kind possible, because it blends specification with execution. That is, we specify our effects formula without any behavior and then we take the specification and execute it. So we are tying specification and behavior together. Pedagogical and philosophical considerations aside, this works OK, is easy to code and is efficient.

Generating Fortran code

The formula above is also used to generate Fortran code with the formula representation, which is plugged into a malaria epidemiology simulator. In that case executing the closure with arithmetic semantics is useless, so another strategy has to be used.

The current solution gets the code AST representation through the meta class. Before I present the solution, I will show the full representation of the (slightly altered) formula and effect:

cqEffect = effect(
    name:       "General Chloroquine effect",
    formula:    {3.8 / (1 + km1/cq) },
    parameters: [km1: 68.0] //Hoshen98 microg/l
)
//effect creates an Effect object

(So km1 is a fixed parameter for the effect and cq – drug concentration – is variable).
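To make the fixed/variable split concrete, here is a minimal sketch of how such a formula could be evaluated without touching script-level variables. The helper name `evaluate` is hypothetical (it is not part of the DSL shown in this post); it relies only on the standard closure delegate mechanism, where a map delegate supplies the free variables of the closure.

```groovy
// The specification: a formula with free variables km1 and cq.
def formula = { 3.8 / (1 + km1/cq) }
def parameters = [km1: 68.0]

// Hypothetical helper: binds the fixed parameters plus one concentration
// to a copy of the closure and runs it.
def evaluate = { Closure f, Map fixed, Number concentration ->
    def copy = f.clone()                          // leave the original untouched
    copy.delegate = fixed + [cq: concentration]   // free variables come from this map
    copy.resolveStrategy = Closure.DELEGATE_FIRST
    copy()
}

println evaluate(formula, parameters, 68.0)
```

Cloning the closure keeps the specification itself free of state, so the same formula object can be reused with a different parameter map (for instance one with a larger km1 for resistant parasites).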

The Effect object has a property called code, which holds the Abstract Syntax Tree (AST) for the formula. The AST is accessed in the Effect constructor in this way:

this.code = formula.getMetaClass().getClassNode().getMethods("doCall")[0].code

Short story: it gets the meta class for the closure, gets the closure class AST, and then gets the AST for the code of the method doCall, which holds the formula code for the closure. Whew, big, long train.

Caveat: Because Groovy is compiled, and for memory and performance reasons, sometimes getClassNode might return null :( . If that happens to you, google for “getClassNode groovy”, as that issue is out of the scope of this post (I could get around this in my cases, up to now).
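One possible way to sidestep the null class node, sketched here under the assumption that the formula is available as source text rather than as an already compiled closure (which is not how the Effect object above works), is to let Groovy's AstBuilder parse the text directly:

```groovy
import org.codehaus.groovy.ast.builder.AstBuilder
import org.codehaus.groovy.ast.expr.BinaryExpression
import org.codehaus.groovy.control.CompilePhase

// Assumption: the formula is kept as a string, not as a closure.
def source = "3.8 / (1 + km1/cq)"

// buildFromString returns a list of AST nodes; with statementsOnly = true
// the first node is the block of statements for the script body.
def nodes = new AstBuilder().buildFromString(CompilePhase.CONVERSION, true, source)
def block = nodes[0]

// Same navigation as for the closure: first statement, then its expression.
def expression = block.statements[0].expression

println expression.class.simpleName
println expression.operation.text
```

The trade-off is that a string is no longer checked by the compiler when the DSL script is loaded, so this is a fallback rather than a replacement for the closure-based specification.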

So, now we have to traverse the AST. In the most general case, this would mean creating a full interpreter for the Groovy AST, a breathtaking task (but a good way to learn all about Groovy ;) ). In our malaria case we will only process arithmetic expressions (and if constructs, but I will not discuss that here for brevity reasons), so we expect the users of our DSL to be careful and pass just an arithmetic expression. As such the formula is a block of statements which happens to have only a single statement composed of an arithmetic formula:

def expression = it.code.getStatements()[0].getExpression()
println expression

The first line traverses the AST to get the formula. It only works because the closure code is of the form defined above (a single arithmetic formula). println results in:

org.codehaus.groovy.ast.expr.BinaryExpression@186d484[
  ConstantExpression[3.8]
  ("/" at 22:22:  "/")
  org.codehaus.groovy.ast.expr.BinaryExpression@ea48be[
    ConstantExpression[1]
    ("+" at 22:27:  "+" )
    org.codehaus.groovy.ast.expr.BinaryExpression@14dd758[
      org.codehaus.groovy.ast.expr.VariableExpression@174d93a[variable: km1]
      ("/" at 22:32:  "/" )
      org.codehaus.groovy.ast.expr.VariableExpression@61a907[variable: cq]]]]

Although it looks dreadful at first, a second inspection shows that we have what we need.

A vanilla expression processor for the AST above could be:

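import org.codehaus.groovy.ast.expr.BinaryExpression
import org.codehaus.groovy.ast.expr.ConstantExpression
import org.codehaus.groovy.ast.expr.VariableExpression

// Recursively renders an arithmetic AST as a fully parenthesized string
def drillExpression
drillExpression = { expr ->
    switch (expr.class) {
        case BinaryExpression:
            return "(" + drillExpression(expr.leftExpression) + ")" +
                     expr.operation.text +
                     "(" + drillExpression(expr.rightExpression) + ")"
        case ConstantExpression:
        case VariableExpression:
            return expr.text
        default: return "" //Any other node type is silently dropped
    }
}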

This would return the string: “(3.8)/((1)+((km1)/(cq)))”

From here I think it is quite easy to see how one could take an expression and convert it to LaTeX or Fortran code (the remaining work is really just LaTeX/Fortran syntax).
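As an illustration of the LaTeX case, here is a minimal sketch. The name toLatex is hypothetical, and AstBuilder.buildFromString is used only to keep the example self-contained; it is not how the Effect object obtains its AST. The only LaTeX-specific decision is to render division as a fraction.

```groovy
import org.codehaus.groovy.ast.builder.AstBuilder
import org.codehaus.groovy.ast.expr.BinaryExpression
import org.codehaus.groovy.ast.expr.ConstantExpression
import org.codehaus.groovy.ast.expr.VariableExpression
import org.codehaus.groovy.control.CompilePhase

// Hypothetical renderer: arithmetic AST in, LaTeX source out.
def toLatex
toLatex = { expr ->
    switch (expr) {
        case BinaryExpression:
            def left = toLatex(expr.leftExpression)
            def right = toLatex(expr.rightExpression)
            if (expr.operation.text == "/") {
                // A fraction needs no parentheses around its operands.
                return "\\frac{" + left + "}{" + right + "}"
            }
            return left + " " + expr.operation.text + " " + right
        case ConstantExpression:
        case VariableExpression:
            return expr.text
        default:
            throw new IllegalArgumentException("Unsupported node: " + expr.class.simpleName)
    }
}

// Obtain an AST for the formula text (assumption: the source is available as a string).
def block = new AstBuilder().buildFromString(CompilePhase.CONVERSION, true, "3.8 / (1 + km1/cq)")[0]
def latex = toLatex(block.statements[0].expression)
println latex
```

This sketch drops the parentheses of the source, which is safe for the formula above because fractions group their operands, but an expression such as (a + b) * c would need explicit parenthesis handling. A Fortran emitter would follow the same shape with different string templates.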

There are 2 drawbacks to this approach: it requires work to do the AST traversal, and supporting all AST node types would be a daunting amount of work. At least in my malaria case the amount of work required is very manageable.

A completely different strategy would be to Monkey Patch numbers (i.e. massively alter the definition of the classes) and variables in a radical way: not to produce arithmetic results but to, say, generate LaTeX sources. That is probably possible, but it would be one of the worst examples of monkey patching that I could think of. Monkey business indeed!

There is also the Groovy code visitor pattern, which I did not explore… It would probably be a variation of the AST traversal strategy presented here.
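For reference, a minimal sketch of what the visitor variant could look like. It subclasses CodeVisitorSupport, which already knows how to walk every node type, and overrides only the callback of interest. The class name VariableCollector is hypothetical, and the AST is again built from a string only to keep the example self-contained.

```groovy
import org.codehaus.groovy.ast.CodeVisitorSupport
import org.codehaus.groovy.ast.builder.AstBuilder
import org.codehaus.groovy.ast.expr.VariableExpression
import org.codehaus.groovy.control.CompilePhase

// Hypothetical visitor: collects the names of all variables used in a formula.
class VariableCollector extends CodeVisitorSupport {
    List<String> names = []

    @Override
    void visitVariableExpression(VariableExpression expression) {
        names << expression.name
        super.visitVariableExpression(expression)
    }
}

def block = new AstBuilder().buildFromString(CompilePhase.CONVERSION, true, "3.8 / (1 + km1/cq)")[0]
def collector = new VariableCollector()
block.visit(collector)   // the base class drives the traversal
println collector.names
```

Knowing which variables a formula uses is the kind of information a code generator needs in order to declare the arguments of a generated function.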

Filed in: bioinformatics, declarative programming, groovy, malaria

by: tiago

2 Comments

Chloroquine malaria treatment and Groovy (DSL tactics in Groovy 2)

Chloroquine was, for many years, the workhorse against P. falciparum malaria. Around the fifties (give or take a decade) resistance appeared in Cambodia and spread around the globe (if my memory serves me right there are at most 4 independent sources of malaria Chloroquine (CQ) resistance, the Cambodian one being the first to appear). Currently CQ clinical efficacy is deemed too low and CQ use is frowned upon. CQ is extremely cheap, therefore economically sustainable in Africa. The more current Artemisinin (ART) based drugs (ART, a short lived drug commonly used in combination with other – longer lived – drugs) are too expensive for most countries where malaria is a public health threat (thus requiring subsidies from external sources).

CQ is still used as a first line drug at least in Guinea-Bissau (on Google Scholar, search for “kofoed bissau chloroquine”), even in the presence of resistance. A change of drug regimen (i.e. how the drug is used) seems to make its clinical efficacy go up without increasing the spread of resistance. This is interesting from both a theoretical and a practical point of view (being able to reuse CQ would be great given its price and wide availability). This is roughly the scope of my current theoretical study.

I am developing a Groovy model to specify CQ resistance. The fundamental concepts are:

On the drug side there are Compounds (e.g., Chloroquine) and Drugs (a drug is composed of one or more compounds, for instance, the widely used SP is composed of Sulfadoxine and Pyrimethamine. Chloroquine (as a drug) is composed of… Chloroquine – A single compound drug).

On the parasite side there are enzyme (protein) mutations. A mutation might help the parasite in tolerating a certain drug.

So here is my current piece of Groovy code to model CQ resistance:

 1  cq = compound(name: "Chloroquine", abbreviation: "cq", halfLife: 45.d)
 2
 3  CQ = drug(name: "Chloroquine", abbreviation: "CQ")
 4  CQ.includes cpd: cq, qty: 300.mg, bioavail: 1.2
 5
 6  regimen = regimen()
 7  regimen.take drug: CQ, qty: 2, at: 0.h
 8  regimen.take drug: CQ, qty: 1, at: 6.h
 9  regimen.take drug: CQ, qty: 1, at: 1.d
10  regimen.take drug: CQ, qty: 1, at: 2.d
11
12  CRT = protein("CRT")
13  CRT.mutatingAmino 76, Lys, Thr
14
15  cqEffect = effect(
16      name:       "General",
17      formula:    {3.8 / (1 + km1/cq) },
18      parameters: [km1: 68.0]
19  )
20
21  cqResistance = resistance(
22      effect:     cqEffect,
23      mutations:  [CRT.mutation(76)],
24      parameters: [km1: 204.0]
25  )

Chloroquine has a terminal half life (roughly the time that the body takes to eliminate half of the drug concentration) of 45 days (line 1). Actually, it is quite difficult to estimate half lives (and they vary from case to case). CQ is estimated to be between 1 and 2 months (extremely long).

A typical CQ pill has 300 mg of the substance (line 4).
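For readers wondering how suffixes like 45.d or 300.mg can work at all, one possible mechanism is to add property getters to Integer through its meta class. This is only a sketch: the normalisation to hours and milligrams is an assumption made for illustration, and the actual DSL may use a different mechanism (categories, for instance) or richer unit objects.

```groovy
// Hypothetical convention: times are normalised to hours, masses to milligrams.
Integer.metaClass.getH  = { -> delegate }        // 6.h    -> 6 hours
Integer.metaClass.getD  = { -> delegate * 24 }   // 45.d   -> 1080 hours
Integer.metaClass.getMg = { -> delegate }        // 300.mg -> 300 milligrams

def halfLife = 45.d
def dose = 300.mg
println "half-life: ${halfLife} h, dose: ${dose} mg"
```

Returning plain numbers keeps the sketch short; returning small value objects that remember their unit would make unit mismatches detectable.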

A possible CQ regimen for an adult is 2 pills at the start, 1 pill 6 hours later, and 1 pill on each of the following two days (lines 6-10).
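The regimen lines rely on two Groovy features: named arguments are collected into a single Map, and parentheses can be omitted on a top-level method call. A minimal sketch of a receiving class follows. Regimen and Dose here are hypothetical stand-ins written for this illustration, not the classes of the real model, and the drug is a plain string rather than a Drug object.

```groovy
// Hypothetical stand-ins for the model classes.
class Dose {
    String drug
    int qty
    Number at   // hours since the first intake
}

class Regimen {
    List<Dose> doses = []

    // Named arguments (drug:, qty:, at:) arrive as one Map.
    def take(Map args) {
        doses << new Dose(drug: args.drug, qty: args.qty, at: args.at)
        this
    }
}

// Assumed unit suffixes, normalised to hours.
Integer.metaClass.getH = { -> delegate }
Integer.metaClass.getD = { -> delegate * 24 }

def regimen = new Regimen()
regimen.take drug: "CQ", qty: 2, at: 0.h
regimen.take drug: "CQ", qty: 1, at: 6.h
regimen.take drug: "CQ", qty: 1, at: 1.d
regimen.take drug: "CQ", qty: 1, at: 2.d

println regimen.doses*.at
println regimen.doses*.qty.sum()
```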

Resistance is related, among many other things, to codon 76 of the CRT (Chloroquine resistance transporter) protein (lines 12-13).

Looking at the code up to line 13, I would say that it is pretty readable and an elegant representation of the problem. From line 13 onwards I think the same holds, but for now I will not discuss pharmacokinetics (I also refrained from explaining the simplistic bioavailability parameter on line 4).

In the next posts I will concentrate on line 17, a formula for the pharmacokinetics (PK is mainly the killing effect of the drug on the parasite) of CQ. Sometimes I will be more of a computer geek and concentrate on the Groovy side of things, sometimes I will discuss more the underlying biology and pharmacology.

By the way, and going in the geek direction, why do optional parentheses become mandatory inside a list? I.e., I can do

DHFR.mutation 108

But I need parentheses here:

[DHFR.mutation(108)]

The same seems to happen when calling functions scoped inside a script (in the DSL example above, line 1 requires parentheses).

By the way, that DHFR thingy above? DHFR is an enzyme involved in malarial resistance to SP, the other widely deployed cheap drug. SP acts in a less obvious way, and that will require changes to the DSL (to have relationships among effects), but that is further down the road.

Appendix:

One interesting Scala syntactic goodie that Groovy could plagiarize is this:

import org.jfree.chart.plot.{PlotOrientation, XYPlot}

From the snippet above you might infer that charts will be appearing in future posts ;)

Filed in: bioinformatics, declarative programming, groovy, malaria

by: tiago

3 Comments

Malarial drugs and the economics of (human) languages

There is an interesting lack of precision, to the point of “error”, in the way some concepts are dealt with by human language.

Take, for instance, the concept of drug half-life, i.e. the time that it takes for the concentration of a drug to drop to half (drug concentrations in the blood are normally modeled through exponential decay). It is conceived as a property of the drug (people say that drug D has a half-life of H hours), but it is really a property of both drugs and individuals (actually it is much more complicated than that, but the same argument could be repeated).

And no, this has not only to do with statistical deviations that are acceptably approached by the drug only.

As an example, there is a study about the pharmacokinetic properties of Sulfadoxine-Pyrimethamine (a widely used cheap antimalarial). In this study, there is a big deviation in half-life (and other parameters) for children between 2 and 5 years. The study concludes that “dose recommendations need revision” for that group. To put it another way, half-life (and other parameters) is not (only) a function of the drug.

Now, I am not suggesting that the concept of half-life tied just to the drug should be thrown away. I am just speculating why it is framed as a function of the drug only, as clearly that is not the case.

First there is probably historical inertia: the concept was first framed that way at a time when it seemed that half-life was only dependent on the drug, and it stuck by “memetic” inertia.

But, much more importantly, it is still there because it is both less expensive (it is easier to express half-life as a function of just the drug than as a function that also includes other parameters, which might still be crucial in some situations) and still meaningful enough in many contexts (for instance, expressed as a function of the drug it is still useful to compare the half-life of Artemether – short – against Sulfadoxine – long – for many kinds of reasoning). Even when the most economical concept entails some errors it might still be practical. The problem only arises when its simplicity has bad consequences (in this case, having wrong drug doses)… but, in certain contexts, it might be a problem, a serious problem (see my previous text about the notions of resistance, tolerance and sensitiveness for an example).

It all depends on the discourse context, but one should be careful.

As an anecdotal example: if you are seriously ill and a doctor prescribes you a pill, do you prefer to hear “this will cure you” or “this will drop the parasite load at a rate of 1 order of magnitude per hour starting 3 (90% CI of 2.5 – 3.5) hours after intake. Parasite load is expected to drop to 0 in 10 hours”?

The problem arises when the cognitive bias of the simplicity of “this will cure you” gets into more rigorous contexts.

This has implications for the computational modeling of concepts. The tradition in computer science is to “dig down” to the “real meaning” of concepts. In that sense simpler explanations are deemed “wrong” (and should be rewritten in terms of “correct” conceptualizations). Maybe a different strategy is needed, one that brings some linguistic and cognitive economy to computational systems (while still maintaining rigorous and precise reasoning and conceptualization when that is needed – like human languages can do).

I am going to stop here, but I think that one of the problems that impairs mathematical modeling is the application of the “certainty of numbers and formulas” to non-rigorous concepts. Then you have the worst of both worlds: an authoritarian argument (mathematics is a foundation for authority. “The numbers prove it”) based on modeling vague, imprecise and wrong concepts. But that is a topic for another post.

Filed in: bioinformatics, cognition, malaria, science

by: tiago

No Comments

Holy Grail: The quest for THE programming language

Being a computer scientist with a strong interest in languages (languages in the broadest sense possible: programming, natural and cognition related issues), I am on a holy grail quest for a programming language that:

First and foremost allows me to express my computations in a way that is close to the problem domain (as opposed to close to the machine). As I am working in a biology setting that means being able to talk about concepts around genes, epidemics and pharmacology in my programs. I don’t want to think about CPUs, memories and things like that when I am coding. Prolog and Lisp are good examples here. I also need programs that can evolve over time as knowledge changes, I need strong metaprogramming and Domain Specific Language facilities.

Unfortunately I have a couple more requirements coming from the day to day reality…

Real world: I want a language that interacts with existing libraries and that I can easily make available to other people to use, inspect and change. I need Bio* libraries and graphics plotting libraries. In my personal case I decided that I want to work inside the JVM, so I need a language that works in the Java world (Jython, JRuby, Scala, Groovy, … Java).

Software engineering: Programs have to be easy to maintain and debug. I guess there is no way around explicit typing on the debug and tool construction front.

Ridiculous religious fanatic quest? Yes, it might be, but I am pursuing it.

The truth is that we are not far away from this grail.

Scala is almost there. It lacks metaprogramming, and things like type inference are a bit amateurish (compare it with CAML).

JRuby is maybe there, I could live with it, I guess. The lack of explicit typing will make things difficult in the long run on the software engineering front.

I decided to give a final try to yet another language: Groovy, and up to now it is going very OK. Seems to nail all the fundamental points. I especially love the effort on good metaprogramming facilities.

I decided, for pragmatic reasons, that after this one I will stop my pursuit for the grail. If Groovy proves a blunder of some sorts I will revert to JRuby and carry on.

Filed in: bioinformatics, declarative programming, groovy, metaprogramming, science, software engineering

by: tiago

6 Comments

Bio.PopGen

I am currently developing the Biopython module for population genetics and genomics (by the way, you are invited both to help with the development and to make suggestions – maybe based on your needs – for new features).

In the current (1.44) version of Biopython, a GenePop parser and code to deal with FDist (an Fst outlier method for selection detection) are available.

It is my pleasure to announce that coalescent simulation (in the form of support for the SimCoal2 simulator) is currently available on CVS and will probably be out in the next public version. This includes code, test code and DOCUMENTATION. This means you can now do coalescent simulations from inside Biopython (many demographies and markers supported).

Future plans for Bio.PopGen include statistics (the meat of the module, actually) and HapMap support, among others.

Need any feature? Just ask. I cannot promise it, but I will try to address user requests as much as possible.

Filed in: Python, bioinformatics, biopython, population genetics

by: tiago

No Comments

Automated GUIs for OO models and DSLs

One of the most delightful things in bioinformatics is the possibility of working with people with really different mindsets. Surely CS geeks are amazing, and every day I feel that my original background is really a comparative advantage, but, from where I look, nothing beats being in an environment with scientific and cultural diversity. But let's talk some geekiness now:

A couple of years ago, I did a population genetics simulator in Caml. It was really flexible, allowing for many demographic and genomic scenarios, mating rules, selection… really flexible. I never got to try to publish it because there are many good simulators around (I suggest simuPOP, if you are looking for one) and it would take some time to make it robust and documented for public exposure. But, the interesting part is, when I went to my MSc supervisor (an “old-type” biologist) and after a very exuberant explanation on how flexible the simulator was, he added only one comment: That is all very well and good, but you did not show me the easy to use graphical interface!

Fast forward a couple of years… With regard to a DSL to model drug resistance in the context of infectious diseases that I am developing, I went to my PhD advisor (a population geneticist, malariologist and biostatistician who knows how to program in C), showed him my rough prototype and he said: People will be able to read this, but, to interact they will want an easy-to-use graphical user interface. To be honest, this time I was expecting the comment (I have been living in the middle of experimentalists long enough to have learned something). I have no expectations, for my DSL, that domain specialists will write it (well, maybe a couple of them will, if things pick up). If I end up giving my system away to domain specialists, it will have to have an easy-to-use interface, there is no escaping from that.

Well, DSLs (at least in Scala and in Ruby) have an underlying OO model which, most of the time, is neither complex nor big. I am starting to suspect that it won’t be too difficult to automatically generate an easy-to-use interface to input in a “nice” way what could be rendered as DSL programs (or object instances and relationships, if you prefer to look at it that way). For embedded DSLs, which have the whole expressive power of the host language available, that would be unfeasible to do completely. But at least part of it could be automated. Obviously this idea is not new at all; it is just a rehash of what Lift or Rails do for databases.

I am aware that graphical programming languages never went too far (I actually dislike them), but the scope and context here are completely different, different premises apply. This might be one way of lowering the barrier to rigorous modeling to a wider crowd.

Filed in: Ruby, Scala, bioinformatics, declarative programming, science, software engineering

by: tiago

No Comments

Biopython’s population genetics module

I would like to make a preemptive defensive comment on the new population genetics module. ;)

I am, for now, the sole author of the code that is there (although in future versions there will be code from at least one other person. By the way, if YOU want to participate, you're 100% welcome). Although the code is mine, there was a lot of help from Peter Cock, one of Biopython’s core developers. Without him, this initial groundwork would not have been possible.

Now for the preemptive defense :

If you look at the module, it has very little functionality included. This is a very deliberate strategy to start small and grow slowly. I am expecting some feedback (which will be very little, I am sure). I want to grow in small steps, including as much feedback as possible. Test code and documentation have to exist before releasing anything to the public.

In the pipeline there is code for coalescent simulation, statistics (including code supplied by Ralph Haygood, that I am joining with my own) and HapMap. If you are interested in early access to any of this code, please give me a shout as most of it already exists. Alpha testers are more than welcome ;) .

Filed in: Python, bioinformatics, biopython

by: tiago

No Comments

Modeling drugs in Scala

I am currently trying to model antimalarial drug behavior in order to understand the spread of drug resistant malaria. Generally speaking, malaria strains are more or less tolerant to a drug depending on the quantity of drug that is necessary to kill an infection. In theory, a totally resistant infection will survive any treatment, a totally susceptible one will only require small levels of drug to be cleaned.

I see the word drug used in two different ways (for the readers of this blog that are specialists, in some form, on issues regarding drugs, particularly pharmacokinetics, if you see any thing particularly wrong, please do inform me): For instance, SP (Fansidar) is a drug, composed of two drugs (Sulfadoxine and Pyrimethamine). I will use drug for SP and compound for S and P (as active compound seems to be used).

Antimalarial drugs work mainly in the blood stream against asexual parasite forms.

In the blood, compounds have a certain concentration. With time, the body gets rid of compounds (thus the concentration of a compound goes down with time). The concentration of compounds is normally (but not always) modeled using an exponential decay function, the fundamental parameter being the half-life, i.e. the time that it takes for the concentration of a compound to drop to half.
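Written out, and assuming simple first-order elimination with no further intake, the exponential decay model reads as follows. The symbols are notation introduced here for illustration: C_0 is the concentration at time zero, t_{1/2} is the half-life and k is the corresponding elimination rate constant.

```latex
C(t) = C_0 \cdot 2^{-t / t_{1/2}} = C_0 \, e^{-k t}, \qquad k = \frac{\ln 2}{t_{1/2}}
```

So after one half-life the concentration is C_0/2, and after two half-lives it is C_0/4.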

Two other important concepts for drugs that are not taken intravenously (like cheap antimalarials which are oral), are

  1. Bioavailability, i.e. the fraction of the compound that actually reaches the circulation. It seems that one of the problems with counterfeit drugs is low bioavailability. Bioavailability is normally discussed in terms of AUC (Area Under the Curve, the curve being the plot of drug concentration against time). I will model it in terms of maximum concentration, half-life and the time it takes to reach maximum concentration in the blood, which by the way is the next concept…
  2. The time it takes to reach maximum concentration in the blood, i.e. the time from ingestion to circulation in the blood at maximum concentration. I suppose this time frame has a technical name, but I don’t know it (if you know, drop me an email or comment, please).

Now, back to computational modeling:

A big objective is declarative programming. Preferably a program that can be read by domain specialists (biologists, MDs, biostatisticians, …), with that in mind…

Currently, a computer program in Scala to model drugs looks like this:

Compound create "Sulfadoxine"
Compound abbreviation "S"
Compound half_life 116 //hours
Compound bio_availability 408 //1mg to nanoM
val Sulfadoxine = Compound prepare
 
Compound create "Pyrimethamine"
Compound abbreviation "P"
Compound half_life 83 //hours
Compound bio_availability 34 //1mg to nanoM
val Pyrimethamine = Compound prepare
 
Drug create "SP"
Drug includes Sulfadoxine quantity 500
Drug includes Pyrimethamine quantity 25
val SP = Drug prepare

Discussion:

  • I am using the “object companion” pattern a lot. The idea is that all “stateful” mess is stored “prepared” in the object (which is the DSL source). When the prepare method is invoked in the object a class (with only immutable vals, very lovely for those of you who are functional programming enthusiasts) is created.
  • Notice the dependence on operator precedence in Drug includes quantity (there is not really one, strictly speaking, but assume there is). I would really like to have, per class, the ability to define operator precedence in a way that is not based on dictionary order (à la Prolog).
  • I don’t like the val SP = Drug prepare. It is too verbose and too geeky. I would prefer just Drug prepare. I believe that this is possible in Scala, at least at the interpreter level (as the Scala interpreter does it), but I still don’t know how. The idea would be that a val named SP would be added to the local scope in some way. For those computer inclined readers that think that I am being too pedantic and nit-picking, I just have one thing to say: I am really trying to make the system as pleasant as possible for non-programmer types, and I think my proposal does not sacrifice elegance and generality (although I would recognize that the non-explicit name creation is “strange” – but, hey, the Scala interpreter already does it!)

Caveat emptor, big one: Although drugs (compounds) are discussed in terms of half-lives, bioavailability, etc… these properties are actually not of the drug but of the interaction between the drug and the individual. Making them drug properties only is a “cognitive abuse”, although it has its uses. For instance, my advisor, after looking at the language, was talking about bioavailability for counterfeit drugs and for children between 2 and 5 years. A great example that they are not properties only of the drugs but also, at least, of individuals (and not only that, for instance many drugs are more bioavailable if they are taken in conjunction with, say, fatty foods).

A proper, precise, computational modeling of drugs would be a gigantic undertaking. I have a different approach: Modeling as close as possible to the average domain discourse and hook, in some way, the necessary precision, should the need arise. It is worth noting that “incorrect”, “imprecise” modeling is enough for many tasks.

Filed in: Scala, bioinformatics, declarative programming, malaria, metaprogramming

by: tiago

2 Comments

Learning Scala: Types

I will document my effort in learning Scala. The main objective is to help in building up as much content as possible on the Web about Scala. Caveat though: I am learning, therefore what I document here might be rubbish.

The problem: I need an arithmetic expression parser to be able to deal with drug isobolograms like these 2 taken from Wang et al.:

[Figure: isobolograms, from Wang et al.]

Very roughly speaking, the concentration “curve” that you see is when the drug combination becomes active against the parasite. In this case the compounds are Sulfadoxine (SDX) and Pyrimethamine (PYR) used to combat P. falciparum malaria.

I want to be able to build expressions for those curves. These expressions run in contexts, that is, if one “curve” (here approximated by a line) is 133*SDX + 2000 - PYR then there is a need for having a variable SDX and another PYR.

[People with a functional programming background might immediately recognize the typical newbie exercise of doing an expression evaluator... In fact you can find it in various Scala documentation... and Caml]

So I need, what is called an environment, a store of mappings from symbols to values. Something like

{case "PYR" => 1800.0 case "SDX" => 5.0}.

First problem: Doubles and Ints. There seems to be no way to specify that the return is a Number. I would like, sometimes, to use Numbers irrespective of the specific subclass. I would like to do something like:

type Exp : String => Number

The problem is that one cannot have a type declaration at all, unless inside a class. But, but, but the type is not visible outside the class as a type (i.e. just for the type info, without the need to instantiate to access it). Maybe putting it into an object? I did not try it, but it would be clumsy.

I ended up with this solution for now:

trait Env {
  def apply(name : String) : Double
}

Can live with it. The Double instead of Number hurts more.
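
With a trait like Env in hand, the expression evaluator itself could be sketched as below. This is only one possible shape, assuming a small tree of case classes; Expr, Num, Var, Add, Sub, Mul and Eval are hypothetical names, not something from a library. Env is repeated so that the block stands alone.

```scala
// Repeated from above so the block compiles on its own.
trait Env {
  def apply(name: String): Double
}

// A tiny expression tree: numbers, variables and three operators.
sealed trait Expr
case class Num(value: Double) extends Expr
case class Var(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Sub(left: Expr, right: Expr) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr

object Eval {
  // Walk the tree; variables are looked up in the environment.
  def eval(e: Expr, env: Env): Double = e match {
    case Num(v)    => v
    case Var(n)    => env(n)
    case Add(l, r) => eval(l, env) + eval(r, env)
    case Sub(l, r) => eval(l, env) - eval(r, env)
    case Mul(l, r) => eval(l, env) * eval(r, env)
  }

  // An environment holding the two drug concentrations from the example.
  val env: Env = new Env {
    def apply(name: String): Double = name match {
      case "PYR" => 1800.0
      case "SDX" => 5.0
    }
  }

  // 133*SDX + 2000 - PYR as a tree.
  val line: Expr =
    Sub(Add(Mul(Num(133), Var("SDX")), Num(2000)), Var("PYR"))
}
```

Eval.eval(Eval.line, Eval.env) evaluates the line for the given concentrations. A parser would only have to produce the Expr tree; the evaluation stays the same.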

Again, I am beginning. Please don’t be too harsh on me ;)

By the way: It would be nice to have Scala support in GeSHi so that WordPress blogs could render Scala code beautifully. No, I am only suggesting; I don’t have the time to actually do it myself.

Filed in: Scala, bioinformatics

by: tiago

No Comments

Scala for bioinformatics

I am seriously considering doing the core of my work (at least when I have the freedom to decide) in Scala. The reasons? Well, I can give them in the form of requirements:

  1. Domain Specific Language support, that is:
    1. Making declarative programming easy
    2. Ability to show the code to non-programmers in a form that is readable and understandable (I will talk about this topic a lot in the future).
  2. DSLs should be embedded and not stand-alone. A DSL (say, one to model the spread of malaria drug resistance) can be made in any programming language, really. But embedded languages (i.e., where the DSL resides inside the host language) cannot really be done in most languages. Embedding allows for “unlimited” extensibility (Turing completeness, some would say). Prolog is still my favorite here.
  3. Availability of a wide range of libraries (think math libraries, chart libraries, bio libraries). All JVM-based languages can use Java libraries. This more or less kills Prolog, Caml and Haskell.
  4. Easy multi platform support. Think Linux, Mac OS X and Windows. With not much pain. Kills most non-VM languages and “system” languages (C, C++, Fortran).
  5. By the way, I refuse to malloc. I was born in the 70s, not retired in the 70s.
  6. Lively, clever and helpful community.
  7. Strong typing; better yet, strong typing with type inference. I don’t think typing in traditional “scripting” languages (think Ruby, Python, Perl and friends) scales when the code base grows: it is overrated, and debugging becomes a mess. Caml wins here. Scala type inference seems to fail sometimes (i.e. requiring the programmer to explicitly specify the type). Java-style languages force you to always be verbose, which kills productivity.
  8. The language should be seen by its creators mainly as a production vehicle and not as a research vehicle. A big no-no to Prolog here. Haskell goes the same way. Scala seems to strike a reasonable balance. I need to produce reliable code, so I require a reliable compiler/interpreter.
  9. I have a strong bias towards the JVM: Open source and open development process (Java Community Process), robust, widely supported, massive user base. .NET, being in practice vendor locked (I don’t think Mono is really a viable alternative as MS really controls whatever they want to control) is out. At the end of the day I also have a soft spot for Java. There are many things that I don’t mind doing in Java.
  10. Introspection. Caml fails here. I actually don’t know how Scala fares here, but at least JVM mechanisms are enough for me.
  11. Striking a good balance between cognitive freedom and damage control on bad code design. As an example, Java gives little freedom with regard to how you express your ideas. Perl, on the other hand, allows you to make a big mess (without really giving you expressive power, actually). Functional and logic programming languages shine here.
  12. Over-engineering might be good for supporting all possible use cases, but it is a productivity disaster to code in. I am thinking of Java here. All libraries are difficult to use by design. Even 3rd party libraries seem to be designed mostly with a complexity culture in mind.

Scala seems to be the option that tackles most issues. To be honest, I was always frustrated with all languages because each missed some crucial point in a big way. Prolog is too “researchy”, Haskell also, C too low level, Java too verbose and too freedom-curtailing. Perl and C++ are a complete mess (although in different ways).

Python is almost there (Major: Jython lags. Minor: weak DSL and functional-paradigm support). JRuby is probably there. Scala is probably there. My gut feeling points to try out Scala.

Filed in: Scala, bioinformatics

by: tiago

8 Comments