Reductionism and simplification

This post starts with what might seem like a discussion about computing, but it is actually a poor man’s discussion of the philosophy of science and has nothing to do with computing; it is much more applicable to biology, economics and sociology.

Let’s be honest: computer scientists are trained to work for banks and insurance companies, to make web sites, software for cars and things like that. Those domains are actually very simple. A bank might be a gigantic institution, but it is possible to capture, granted, with a lot of effort, all its processes inside a computer program. This creates a mental setting: everything that we need to know can be known; we just decide when to stop.

Now think about those simplistic (this is an understatement) mathematical and computational models for scientific problems (differential equations, Monte Carlo processes, Markov chains, …). They model the “important parts” of the issue under study. These models are much simpler than the models running in computers to sustain day-to-day banking chores. Somehow it strikes me as strange that something as mechanical as a bank needs a more complex model than “nature”.

In the context of nature and of making mathematical and computational models of it, I have a few things in mind:

First of all, in many problems in the natural world we don’t know what the important parts are to start with. This is very different from the “bank mentality”, where you can know everything if you try hard. In my personal case, when I model malarial artesunate resistance, I am modeling something whose mechanism people can only speculate about, and even if the speculation is correct most of the fundamental parameters are unknown. I have yet to read a paper modeling something related to malarial drug use that doesn’t have a phrase like: “the relation between this value and reality is assumed to be this (no citation, or citing something unpublished, and no rationale provided)”.

But the cornerstone of my reasoning is that, in complex processes, the devil is in the details and in the interactions between participating factors (most of which we are unaware of). Soft sciences are holistic by nature. The properties of the whole system come from everything and everywhere. The “banking” and “hard science” mentalities are no good here: we cannot know everything, what we know is probably not enough, and most simplifications will lose something fundamental.

Does this mean that I am suggesting that we should stop modeling and all theoretical work? By no means, but we should refocus:

  • This is not hard science, so don’t try to mask it as such. Hard rules and sensitivity analyses are mostly artifacts to make things look more “serious” and more “demonstrated”. This is biology (or even “worse”, economics or sociology); you don’t get to write Q.E.D. here.
  • Think you can forecast the future? You think you can… then bring me an always-correct forecast of the weather 2 months from now and I will listen to you. Most models that exist to forecast the future are there because they are very hard to disprove TODAY: climate (as opposed to weather) models, epidemiology, … The vast majority of models that can be tested fail (think mathematical finance and the current subprime crisis in the USA, think weather predictions…).
  • Theoretical work, although unable to predict the future (or explain the past), might help create a cognitive and linguistic framework for discussion: present the fundamental concepts and narratives underlying the research process, make the discourse clearer and less cloudy, point out dangerous imprecisions. This is actually the inverse of what happens now: theoreticians speak in a language that most people struggle to understand.
  • Theoretical work can create interesting questions for field scientists to try to answer. This is the precise inverse of what happens now. We don’t want models that are rigged to look realistic. We want reasonable models that fail miserably, so that we can ask field scientists: This is failing, why do you think this happens? Have you considered this other hypothesis? What about testing it?

<sarcasm>
The existing modeling culture is quite good in the current scientific setting: it makes theoreticians look intelligent with all those complicated mathematics and computer programs (and associated publications) and excuses “practical” scientists from even trying to use their brains: they just apply the existing theory to their research questions, in a process that is more industrial than creative. The biggest example that I know of is phylogenetic analysis: get data from the field, compute a mutation model from the premise that a small genetic distance is better, burn CPU cycles, publish. You don’t even need a human for this; a trained monkey is probably enough.
</sarcasm>

In economics things are a bit worse: elaborate game theories and such are presented as a “hard, undisputed” justification for an economic theory serving some nice agenda. Nothing more than an argument from authority.

PS: If you work in a hard science like physics or chemistry you might be thinking that I am smoking something very strong. I don’t think that this post applies to hard sciences; that is a different game altogether.

Filed in: biology, science

by: tiago

1 Comment

Malarial drugs and the economics of (human) languages

There is an interesting lack of precision, to the point of “error”, in the way human language deals with some concepts.

Take, for instance, the concept of drug half-life, i.e. the time that it takes for the concentration of a drug to drop to half (drug concentrations in the blood are normally modeled through exponential decay). It is conceived as a property of the drug (people say that drug D has a half-life of H hours), but it is really a property of both the drug and the individual (actually it is much more complicated than that; we could repeat the argument).

And no, this is not just a matter of statistical deviations around a value that is acceptably approximated by the drug alone.

As an example, there is a study about the pharmacokinetic properties of Sulfadoxine-Pyrimethamine (a widely used, cheap antimalarial). In this study, there is a big deviation in half-life (and other parameters) for children between 2 and 5 years old. The study concludes that “dose recommendations need revision” for that group. To put it another way, half-life (and other parameters) is not (only) a function of the drug.
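A minimal sketch of this point, assuming the simplest possible model (first-order elimination, i.e. the exponential decay mentioned above): the half-life is ln(2) divided by an elimination rate constant. The two rate constants below are made-up illustrative numbers, not values from the study; the only thing the sketch shows is that the same drug with two different individual rate constants gives two different half-lives.

```python
import math


def half_life(elimination_rate_per_hour):
    """Half-life (hours) under first-order elimination: ln(2) / k."""
    return math.log(2) / elimination_rate_per_hour


def concentration(initial, elimination_rate_per_hour, hours):
    """Concentration after `hours`, assuming pure exponential decay."""
    return initial * math.exp(-elimination_rate_per_hour * hours)


# Hypothetical elimination rate constants for the SAME drug in two groups.
# These numbers are invented for illustration only.
ADULT_RATE = 0.005   # per hour
CHILD_RATE = 0.010   # per hour

adult_half_life = half_life(ADULT_RATE)
child_half_life = half_life(CHILD_RATE)

print("adult half-life (h):", round(adult_half_life, 1))
print("child half-life (h):", round(child_half_life, 1))
```

In this toy setting, quoting a single half-life "for the drug" amounts to silently picking one of the two rate constants.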

Now, I am not suggesting that the concept of half-life tied just to the drug should be thrown away. I am just speculating about why it is framed as a function of the drug only when clearly that is not the case.

First, there is probably historical inertia: the concept was first framed that way at a time when it seemed that half-life depended only on the drug, and it stuck through “memetic” inertia.

But, much more importantly, it is still there because it is both less expensive (it is easier to express half-life as a function of just the drug than of other parameters, which might still be crucial in some situations) and still meaningful enough in many contexts (for instance, expressed as a function of the drug it is still useful to compare the half-life of Artemether, which is short, against that of Sulfadoxine, which is long, for many kinds of reasoning). Even when the most economical concept entails some errors it might still be practical. The problem only arises when its simplicity has bad consequences (in this case, wrong drug doses)… but, in certain contexts, it might be a problem, a serious problem (see my previous text about the notions of resistance, tolerance and sensitiveness for an example).

It all depends on the discourse context, but one should be careful.

As an anecdotal example, if you are seriously ill and a doctor prescribes you a pill, would you prefer to hear “this will cure you” or “this will drop the parasite load at a rate of 1 order of magnitude per hour starting 3 (90% CI of 2.5 - 3.5) hours after intake. Parasite load is expected to drop to 0 in 10 hours”?

The problem arises when the cognitive bias of the simplicity of “this will cure you” gets into more rigorous contexts.

This has implications for the computational modeling of concepts. The tradition in computer science is to “dig down” to the “real meaning” of concepts. In that sense simpler explanations are deemed “wrong” (and should be rewritten in terms of “correct” conceptualizations). Maybe a different strategy is needed, one that brings some linguistic and cognitive economy to computational systems (while still maintaining rigorous and precise reasoning and conceptualization when that is needed, as human languages can do).

I am going to stop here, but I think that one of the problems that impairs mathematical modeling is the application of the “certainty of numbers and formulas” to non-rigorous concepts. Then you have the worst of both worlds: an argument from authority (mathematics is a foundation for authority: “the numbers prove it”) based on modeling vague, imprecise and wrong concepts. But that is a topic for another post.

Filed in: bioinformatics, cognition, malaria, science

by: tiago

No Comments

Holy Grail: The quest for THE programming language

Being a computer scientist with a strong interest in languages (languages in the broadest sense possible: programming, natural and cognition related issues), I am on a holy grail quest for a programming language that:

First and foremost, allows me to express my computations in a way that is close to the problem domain (as opposed to close to the machine). As I am working in a biology setting, that means being able to talk about concepts around genes, epidemics and pharmacology in my programs. I don’t want to think about CPUs, memories and things like that when I am coding. Prolog and Lisp are good examples here. I also need programs that can evolve over time as knowledge changes, so I need strong metaprogramming and Domain Specific Language facilities.

Unfortunately, I have a couple more requirements coming from day-to-day reality…

Real world: I want a language that interacts with existing libraries and that I can easily make available to other people to use, inspect and change. I need Bio* libraries and graphics plotting libraries. In my personal case I decided that I want to work inside the JVM, so I need a language that works in the Java world (Jython, JRuby, Scala, Groovy, … Java).

Software engineering: Programs have to be easy to maintain and debug. I guess there is no way around explicit typing on the debugging and tool construction front.

Ridiculous religious fanatic quest? Yes, it might be, but I am pursuing it.

The truth is that we are not far away from this grail.

Scala is almost there. It lacks metaprogramming, and things like type inference are a bit amateurish (compare it with Caml).

JRuby is maybe there; I could live with it, I guess. The lack of explicit typing will make things difficult in the long run on the software engineering front.

I decided to give a final try to yet another language, Groovy, and up to now it is going quite well. It seems to nail all the fundamental points. I especially love the effort put into good metaprogramming facilities.

I decided, for pragmatic reasons, that after this one I will stop my pursuit of the grail. If Groovy proves a blunder of some sort I will revert to JRuby and carry on.

Filed in: bioinformatics, declarative programming, groovy, metaprogramming, science, software engineering

by: tiago

6 Comments

Automated GUIs for OO models and DSLs

One of the most delightful things in bioinformatics is the possibility of working with people with really different mindsets. Surely CS geeks are amazing, and every day I feel that my original background is really a comparative advantage, but, from where I look, nothing beats being in an environment with scientific and cultural diversity. But let’s talk some geekiness now:

A couple of years ago, I wrote a population genetics simulator in Caml. It was really flexible, allowing for many demographic and genomic scenarios, mating rules, selection… really flexible. I never tried to publish it because there are many good simulators around (I suggest simuPOP, if you are looking for one) and it would take some time to make it robust and documented enough for public exposure. But the interesting part is that, when I went to my MSc supervisor (an “old-type” biologist) and gave him a very exuberant explanation of how flexible the simulator was, he made only one comment: That is all very well and good, but you did not show me the easy to use graphical interface!

Fast forward a couple of years… I am developing a DSL to model drug resistance in the context of infectious diseases. I went to my PhD advisor (a population geneticist, malariologist and biostatistician who knows how to program in C), showed him my rough prototype, and he said: People will be able to read this, but to interact they will want an easy to use graphical user interface. To be honest, this time I was expecting the comment (I have been living among experimentalists long enough to have learned something). I have no expectation that domain specialists will write in my DSL (well, maybe a couple of them will, if things pick up). If I end up giving my system away to domain specialists, it will have to have an easy to use interface; there is no escaping that.

Well, DSLs (at least in Scala and in Ruby) have an underlying OO model, which most of the time is neither complex nor big. I am starting to suspect that it won’t be too difficult to automatically generate an easy to use interface for entering, in a “nice” way, what could be rendered as DSL programs (or object instances and relationships, if you prefer to look at it that way). For embedded DSLs, which have the whole expressive power of the host language available, that would be infeasible to do completely, but at least part of it could be automated. Obviously this idea is not new at all; it is just a rehash of what Lift or Rails do for databases.
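To make the idea a bit more concrete, here is a rough sketch in Python (not Scala or Ruby, and not my actual DSL). It assumes the OO model is a set of plain data classes; the `Drug` class, its fields and the widget names are all hypothetical. The sketch inspects the class to derive a description of an input form, and then builds an object instance from the raw strings such a form would return.

```python
from dataclasses import dataclass, fields


# Hypothetical piece of an OO model behind a DSL (illustration only).
@dataclass
class Drug:
    name: str
    half_life_hours: float
    in_use: bool = True


# Assumed mapping from field types to generic widget kinds.
WIDGET_FOR_TYPE = {
    str: "text box",
    int: "number box",
    float: "number box",
    bool: "check box",
}


def form_spec(model_class):
    """Derive (field name, widget kind) pairs from a data class."""
    return [
        (field.name, WIDGET_FOR_TYPE.get(field.type, "text box"))
        for field in fields(model_class)
    ]


def convert(field_type, raw):
    """Convert the raw string typed into a widget to the field's type."""
    if field_type is bool:
        return raw.strip().lower() in ("true", "yes", "1")
    return field_type(raw)


def build_instance(model_class, raw_values):
    """Build a model object from a dict of raw form strings.

    Fields missing from raw_values fall back to the class defaults.
    """
    converted = {
        field.name: convert(field.type, raw_values[field.name])
        for field in fields(model_class)
        if field.name in raw_values
    }
    return model_class(**converted)


# Usage with made-up values: what a generated form might hand back.
example = build_instance(Drug, {"name": "ExampleDrug", "half_life_hours": "2.5"})
print(form_spec(Drug))
print(example)
```

A real version would need to handle relationships between objects and validation, which is where the “part of it could be automated” caveat applies.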

I am aware that graphical programming languages never got very far (I actually dislike them), but the scope and context here are completely different, so different premises apply. This might be one way of lowering the barrier to rigorous modeling for a wider crowd.

Filed in: Caml, Ruby, Scala, bioinformatics, declarative programming, science, software engineering

by: tiago

No Comments

Pissed with Connotea

I am (was?) a Connotea user (my Connotea bib reference still sits in the right column of this blog). Today, while writing a report, I tried to import my references from Connotea… The authors of many papers are simply not there. I tried RIS and bibtex (what I use) export, nothing. The authors are not there (I inspected the bibtex export). Yes, maybe I should have checked the quality of the citation, but isn’t the point of selecting a DOI (which I always did when clicking my “Add to Connotea” bookmarklet) to take away the burden of specifying all the important details?

I don’t really like to write negatively, but, in fact, I have had a few more problems with services from the Nature Publishing Group:
1. Performance and downtime: Sometimes submitting a citation takes ages, or the service is down (although, regarding downtime, it seems to be getting better).
2. Postgenomic: When I tried to use it, it was mostly down. Want anecdotal evidence? If you search now (as of the posting date of this entry) for postgenomic on Google, the cached page says: “Unable to select database”, nothing more.
3. Postgenomic again: A few months ago I submitted my blog. I got no answer at all.

Well, let me go back to my report and to manually correct entries downloaded from Connotea into JabRef (JabRef, by the way, can download PubMed entries automatically and correctly).

Updated with screenshot of Google cache (click the thumbnail for full sized screenshot):

Google cache of postgenomic

Filed in: science

by: tiago

3 Comments

Conservation Genetics Data Analysis Course

There won’t be many posts here during the next couple of weeks as I am one of the organizers of the Conservation Genetics Data Analysis Course. Feel free to have a look at the website. Comments are most welcome.

Filed in: bioinformatics, science

by: tiago

No Comments

Comments to Alexei Drummond’s interview on Blind.Scientist

After a somewhat “rantish” response to Alexei Drummond’s interview on Blind.Scientist, I have put in a few better-tempered comments on the interview. The reason I reacted so fast (and so unwisely) is that most of the content bears directly on what I am currently doing.

So here are my comments on the interview, addressing each point that interests me:

When biologists start asking about where they can learn to program a computer, just so they can do their job you know something is wrong!

This line of thinking seems to be highly pervasive among researchers in Biology/Computational Biology/Bioinformatics. This is my main point of disagreement. I do think that in this brave new world everybody will have to know how to do basic scripting. I am not talking about building an industrial-strength application, just about basic data moving and processing. Just as maths became a fundamental tool in the 20th century, basic programming will become one too, especially as more data becomes available and lab work becomes more automated and faster/easier to do.

Firstly, software development isn’t science.

With this I completely agree, although I suppose the author is not referring to some of the algorithms that underlie the application (like an alignment algorithm). But just building an application is not science; it is enabling science, which is quite different.

Secondly, most academic programmers are not interested in (or good at) designing user interfaces, and certainly developing software is not a scientific outcome that gets recognized like publishing papers does.

Designing a good user interface is surely not something I think should be required of scientists when I say that basic scripting is becoming a requirement. It is interesting to note that one can publish papers on applications (program notes, application notes, …), so one can build a “scientific” CV with applications. Whether it makes sense to use the same reward system for applications and research papers is a completely different issue, but for now one can get publication entries on the CV with applications.

Thirdly, academics are quite bad at supporting software and documenting it.

Most are bad at developing software in the first place. Using publication as a reward for an application might make sense (if at all) after an application is well established, but I doubt it makes much sense at the beginning of the life cycle of the application. Call me cynical, but in this publish-or-perish culture, putting the reward at the beginning of the life of a product is a strong invitation not to support it at all (as the main reward has already been obtained).

So it seemed to me, that for a lot of reasons, a professional software company was the best avenue to realize a software system that would dramatically improve the productivity of molecular biologists by putting bioinformatics at their fingertips.

This makes full sense, but the idea that all of the programming effort can be taken out of the hands of scientists seems exaggerated to me. My main line of reasoning is that most science is a creative process, not a factory process, and some of that creativity cannot be foreseen by application developers. So some “tweaking” will be needed by the final user (even in less creative professions, word processors and spreadsheets sometimes have to be programmed), and that tweaking is really something like “script programming”.

Java is a general-purpose programming language — so you can do in Java pretty much anything you can do in software. The main reason for choosing Java is that it is very easy to write sophisticated user interfaces that run on Windows, Linux and Mac OS X.

I currently use the same line of reasoning when developing software. I am currently working on a selection detection application that works inside the JVM. Note that I say JVM and not Java. One can use the good things of the JVM (portability of libraries, especially Swing and AWT) from other languages that run on the JVM; Jython and JRuby come to mind.

While I do subscribe to the JVM almost completely, I have some doubts about Java. For small applications it is clearly an over-engineered language. Even for big applications, although I think some of Java’s features are good (like explicit typing), there is space for extensibility to be provided by scripting languages like Jython/JRuby (MODELER4SIMCOAL2 works just like that).

For biologists beginning programming (another issue), I would surely not start by teaching Java because of the excessive verbosity and difficulty in getting “simple things done” that puts off a lot of people. Furthermore the learning curve is steep. Python would be my clear suggestion on this front.

Our goal is a happy marriage where academic programmers can get on with developing great new algorithms, and Geneious can provide the interoperability, the user interface and the support.

One of the best ideas I have read in a long time. There is a big difference between devising an algorithm and developing an industrial-strength, easy to use (and, I would like to add, scriptable and extensible) application, plus maintaining and supporting it. The reward that makes most sense to me for the algorithm is the publication; the reward for the application should be money, to put it in simple terms. The bridging between the two sides of the equation can (should) be done in the way Alexei proposes.

Filed in: bioinformatics, science

by: tiago

5 Comments

Blind review process

I am currently reviewing five papers for a conference which uses blind review (I will blind you of the conference name ;) ).

On 2 papers, the author forgot to remove his ID.

On 2 papers, googling for necessary content to perform the review trivially identifies the authors (like, there is only one result for a certain query on scholar).

On 1 paper, there is a link to a personal homepage where you can find datasets related to the results.

Filed in: science

by: tiago

No Comments

Average quality of science information sources

Nowadays most of my science readings come from Google Reader. I have mainly 2 folders: “bio” for scientific journals and “bioblogs” for Blogs.

I can “read” “bio” in half a day per week or so (I have more than 30 feeds). I need almost a day for “bioblogs” (less than 15 feeds). The number of articles on “bio” is probably one or two orders of magnitude higher than the number on “bioblogs”.

To put it another way: I consider the blog content to be of much higher quality than its scientific journal counterpart…

…And I think I know the reason…

…Currently, scientists are mostly evaluated by their ability to publish in scientific journals (that postdoc grant, financing a project, getting tenure…), so there is strong pressure to publish. People publish everything that, in their own evaluation, is… publishable.

On the other hand, blogging is still mostly something people do because they want to and because they feel it is valuable.

Of course, we can already see, informally, that people profit from blogging (making contacts, publicity, etc). I would bet that in the near future blogging will be part of formal evaluation processes too (I can already smell measures of quantity of blog posts, inbound links and stuff…).

Care to make a prediction on the average quality of blog content in the future?

Filed in: science

by: tiago

4 Comments