I was reading Depth-First’s article “The Best API May Be No API At All: PubChem and PDB” and decided to relay here my experience helping a colleague process PDB data.

To begin with, that person wanted to bulk-analyze many (thousands of) PDB files. Furthermore, she only knew Java, and very little of it.

My suggestion was:

  1. Download all PDB files from PDB using FTP (“all” being the keyword here)
  2. Use Python
  3. Parse the files yourself (i.e., don’t use Biopython’s Bio.PDB)

This is in line with the idea of the “best API being no API at all” (I am not suggesting this generally, but in this case it made sense).

I suppose some justification for so many counterintuitive suggestions might be in order…

For point 1: The person really wanted to analyze a lot of files in bulk, so it made sense just to download them all. As far as I remember, we are talking about less than 10GB. I wonder whether this might make sense even when we only want to use a few hundred or a few thousand PDBs: a 10GB download is not that much nowadays, it doesn’t take that much space on disk, and it doesn’t take that much bandwidth. Regarding being friendly to RCSB, I ask what is worse for them: one big download, or many queries using CPU, databases, etc.? As for users, they can now query locally, and if you look at the PDB format, a few greps piped together can go a long way and give a lot of flexibility.
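To give an idea of what such a local query can look like, here is a minimal sketch in Python of a grep-like search over a directory of downloaded files. It assumes the files are plain or gzipped PDB-format text and relies on the fixed columns of the HEADER record (classification in columns 11-50, deposition date in 51-59, ID code in 63-66). The function names read_header and find_by_class are made up for this example.

```python
import gzip
from pathlib import Path


def read_header(path):
    """Return (classification, date, id_code) from the HEADER record, or None."""
    path = Path(path)
    # Files from the archive are usually gzipped; plain text works too.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.startswith("HEADER"):
                # PDB records are fixed-width, so plain slicing is enough.
                return (
                    line[10:50].strip(),  # classification
                    line[50:59].strip(),  # deposition date
                    line[62:66].strip(),  # ID code
                )
    return None


def find_by_class(directory, keyword):
    """Yield the ID codes of entries whose classification contains keyword."""
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        header = read_header(path)
        if header and keyword.upper() in header[0].upper():
            yield header[2]
```

Something like list(find_by_class("pdb_files", "hydrolase")) would then list the matching entries, where "pdb_files" stands for wherever the download ended up.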

For point 2: I would like to stress that the person knew very little Java. I contend that learning Python (which has a smoother learning curve than Java) takes less time and is less frustrating (at least for users who are concerned only with results and not with the “joy of programming”) than learning the rest of Java plus the required system and Bio libraries (remember, Java libraries are much tougher and more over-engineered than Python’s).

For point 3: The PDB file format is reasonably easy. Between learning a new API (which is not free and requires understanding the API developers’ mindset) and processing the files manually, I suggested processing the files manually. This had the added benefit of teaching the person simple and very useful file processing. Please note that I am not suggesting reinventing the wheel (in fact, I tend to be strongly opposed to that). But with file processing this easy, it seemed to make sense. I would like to say, in my defense ;) , that I suggested using the wonderful matplotlib for chart drawing, and it never crossed my mind to suggest implementing a chart library from scratch.
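To show how little code “parsing the files yourself” needs, here is a minimal sketch (not what my colleague actually wrote) that reads ATOM and HETATM records by slicing their fixed-width columns. The function name parse_atoms and the dictionary keys are made up for this example. It also assumes well-formed records with plain integer serial and residue numbers, so unusual files would need extra care.

```python
def parse_atoms(lines):
    """Yield one dict per ATOM/HETATM record found in an iterable of lines."""
    for line in lines:
        if not line.startswith(("ATOM  ", "HETATM")):
            continue  # skip HEADER, REMARK, and every other record type
        yield {
            "serial": int(line[6:11]),     # columns 7-11
            "name": line[12:16].strip(),   # columns 13-16, atom name
            "res_name": line[17:20].strip(),  # columns 18-20
            "chain": line[21],             # column 22
            "res_seq": int(line[22:26]),   # columns 23-26
            "x": float(line[30:38]),       # columns 31-38
            "y": float(line[38:46]),       # columns 39-46
            "z": float(line[46:54]),       # columns 47-54
        }


def alpha_carbons(lines):
    """Return the (x, y, z) coordinates of every CA atom."""
    return [(a["x"], a["y"], a["z"]) for a in parse_atoms(lines) if a["name"] == "CA"]
```

With a file open as handle, alpha_carbons(handle) gives a list of coordinates that can be handed straight to matplotlib or anything else.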

So, sometimes, not using an existing API might be an approach worth considering.

PS - I still stand by my suggestions. Currently the person seems to have lots of questions about the chemistry of the problem. The programming problems are very rare. And I think that is the main point: computing and programming should not be the fundamental issue.
