Still inspired by the Depth-First post The Best API May Be No API At All: PubChem and PDB, I would like to suggest a trick for reusing the results of computationally intensive work.
The concepts presented here are object serialization and persistence. The practical case will be presented in Python, but the same idea works in Java and most probably in other modern languages like Ruby.
Example scenario
Imagine that you go through all the PDB files available (I think there are around 40,000) and record the minimum distance between each iron found near a protein and the protein itself. You might end up with a dictionary whose key is the PDB ID (like 1FZY) and whose value is a list with the minimum distance, in Angstroms, of each iron to the protein. From this dictionary you can do all sorts of things: see which protein has the smallest distance to an iron, compute the average distance of all irons to proteins, or analyze by enzyme type (by the way, you might also build a second dictionary that records, for each PDB entry, the enzyme type).
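To make the data structure concrete, here is a minimal sketch of such a dictionary and of two of the statistics mentioned above. The distances and the IDs other than 1FZY are invented placeholders, and the function names closest_entry and mean_distance are hypothetical, chosen only for this illustration.

```python
# Hypothetical sample data: PDB ID -> minimum distance (in Angstroms)
# of each iron to the protein. All numbers are invented for illustration,
# and "0AAA" / "0BBB" are placeholder IDs, not real PDB entries.
iron_distances = {
    "1FZY": [2.1, 3.4],
    "0AAA": [1.9],
    "0BBB": [],  # an entry with no irons at all
}


def closest_entry(distances_by_id):
    """Return (pdb_id, distance) for the smallest iron-protein distance."""
    candidates = [
        (min(distances), pdb_id)
        for pdb_id, distances in distances_by_id.items()
        if distances  # skip entries without irons
    ]
    distance, pdb_id = min(candidates)
    return pdb_id, distance


def mean_distance(distances_by_id):
    """Return the average distance over all irons in all entries."""
    all_distances = [
        distance
        for distances in distances_by_id.values()
        for distance in distances
    ]
    return sum(all_distances) / len(all_distances)


print(closest_entry(iron_distances))
print(mean_distance(iron_distances))
```

Both functions only walk the dictionary once, which is why this part of the work is cheap compared with parsing the PDB files.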
The first strategy might be to:
parse_and_process_all_PDB_files_gathering_required_information  # Takes hours
# Now you have:
#   enzyme_types   a dictionary that records, for each PDB entry,
#                  the enzyme type (if it is an enzyme)
#   iron_distances a dictionary that holds, for each PDB entry,
#                  the minimum distance for each iron to the protein
compute_all_sorts_of_interesting_statistics  # Rather fast
This has a big drawback: every time you want to compute new statistics you have to repeat the whole process, which takes hours. Furthermore, if you are not using a local copy of the PDB database, you depend on the network and put load on the servers at the other end.
I would like to propose an alternative: split the work into two programs:
parse_and_process_all_PDB_files_gathering_required_information  # Takes hours
save_to_disk(enzyme_types, iron_distances)
and
enzyme_types, iron_distances = load_from_disk()
compute_all_sorts_of_interesting_statistics  # Rather fast
After parsing and processing the raw data (and maybe fetching it from servers), you save the result to disk. Whenever you want to compute new statistics, you load the processed data from disk and do the computations without repeating the parsing and processing.
That is, the time-consuming parsing and processing phase runs only once, or whenever you need new raw information.
How to do this? Complicated, you think? Not at all… You just have to like pickles (in Python)…
Pickles as a healthy diet component in your Python programming
We will make use of the pickle module.
From the module itself:
The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or “flattening”, however, to avoid confusion, the terms used here are “pickling” and “unpickling”.
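As a quick illustration of that definition, the following sketch round-trips a small dictionary through a byte stream in memory, using pickle.dumps and pickle.loads. The distances are invented values, used only for the example.

```python
import pickle

# Invented example values: PDB ID -> iron-protein distances in Angstroms.
iron_distances = {"1FZY": [2.1, 3.4]}

# "Pickling": the object hierarchy becomes a byte stream.
byte_stream = pickle.dumps(iron_distances)

# "Unpickling": the byte stream becomes an equal, but separate, object.
restored = pickle.loads(byte_stream)

print(type(byte_stream))
print(restored == iron_distances)
print(restored is iron_distances)
```

The restored dictionary compares equal to the original but is a new object, which is exactly what you want when the original lived in a program that has long since finished.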
OK, now continuing our example, how would we do it?
Saving enzyme_types and iron_distances is as simple as:
import pickle

# You do your stuff and create enzyme_types and iron_distances
with open('data.pkl', 'wb') as pickle_file:
    pickle.dump(enzyme_types, pickle_file)
    pickle.dump(iron_distances, pickle_file)
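That is it; it is as easy as that (you can go ahead and open data.pkl with a text editor if you are so inclined, although with recent Python versions the default pickle format is binary, so most of it will not be readable).
To load the data? It is just as simple: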
import pickle

with open('data.pkl', 'rb') as pickle_file:
    enzyme_types = pickle.load(pickle_file)
    iron_distances = pickle.load(pickle_file)

# Compute statistics
Just one point: objects come back in the order in which they were saved. If, for some reason, you just want iron_distances (which was saved after enzyme_types), you still have to load enzyme_types first, so that pickle consumes it. On the other hand, if you just want enzyme_types, you can ignore iron_distances, as it was saved after.
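One way to avoid having to remember the saving order is to pickle a single dictionary that holds both structures and pick out what you need by name after loading. This is only a sketch of that variation, not something the approach above requires; save_results and load_results are hypothetical helper names.

```python
import pickle


def save_results(path, enzyme_types, iron_distances):
    """Pickle both structures as one dictionary, keyed by name."""
    results = {
        "enzyme_types": enzyme_types,
        "iron_distances": iron_distances,
    }
    with open(path, "wb") as pickle_file:
        pickle.dump(results, pickle_file)


def load_results(path):
    """Load the dictionary written by save_results."""
    with open(path, "rb") as pickle_file:
        return pickle.load(pickle_file)
```

With this layout, load_results('data.pkl')['iron_distances'] gives you the distances directly. The whole file is still read, so you gain convenience rather than speed.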
Caveat: you need to design your intermediate structures well, so that you do not have to rerun the parsing and processing phase every time you think of a new statistic (in my experience this is not that hard, even for people with little programming experience).
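If you prefer a single program to two, the save and load steps can be folded into one small helper that only runs the slow phase when no saved file exists yet. This is a minimal sketch under that assumption; load_or_compute and its parameters are hypothetical names, and compute stands for whatever function does your parsing and processing.

```python
import os
import pickle


def load_or_compute(cache_path, compute):
    """Return the pickled result at cache_path, computing it first if needed.

    compute is a zero-argument function that performs the slow work.
    """
    if os.path.exists(cache_path):
        # Fast path: reuse the result of an earlier run.
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file)

    # Slow path: do the work once and save it for next time.
    result = compute()
    with open(cache_path, "wb") as cache_file:
        pickle.dump(result, cache_file)
    return result
```

When you do need fresh raw data, deleting the saved file forces the slow phase to run again on the next call.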
3 Comments to "Reusing results from (intensive) computations"
Filed in: Python, bioinformatics
Do you know an equivalent for Perl? I just dump the processed info into a tabbed text file. This requires reading the text file back into a hash. Not that it takes long to do, but saving a formatted object to disk sounds useful.
I am not an expert in Perl at all, but section 10.2 of “Advanced Perl Programming” seems to be a good overview of Perl alternatives to Python's pickle module:
http://www.unix.org.ua/orelly/perl/advprog/ch10_02.htm
Chapter 10 is conveniently called “Persistence”
Tiago, I am new to your blog and I love to read advanced programming examples for bioinformatics. Keep up the great posts!
Pedro, I have been using YAML for serialization in Perl and Ruby:
http://www.yaml.org/
Cheers,
Adam