We had two articles recently published:
- G. Moreno-Hagelsieb, B. Hudy-Yuffa, Estimating overannotation across prokaryotic genomes using BLAST+, UBLAST, LAST and BLAT. BMC Res Notes 7, 651 (2014).
- N. Ward, G. Moreno-Hagelsieb, Quickly Finding Orthologs as Reciprocal Best Hits with BLAT, LAST, and UBLAST: How Much Do We Miss? PLoS ONE 9, e101850 (2014).
The story goes as follows. At a talk, I heard that some group was using UBLAST, rather than a Hidden Markov Model approach, to quickly find members of some protein families. They said it was much faster, so I became curious. I downloaded USEARCH (version 5 back then) to test it on the things I commonly do with NCBI’s BLAST, and I was surprised at how fast the program ran. I thought that testing this program for some task would be a good project for an undergrad student, and that became Natalie’s undergrad thesis. Back then, the work was about using different options under USEARCH to try and get as much coverage with UBLAST as with NCBI’s BLAST (UBLAST was not an option in USEARCH 5; rather, a local alignment search had to be done). We became more ambitious and decided to test a few more programs. BLAT was something I was already playing with, while an article by Jonathan Eisen (Darling et al., PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2, e243. 2014) pointed me in LAST’s direction (besides, reviewers asked for more programs to be tested).
Later on, at some other talk (I think it was a talk by Robert Beiko), the speaker mentioned that BLAST was too slow for some task, and I asked him why not try UBLAST. He said something to the effect of not knowing how much they might miss.
The articles we published cover one task each. One is the task of finding orthologs as reciprocal best hits, which is pretty straightforward: how many orthologs does each program find compared to BLAST? Essentially, finding orthologs as reciprocal best hits does not require finding every possible match. Top matches are enough. So, if UBLAST, for example, found just a few top matches (under version 5, we could control the number of matches found before the program stops looking), that would be enough to determine the best hit, and thus to figure out reciprocal best hits. We thought we might miss many matches but still find most of the reciprocal best hits, and that’s what we found to be the case, except between evolutionarily distant genomes (see second reference above).
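To make the reciprocal best hit idea concrete, here is a minimal sketch in Python. It is not the code from the article. It assumes the hits have already been parsed into (query, subject, score) tuples, the function names and the toy protein identifiers are made up for illustration, and ties are broken by whichever hit appears first, which a real analysis might handle differently.

```python
def best_hits(hits):
    """Map each query to its top-scoring subject.

    hits: iterable of (query, subject, score) tuples (assumed format).
    On ties, the first hit seen is kept (an arbitrary choice for this sketch).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Toy data (hypothetical identifiers and scores).
full_a_vs_b = [
    ("a1", "b1", 250.0),
    ("a1", "b2", 90.0),   # weaker match, irrelevant for the best hit
    ("a2", "b2", 180.0),
    ("a3", "b1", 60.0),   # a3's best hit is b1, but b1 prefers a1
]
full_b_vs_a = [
    ("b1", "a1", 248.0),
    ("b2", "a2", 175.0),
    ("b2", "a1", 88.0),   # weaker match
]

# A faster program that reports only the top matches.
top_a_vs_b = [h for h in full_a_vs_b if h[:2] != ("a1", "b2")]
top_b_vs_a = [h for h in full_b_vs_a if h[:2] != ("b2", "a1")]

rbh_full = reciprocal_best_hits(full_a_vs_b, full_b_vs_a)
rbh_top = reciprocal_best_hits(top_a_vs_b, top_b_vs_a)
```

In this toy case, dropping the weaker matches leaves the set of reciprocal best hits unchanged, which is the intuition behind the test. What the sketch cannot show is the case where a faster program misses the top match itself.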
For the test on overannotation, the main idea was that for that task we compare proportions, not total numbers of matches. Thus, if UBLAST, LAST, and BLAT missed potential homologs, but still found proportions equivalent to those found by NCBI’s BLAST, then the programs would work fine for estimating overannotation. Well, that’s what we found (see first reference above).
Finally, why democratic genomics? Well, with tools that can run sequence comparisons in a fraction of the time that BLAST takes, and on a desktop computer at that, comparative genomics at a much larger scale becomes available to most, if not all, bioinformaticians. Why would I care? Well, because the more people can participate, the higher the number of ideas that can make it into the field. Not everybody has access to computer clusters. There are other avenues towards this democracy, like the availability of some precomputed homologies and orthologies. Yet, people will want to do their own tests for many reasons, from doubting the quality of existing data to testing genomes and protein sequences not already available in databases. Maybe there’s also a good chance that genome and protein comparisons will be done via cloud computing, and be quite accessible to mere mortals. Maybe web-based tools like RAST and MG-RAST are good enough for these tasks, so that we don’t need our own thing. I don’t know. For now I think that the more options the better. These two articles are not enough. Strategies should also be developed to avoid wasting time and effort comparing sequences. As we develop our ideas and test programs, we will publish our results either in articles or, if they are not enough for a proper publication, in blog entries.