Archive for the ‘bioinformatics’ Category

The whole 2015 Spring/Summer group

Here is the whole group in the lab of Computational conSequences during the Spring/Summer of 2015. I’d say that this is the best group ever.

Gustavo leaves today, going back to Michoacán, Mexico after spending his sabbatical here. Julie left a few weeks ago, also back to Michoacán. She might come back for the Spring/Summer 2016.

The only locals are Brigitte, Kissa, Thomas, César and me. Brigitte and Kissa are honorary members who have been in the lab for collaborative reasons, but work for their M.Sc. degrees with other faculty members at Laurier (Michael Suits and Geoff Horsman, respectively).

We’ve been working on phages, plant-growth promoting bacteria, 16S rRNA gene analyses, metabolic annotations, gene neighborhoods, predicting gene functions, and predicting metabolism and transcriptional regulation networks. Lots of fun.

Democratic genomics

We had two articles recently published:

  1. G. Moreno-Hagelsieb, B. Hudy-Yuffa, Estimating overannotation across prokaryotic genomes using BLAST+, UBLAST, LAST and BLAT. BMC Res Notes 7, 651 (2014).
  2. N. Ward, G. Moreno-Hagelsieb, Quickly Finding Orthologs as Reciprocal Best Hits with BLAT, LAST, and UBLAST: How Much Do We Miss? PLoS ONE 9, e101850 (2014).

The story goes as follows. At a talk by some group I heard that they were using UBLAST to quickly find members of some protein families rather than use a Hidden Markov Model approach. They said it was much faster, so I became curious. I downloaded USEARCH 5 back then to test it on the things I commonly do with NCBI’s BLAST. I was surprised at how fast this program ran. In any event, I thought that testing this program for some task would be good work for an undergrad student. That was Natalie’s undergrad thesis. Back then it was about using different options under USEARCH to try and get as much coverage as with NCBI’s BLAST (UBLAST was not an option in USEARCH 5; rather, a local alignment search had to be done). We became more ambitious, and decided to test a few more programs. BLAT was something I was already playing with, while an article by Jonathan Eisen (Darling et al., PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2, e243. 2014) pointed me in LAST’s direction (besides reviewers asking for more programs to be tested).

Later on, at some other talk (I think this was a talk by Robert Beiko), the speaker mentioned something about BLAST being too slow for some task, and I asked him why not try UBLAST. He said something to the effect of not knowing how much they might miss.

The articles we published cover one task each. One is the task of finding orthologs as reciprocal best hits. Pretty straightforward: how many orthologs does each program find compared to BLAST? Essentially, finding orthologs as reciprocal best hits does not require finding every possible match. Top matches would be enough. So, if UBLAST, for example, found just a few top matches (under version 5, we could control the number of matches found before the program stops looking), that would be enough to determine the best, and thus figure out reciprocal best hits. We thought we might miss many matches, but still find most of the reciprocal best hits, and that’s what we found to be the case, except between evolutionarily distant genomes (see second reference above).
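For those who like to see the idea in code, here is a minimal sketch of reciprocal best hits. It is not the code we used for the article. It assumes the hits from any of the programs have already been parsed into (query, subject, score) tuples, and the names best_hits and reciprocal_best_hits, as well as the toy protein identifiers, are made up for illustration. Ties are resolved by keeping the first hit seen, which is a simplification.

```python
from typing import Dict, Iterable, Set, Tuple

# One hit: (query id, subject id, score). The score is assumed to be
# "higher is better" (for example, a bit score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its top-scoring subject (first one wins on ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]
) -> Set[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Toy data: genome A proteins searched against genome B, and vice versa.
a_vs_b = [
    ("a1", "b1", 200.0),
    ("a1", "b2", 90.0),
    ("a2", "b2", 150.0),
    ("a3", "b1", 120.0),  # a3 likes b1, but b1 prefers a1
]
b_vs_a = [
    ("b1", "a1", 198.0),
    ("b2", "a2", 149.0),
    ("b2", "a1", 88.0),
]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Note that only the top hit per query matters here, which is why a program that reports just a few top matches can still recover the reciprocal best hits, as long as the true best match is among them.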

For the test on overannotation, the main idea was that for that task we compare proportions, not total number of matches. Thus, if UBLAST, LAST, and BLAT missed potential homologs, but still found equivalent proportions to those found by NCBI’s BLAST, then the programs would work fine for estimating overannotation. Well, that’s what we found.

Finally, why democratic genomics? Well, with tools that can run sequence comparisons in a fraction of the time that BLAST takes, and on a desktop computer at that, comparative genomics at a much larger scale becomes available to most, if not all, bioinformaticians. Why would I care? Well, because the more people can participate, the higher the number of ideas that can make it into the field. Not everybody has access to computer clusters. There are other avenues towards this democracy, like the availability of some precomputed homologies and orthologies. Yet, people will want to do their own tests for many reasons, from doubting the quality of existing data to testing genomes and protein sequences not already available in databases. Maybe there’s also a good chance that genome and protein comparisons will be done via cloud computing, and be quite accessible to mere mortals. Maybe web-based tools like RAST and MG-RAST are good enough for these tasks instead of having our own thing. I don’t know. For now I think that the more options the better. These two articles are not enough. Strategies should also be developed to avoid wasting time and effort comparing sequences. As we develop our ideas and test programs, we will publish our results either in articles, or, if not enough for a publication proper, in blog entries.

Have fun!

-Gabo

CSM 2013

Several members of The Lab of Computational conSequences went to the Canadian Society of Microbiologists conference in Ottawa last week: Lisa, Jenny, Marc, Scott, and honorary members Mike Lynch and Laura (Lisa’s sister). All of them presented posters, Jenny gave her first talk at a scientific conference, and Mike gave a talk that I missed on exploring “the rare biosphere” (your homework is to figure out what that means).

Posters were successful. Marc, who is working on the evolution of regulation of transcription by, ahem, transcription factors, had lots of visitors. The twins (Lisa and Laura) presented work on the gene cluster for cellulose biosynthesis in Bacteria, Jenny talked about 16S rRNA genes, and Scott presented a bit about phage and horizontal gene transfer.

We shall talk about these projects some time soon. We are preparing several articles and will post something about them as they are finished and submitted.

Have fun!

Non-redundant prokaryotic genomes

We just had an Applications Note accepted in Bioinformatics. The little note presents a tool we developed to choose sets of non-redundant prokaryotic genomes (see Research-Genome Clusters too).

The tool derives from previous work where we selected sets of non-redundant prokaryotic genomes filtered at different levels of similarity, for tasks ranging from displaying results on operon predictions to finding the level of redundancy filtering that maximizes the number of high-quality associations predicted by phylogenetic profiles. Other groups have been using our non-redundant sets. Thus, we thought it was better to share them with the wider community, and we developed this tool. If you have suggestions for improvements, please let us know. We cannot promise to implement all suggestions, but we will try to make the tool very useful. Also note that the R-scripts used to produce these datasets are provided (as is). These might help you develop your own datasets if you so require.

-SuperGabo

Computational Genomics and Metagenomics

ArgR

Network of functional interactions for the arginine repressor

Welcome to the web page of the lab of Computational Genomics and Metagenomics, a.k.a. the lab of Computational Microbiology, the lab of Computational Microbiomics, and the lab of Computational Con-Sequences.

We are interested in all things genomic, metagenomic, postgenomic, postmetagenomic, and hyperultramegasupragenomic (!). Our work centres around the evolution and the inference both of function and of functional interactions of gene products, mostly in Prokaryotes.

Everything in this lab is done with computers. Yet, besides working with other computational biologists, we also have collaborations with wet labs.

You might be wondering how this kind of research got started. Well, it all began with the idea that we should stop finding the genes in the human genome, one by one, by laborious and intense work linking phenotypes (what we see) to the very gene, or genes (what we don’t see), responsible for such phenotypes. Not that such work is not valuable. Au contraire, without that work providing us with real-life examples we would not be in any position to make sense of genome sequences. It was probably some kind of a case of impatience [and boldness]. Of course, there is also the tiny detail that knowing our complete genetic complement (a useful definition of “genome”) would provide us with a wider and more accessible basis for the better and faster finding of genes behind phenotypes. Now, replace the word “phenotype” with whatever disease that might involve genetics (or badly gone genetics), such as “cancer,” or “diabetes,” and you might get a better feeling for the importance of this task.

As you might guess, quite well, this ambitious project set the whole machinery in motion. Long story short (but I might try and let you know better later), the technological advancements brought about by the idea of having our beautiful 23+1/2 pairs of chromosomes sequenced allowed scientists to sequence microbes. With those genomes available, before we even had a first draft of our own, other technologies arose, technologies focused on making sense of newly found genes in those genomes. A couple of these are transcriptomics, which started with microarrays, used to find out which genes are expressed by finding their messenger RNAs; and proteomics, used to figure out which proteins are being produced. That, my friends, started the “postgenomic era.” I gave it away, didn’t I? You have thus guessed that “postgenomic,” in the second paragraph, refers to the products of these new technologies, and you are absolutely right.

Well, not content with the human genome (the draft was announced in 2000, yes, ten years ago!), and with powerful sequencing technologies available, another new field arose. The field of environmental genomics, or metagenomics (the word “metagenomics” was used to mean the fishing of genes with particular functions from the environment, before it was used in this context, but let’s not go there). This thing is about sequencing fragments of DNA isolated from an environment (I was going to write “from a given environment,” but I resisted), and then guessing things about the microbes, or whatever, in such an environment. Others refer to “metagenomics” as sequencing without culturing, but let’s not go there either. Well, since now scientists are sequencing mRNA, rather than DNA, we could say that the postmetagenomic era has dawned. Though I haven’t seen this word used in a paper yet. In any event, such a humongous amount of data necessarily calls for computational analyses, and here we are.

Well, hard to guess where all this is going, but the sequencing technologies keep improving and getting cheaper. We cannot but expect the word “deluge” in all of the published papers of the genomic era to become ridiculous by comparison. This means lots of challenges to make sense of the information. Lots of new avenues of research too. This is why I am reserving the word “hyperultramegasupragenomic” (and its “post-” derivative) for later use. The way things are going, it might not be that much later.

With that, welcome again, and enjoy your visit.

-Gabo
