Gene content homoplasy

We just published an article, in collaboration with the laboratory of Gabriela Olmedo-Alvarez, trying to solve the phylogeny of Bacillus, which started with a special focus on Bacillus isolated from aquatic environments. The work required solving several little problems just to get those phylogenies, choosing the appropriate genes/proteins for the analyses, thinking of distance measures, etc. All of which I’ll be delighted to write about in later posts. For now, I’ll concentrate on the main finding.

In a previous article, also in collaboration with Gabriela, we had presented several phylogenies all showing that aquatic Bacillus clustered together into a single clade. Given the continuous growth of the public genome databases (see the previous two posts for example), and that Gabriela’s lab continues sequencing aquatic and other interesting Bacillus, we wondered whether the aquatic group would stand up to this avalanche of new data. So there we were, choosing Bacillus and checking the data about their places of isolation. As in the previous work, we built a phylogenetic tree based on the 16S rRNA genes, a second one based on the proteins encoded by genes present in all of the genomes under analysis (a core tree), a third tree based on marker genes for phylogenomics published by Jonathan Eisen’s group, and a hierarchical cluster based on the [dis]similarity of the proteins shared between each pair of genomes, a measure that we call the Genomic Similarity Score (GSS).

The phylogenetic trees showed a clade of aquatic Bacillus, but several other aquatic Bacillus landed in other clades, thus breaking the pattern previously found. However, the GSS analysis placed the aquatic Bacillus closer together than any of the phylogenetic trees. We were surprised because, in the previous work, the GSS cluster reflected the results of the phylogenetic trees. We therefore started looking for an explanation for this discrepancy.

The phylogenetic trees are restricted to using genes or proteins shared by all the genomes under analysis, while the GSS is not limited, as it uses the similarity of all of the proteins encoded by the genes shared by each pair of genomes. Thus, we thought that there might be more genes shared between organisms of similar environments, than would be expected from their different vertical origins. After all, it is not rare for Bacteria to receive genes via horizontal gene transfer (HGT).

To test for this possibility, we ran analyses based on gene content, as reflected by the classification of the encoded proteins into protein families, and compared such content across organisms. We produced clusters based on gene content and, again, aquatic Bacillus clustered better than in the phylogenetic trees. Further analyses showed some genes prevailing in groups from each environment. Most of these “environmentally-related” genes were found in strains isolated from soil, though every group had some interesting genes for future studies. Among them we found genes that previous works had linked to the same environments where we found them to be enriched.
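To make the idea of comparing gene content a bit more concrete, here is a minimal sketch. It assumes each genome is reduced to the set of protein families it encodes, and it uses the Jaccard distance, which is my choice for illustration and not necessarily the measure used in the article. The genome names and family labels are made up.

```python
from itertools import combinations

# Hypothetical data: genome name -> set of protein families it encodes.
gene_content = {
    "aquatic_A": {"fam1", "fam2", "fam3", "fam7"},
    "aquatic_B": {"fam1", "fam2", "fam3", "fam8"},
    "soil_C": {"fam1", "fam4", "fam5", "fam6"},
}


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; 0 means identical gene content."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(content):
    """Compute the distance for every pair of genomes (names sorted)."""
    return {
        (g1, g2): jaccard_distance(content[g1], content[g2])
        for g1, g2 in combinations(sorted(content), 2)
    }


distances = pairwise_distances(gene_content)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

A distance table like this one could then be handed to any hierarchical clustering routine. In this toy example the two aquatic genomes end up closer to each other than either is to the soil genome, simply because they share more families.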

We call this apparent tendency to share more genes than expected from vertical inheritance, perhaps due to environmental constraints, gene content homoplasy.

Main reference:

  1. Hernández-González IL, Moreno-Hagelsieb G, Olmedo-Álvarez G (2018) Environmentally-driven gene content convergence and the Bacillus phylogeny. BMC Evol Biol 18: 148.

OK, more updates 2017

Well, the last post was about updates, but it was more about 2016 than 2017. Here are a couple of graphs for your delight. One before filtering, the other after, with the counts of prokaryotic genomes in each NCBI category as of July 2017.

The four categories are: Complete, Chromosome, Scaffold, and Contig. My filtering used to remove genomes with redundant TaxIDs, but I learned that filtering by TaxID wasn’t a good idea. Now I filter only by strain, substrain, etc., as provided by the NCBI list of features. Not perfect, but I seem to keep most genomes.
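For the curious, filtering by strain can be sketched roughly as below. This is not my actual script. The field names (organism_name, infraspecific_name) are assumptions modeled on the columns of NCBI's assembly summary files, the records are invented, and keeping the first genome seen per strain is just the simplest possible rule.

```python
# Hypothetical records, one per genome assembly.
records = [
    {"id": "asm1", "organism_name": "Bacillus sp.", "infraspecific_name": "strain=X1"},
    {"id": "asm2", "organism_name": "Bacillus sp.", "infraspecific_name": "strain=X1"},
    {"id": "asm3", "organism_name": "Bacillus sp.", "infraspecific_name": "strain=X2"},
    {"id": "asm4", "organism_name": "Bacillus sp.", "infraspecific_name": ""},
]


def strain_key(record):
    """Build a comparison key from the organism name and the strain label."""
    return (
        record["organism_name"].strip().lower(),
        record["infraspecific_name"].strip().lower(),
    )


def keep_one_per_strain(genomes):
    """Keep the first genome seen for each organism/strain combination.

    Records without a strain label are always kept, because there is
    nothing to judge their redundancy by.
    """
    seen = set()
    kept = []
    for record in genomes:
        key = strain_key(record)
        if key[1] and key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


kept = keep_one_per_strain(records)
print([record["id"] for record in kept])
```

A real filter would probably prefer the most complete assembly over the first one seen, but the keying idea stays the same.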

Enjoy.

Updates 2017

You might already know, but if you didn’t, NCBI changed the organization of its genome database. They used to have a BACTERIA directory containing all the complete genomes (with a few caveats), and a DRAFT_BACTERIA containing, well, draft genomes. Today, the genomes are scattered and organized somewhat taxonomically, so you have to look at some files to figure out if the genomes are drafty or not so drafty. Now they have four categories: Complete, Chromosome, Scaffold, and Contig. I think that’s the order of completeness, though I’m still not sure how Chromosome differs from Complete, but I suspect that’s what used to be the caveats (maybe only one replicon, of many, was sequenced). Anyway, last December I finished some BLASTP comparisons of a Complete genomes dataset that I downloaded by August (2016). The dataset contains 4085 complete prokaryotic genomes (I eliminated genomes from the same strains or the same taxid). Updates are thus starting to appear in the data I offer through this web site and my server at Laurier. Check frequently if you need newer data than what you found previously.

Figure: Growth of genome data at NCBI

Happy new year!

Undergrad theses!

This term I have three students working on their undergrad theses, plus one working on directed studies. I am very proud of these students. Lots of initiative, reading articles, trying the computer (except for one, they hadn’t worked under unix before!), now having lots of success running their commands, and looking at results!

What are they doing? Two of them are working with protein domains in transporter proteins (from the TCDB), one on sorting prokaryotic genomes into taxonomically-coherent groups, one more on the divergence of orthologs and paralogs.

-Gabo

Half Sabbatical 2015!

I spent four great months working with Milton Saier at UCSD. Milton built a very useful database on transporter proteins, the Transporter Classification Database (TCDB), and his lab has developed several pieces of software to explore and analyze the database, looking for such things as homologs that have diverged beyond the limits of detection by common sequence comparison tools. It was my privilege and honor to help Milton’s lab update and improve some of these tools, and develop a couple of new ones. The tasks also gave me a lesson about sharing software, no matter how complex or simple.

In any event, I’m still working on some specific projects that we started during my visit, and feel full of new ideas, for example about detection of protein domains. I expect that these ideas will complement work that’s been going on in my lab on assignment of functions to homologs with highly divergent sequences.

In short, this was a sabbatical as they should be. I learned a lot and got inspiration for new projects that I would have never thought about before this visit.

-Gabo

The whole 2015 Spring/Summer group

Here is the whole group in the lab of Computational conSequences during the Spring/Summer of 2015. I’d say that this is the best group ever.

Gustavo leaves today, going back to Michoacán, Mexico after spending his sabbatical here. Julie left a few weeks ago, also back to Michoacán. She might come back for the Spring/Summer 2016.

The only locals are Brigitte, Kissa, Thomas, César and me. Brigitte and Kissa are honorary members who have been in the lab for collaborative reasons, but work for their M.Sc. degrees with other faculty members at Laurier (Michael Suits and Geoff Horsman, respectively).

We’ve been working on phages, plant-growth promoting bacteria, 16S rRNA gene analyses, metabolic annotations, gene neighborhoods, predicting gene functions, and predicting metabolism and transcriptional regulation networks. Lots of fun.

Summer 2015 group

This is [most of] the group in the lab of Computational conSequences this summer. Several visitors from Mexico! Julie, Gustavo, and Ramiro from Michoacán, and Adrián from Mexico City.

What are we doing?

In no particular order:

  • Julie is working with 16S rRNA genes
  • Gustavo is on sabbatical doing all kinds of reviews and such on plant growth promoting bacteria
  • Ramiro is working on the genome of a plant growth promoting bacterium
  • Adrián is working on Phage
  • Kissa is working with adjacent genes (gene neighborhoods)
  • Harold is working on genome annotations
  • Thomas is working on predicted functions (metabolism and such)
  • César is working on regulatory networks in prokaryotes, and on metagenome annotations

What’s true for E. coli is true of an elephant

The quote by Jacques Monod in the title celebrates our recent publication of an article suggesting that our previous results in Escherichia coli hold true for most other prokaryotes:

  • del Grande, M., & Moreno-Hagelsieb, G. (2014). The loose evolutionary relationships between transcription factors and other gene products across prokaryotes. BMC Research Notes, 7, 928. doi:10.1186/1756-0500-7-928

This article expands on the part about transcription factors presented in our previous study comparing the conservation of different kinds of functional associations:

  • Moreno-Hagelsieb, G., & Jokic, P. (2012). The evolutionary dynamics of functional modules and the extraordinary plasticity of regulons: the Escherichia coli perspective. Nucleic Acids Research, 40(15), 7104–7112. doi:10.1093/nar/gks443

The earlier article dealt with several experimentally-confirmed functional interactions determined in Escherichia coli: genes in operons, genes whose products physically interact, genes regulated by the same transcription factor (regulons), and genes coding for transcription factors and their regulated genes. In that study we found that the associations involving transcription factors tend to be much less conserved than any of the other associations studied. Our work is not the first to suggest this lack of conservation, but is the first to compare conservation across different kinds of associations, and thus show that those mediated by transcriptional regulation are the least conserved.

The most recent article expanded on the association between genes coding for transcription factors and other genes. The idea was to extend the study to as many other prokaryotes as possible. But how could we determine conservation between genes coding for transcription factors and other genes without experimentally-determined interactions? We knew that at least some transcription factors could be predicted from their possessing a DNA binding domain. But what about their associations? Our prior experience has been that target genes are hard to predict even when there’s information on some characterized binding sites (sites that we like calling operators for tradition’s sake). So what to do if we have only the transcription factors? Well, to answer that we should first explain how we measured relative evolutionary conservation.

To measure evolutionary conservation we used a measure of co-occurrence called mutual information. For any two genes, the higher the mutual information, the less the observed co-occurrence looks random. Since we obtained mutual information scores for all gene pairs in the genomes we analyzed, we decided that instead of something as hard as predicting operators, and matching them to predicted transcription factors, we could use top scoring gene pairs as representatives of the most conserved interaction between our predicted transcription factors and anything else. This allowed us to compare the most conserved interactions involving transcription factors against the conservation of other interactions. Our findings suggest that interactions involving transcription factors evolve quickly in most-if-not-all of the genomes analyzed.
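As a rough illustration of the measure, here is the textbook mutual information between two presence/absence profiles, where each position in a profile says whether a gene has an ortholog in a given genome (1) or not (0). The profiles are invented, and the articles may differ in details such as corrections or normalization, so take this as a sketch of the idea only.

```python
from collections import Counter
from math import log2


def mutual_information(profile_a, profile_b):
    """Mutual information, in bits, between two presence/absence profiles.

    Each profile holds one value per genome: 1 if the gene is present,
    0 if it is absent. Both profiles must cover the same genomes in the
    same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same length")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))
    count_a = Counter(profile_a)
    count_b = Counter(profile_b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical profiles across four genomes.
gene_x = [1, 1, 0, 0]
gene_y = [1, 1, 0, 0]  # always co-occurs with gene_x
gene_z = [1, 0, 1, 0]  # occurs independently of gene_x

print(mutual_information(gene_x, gene_y))
print(mutual_information(gene_x, gene_z))
```

Genes that appear and disappear together across genomes get high scores, while genes whose presence tells nothing about each other score around zero.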

Please read the articles for more details and information.

-Gabo

Democratic genomics

We had two articles recently published:

  1. G. Moreno-Hagelsieb, B. Hudy-Yuffa, Estimating overannotation across prokaryotic genomes using BLAST+, UBLAST, LAST and BLAT. BMC Res Notes 7, 651 (2014).
  2. N. Ward, G. Moreno-Hagelsieb, Quickly Finding Orthologs as Reciprocal Best Hits with BLAT, LAST, and UBLAST: How Much Do We Miss? PLoS ONE 9, e101850 (2014).

The story goes as follows. At a talk by some group I heard that they were using UBLAST to quickly find members of some protein families rather than use a Hidden Markov Model approach. They said it was much faster, so I became curious. I downloaded USEARCH 5 back then to try and test it for the things I commonly do with NCBI’s BLAST. I was surprised at how fast this program ran. In any event, I thought that testing this program for some task would be a good project for an undergrad student. That became Natalie’s undergrad thesis. Back then it was about using different options under USEARCH to try and get as much coverage with UBLAST as with NCBI’s BLAST (UBLAST was not an option in USEARCH 5; rather, a local alignment search had to be done). We became more ambitious, and decided to test a few more programs. BLAT was something I was already playing with, while an article by Jonathan Eisen (Darling et al., PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2, e243. 2014) pointed me in LAST’s direction (besides reviewers asking for more programs to be tested).

Later on, at some other talk (I think it was a talk by Robert Beiko), the speaker mentioned something about BLAST being too slow for some task, and I asked him why not try UBLAST. He said something to the effect of not knowing how much they might miss.

The articles we published cover one task each. One is the task of finding orthologs as reciprocal best hits. Pretty straightforward: how many orthologs are found by each program when compared to BLAST? Essentially, finding orthologs as reciprocal best hits does not require finding every possible match. Top matches would be enough. So, if UBLAST, for example, found just a few top matches (under version 5, we could control the number of matches found before the program stops looking), that would be enough to determine the best, and thus figure out reciprocal best hits. We thought we might miss many matches, but still find most of the reciprocal best hits, and that’s what we found to be the case except between evolutionarily distant genomes (see second reference above).
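For those who haven't seen it, the reciprocal best hits logic can be sketched in a few lines. This assumes the hits have already been parsed into (query, subject, bit score) tuples from whichever program produced them. The protein names and scores are invented, and ties are ignored (the first top-scoring hit wins), which a serious implementation should handle more carefully.

```python
def best_hits(hits):
    """Map each query to its top-scoring subject.

    hits is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hits between the proteins of genome A and genome B.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 90), ("a2", "b2", 150), ("a3", "b1", 80)]
b_vs_a = [("b1", "a1", 198), ("b2", "a2", 149), ("b2", "a1", 95)]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

Note that only the top hit per query matters here, which is why a program that stops after a few good matches can still recover most of the pairs.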

For the test on overannotation, the main idea was that for that task we compare proportions, not total number of matches. Thus, if UBLAST, LAST, and BLAT missed potential homologs, but still found equivalent proportions to those found by NCBI’s BLAST, then the programs would work fine for estimating overannotation. Well, that’s what we found.

Finally, why democratic genomics? Well, with tools that can run sequence comparisons in a fraction of the time that BLAST takes, and on a desktop computer at that, comparative genomics at a much larger scale becomes available to most if not all bioinformaticians. Why would I care? Well, because the more people can participate, the higher the number of ideas that can make it into the field. Not everybody has access to computer clusters. There are other avenues towards this democracy, like the availability of some precomputed homologies and orthologies. Yet, people will want to do their own tests for many reasons, from doubting the quality of existing data to testing genomes and protein sequences not already available in databases. Maybe there’s also a good chance that genome and protein comparisons will be done via cloud computing, and be quite accessible to mere mortals. Maybe web-based tools like RAST and MG-RAST are good enough for these tasks instead of having our own thing. I don’t know. For now I think that the more options the better. These two articles are not enough. Strategies should also be developed to avoid wasting time and effort comparing sequences. As we develop our ideas and test programs, we will publish our results either in articles, or, if not enough for a publication proper, in blog entries.

Have fun!

-Gabo

The Latinamerican bioinformatics force

The Latin-American conSequences force

Since Julie was leaving on Saturday, those present in the lab last Thursday had lunch together.

Julie is a PhD student co-supervised by me and Dr. Santoyo. She came from Mexico for a few months to learn some bioinformatics that she will apply to her PhD project on the rhizospheric microbiome associated with a few crops.

See ya later Julie!
