Gain Loss Mapping Engine

This server provides evolutionary analysis of presence and absence profiles (phyletic patterns). It is assumed that the observed phyletic pattern is the result of gain and loss dynamics along a phylogenetic tree. Examples of characters that can be represented by phyletic patterns include restriction sites, gene families, introns, and indels, to name a few. The main purpose of the GLOOME server is to accurately infer branch-specific and site-specific gain and loss events. The novel inference methodology is based on a stochastic mapping approach utilizing models that reliably capture the underlying evolutionary processes. A variety of features are available, including the ability to analyze the data with various evolutionary models, to infer gain and loss events using either stochastic mapping or maximum parsimony, and to estimate gain and loss rates for each character analyzed.

Numerous biological characteristics are coded using binary characters to denote presence ('1') versus absence ('0'). The 0/1 matrix is termed a phylogenetic profile of presence-absence, or phyletic pattern, and is equivalent to a multiple sequence alignment (MSA) in which rows correspond to species and columns correspond to binary characters.
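As an illustration of this layout, the following minimal sketch reads FASTA-formatted 0/1 sequences into a matrix with one row per species and one column per character. The function name `parse_phyletic_fasta` and the species names are hypothetical and are not part of GLOOME.

```python
def parse_phyletic_fasta(text):
    """Parse FASTA-formatted 0/1 sequences into (names, matrix).

    Each row of the matrix is one species; each column is one binary character.
    """
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new species record.
            names.append(line[1:].strip())
            rows.append([])
        else:
            if not rows:
                raise ValueError("sequence data found before the first header")
            if set(line) - {"0", "1"}:
                raise ValueError("only the characters 0 and 1 are allowed")
            rows[-1].extend(int(ch) for ch in line)
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all sequences must have the same length")
    return names, rows


# Hypothetical three-species, four-character phyletic pattern.
example = """>speciesA
1101
>speciesB
0101
>speciesC
0001
"""
names, matrix = parse_phyletic_fasta(example)
first_column = [row[0] for row in matrix]  # presence/absence of character 1
```

Here `first_column` holds the states of the first character across the three species, which is the unit that the per-site analyses operate on.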

Following the development of realistic probabilistic models, the analysis of phyletic pattern data has progressed from parsimony (e.g. Mirkin et al. 2003) to models in which the dynamics of gain and loss are assumed to follow a continuous-time Markov process (Csuros 2006; Hao and Golding 2006).

For the inference of branch-site specific events, the parsimony criterion is still the most commonly used methodology. To overcome possible biases of the parsimony paradigm (Felsenstein 1978; Yang 1996; Pol and Siddall 2001; Swofford et al. 2001) we have recently integrated stochastic mapping approaches (Nielsen 2002; Minin and Suchard 2008) to accurately map gain (0→1) and loss (1→0) events onto each branch of a phylogenetic tree. Such a model-based approach allows accounting for realistic biological phenomena such as variability of gain and loss rates among characters. This approach was recently shown to be robust and accurate for the inference of gene family evolutionary dynamics (Cohen and Pupko 2010).

**Input**

The input to the GLOOME server consists of:

**Minimal:** 0/1 sequences (phyletic pattern). The sequences should be in FASTA format only. Other sequence file formats such as Clustal and Phylip may be converted to FASTA using software such as READSEQ.

__Optional:__ A phylogenetic tree in NEWICK format. If the tree is not provided by the user, it will be estimated from the phyletic pattern using model-based distance estimation and neighbor joining.

GLOOME directs you to a web page called "GLOOME Job Status Page". This web page is automatically updated every 30 seconds, showing messages regarding the different stages of the server activity. When the calculation finishes, several links appear. For an example output page click here.

**MSA Colored according to the probability of events**

This link is the main link for the GLOOME output, which is a projection of the stochastic mapping computation scores of each site onto the MSA, using a color scale. Shades indicate the probability of an event occurring in the specific site and on the branch leading to the specific taxon. A separate color-coded MSA is used for gain events (Figure 1A) and loss events (Figure 1B). In addition, the expected number of events over all branches is plotted below the alignment for each site. This information is also provided textually, with additional files for sums over branches or over sites.

__Parsimony detection of events__

GLOOME also allows the inference of gain and loss events under the parsimony criterion. The relative costs of gain and loss events can be determined by the user. For example, select cost of gain=2 if gain events are twice as costly as loss events.

__Rate per site__

The posterior estimation of the relative rate of each site (overall events) and separate estimation of the gain and loss rates for each site (mixture model only).

__Tree__

The tree and its associated branch lengths estimated from the phyletic pattern. A Java applet is available for tree visualization and manipulation (Figure 2).
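To make the effect of the relative gain and loss costs in the parsimony option concrete, here is a minimal sketch of weighted parsimony for a single binary character, using the generic Sankoff dynamic-programming algorithm. It is not GLOOME's implementation: the nested-tuple tree encoding, the function name `sankoff_min_cost`, and the example data are assumptions made for illustration, and the sketch returns only the minimal total cost rather than the placement of events on branches.

```python
INF = float("inf")


def sankoff_min_cost(tree, states, gain_cost=1.0, loss_cost=1.0):
    """Minimal weighted-parsimony cost of one binary character on a rooted tree.

    tree   : nested tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states : dict mapping each leaf name to 0 (absent) or 1 (present)
    """
    # cost[parent_state][child_state]: 0->1 is a gain, 1->0 is a loss.
    cost = [[0.0, gain_cost], [loss_cost, 0.0]]

    def subtree_scores(node):
        # Returns [best cost if node is 0, best cost if node is 1].
        if isinstance(node, str):
            observed = states[node]
            return [0.0 if s == observed else INF for s in (0, 1)]
        child_scores = [subtree_scores(child) for child in node]
        return [
            sum(min(cost[parent][s] + scores[s] for s in (0, 1))
                for scores in child_scores)
            for parent in (0, 1)
        ]

    return min(subtree_scores(tree))


# Hypothetical example: the character is present only in species A.
tree = (("A", "B"), ("C", "D"))
states = {"A": 1, "B": 0, "C": 0, "D": 0}
equal_costs = sankoff_min_cost(tree, states, gain_cost=1, loss_cost=1)
costly_gain = sankoff_min_cost(tree, states, gain_cost=3, loss_cost=1)
```

With equal costs, a single gain on the branch leading to A is the cheapest explanation. When a gain costs three times as much as a loss, two losses (presence at the root, lost in B and in the ancestor of C and D) become cheaper than one gain.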

The stochastic mapping approach (Nielsen 2002; Minin and Suchard 2008) is used to map gain and loss events onto each branch of a phylogenetic tree. The method is based on a probabilistic framework. Given the evolutionary model, for each branch the expectation and probability of events are computed for each possible scenario of character states at the beginning and the end of the branch (Figure 3).
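The idea of conditioning on the states at both ends of a branch can be sketched with a small simulation. The code below simulates two-state gain/loss histories along a single branch and keeps only those that end in the required state (rejection sampling), then averages the number of gains and losses. This is an illustrative Monte Carlo approximation under assumed rates and is not how GLOOME computes these quantities; the function names and parameter values are hypothetical.

```python
import random


def simulate_branch(start, gain, loss, t, rng):
    """Simulate one two-state history of length t; return (end_state, gains, losses)."""
    state, clock, gains, losses = start, 0.0, 0, 0
    while True:
        rate = gain if state == 0 else loss  # rate of leaving the current state
        if rate <= 0:
            break
        clock += rng.expovariate(rate)  # exponential waiting time to next event
        if clock > t:
            break
        if state == 0:
            state, gains = 1, gains + 1
        else:
            state, losses = 0, losses + 1
    return state, gains, losses


def expected_events(start, end, gain, loss, t, n_samples=20000, seed=0):
    """Estimate E[gains] and E[losses] on a branch given its start and end states."""
    rng = random.Random(seed)
    kept = sum_gains = sum_losses = 0
    for _ in range(n_samples):
        final, gains, losses = simulate_branch(start, gain, loss, t, rng)
        if final == end:  # keep only histories consistent with the end state
            kept += 1
            sum_gains += gains
            sum_losses += losses
    if kept == 0:
        raise ValueError("no simulated history matched the end state")
    return sum_gains / kept, sum_losses / kept


# Hypothetical rates and branch length: absent at the start, present at the end.
exp_gains, exp_losses = expected_events(0, 1, gain=1.0, loss=2.0, t=0.5)
```

For a branch that starts absent and ends present, every consistent history has exactly one more gain than losses, so the expected number of gains is at least one and can exceed it when hidden gain-loss-gain histories are plausible.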

__What evolutionary models are available?__

The available probabilistic models range from simple to more sophisticated ones that may capture the gain and loss dynamics more reliably. There are three options for gain and loss rates: (1) "gain=loss": the probability of a gain event is assumed to be equal to that of a loss event, (2) "fixed gain/loss ratio": gain and loss probabilities may be different but the gain/loss ratio is identical across all sites, (3) "variable gain/loss ratio (mixture)": gain/loss ratio varies among sites.

Simple models assume that a single evolutionary rate characterizes all sites. The more advanced models allow for among-site rate variation, assuming that the rate is either gamma distributed or gamma distributed with an additional invariant rate category.

In stationary processes the character frequencies are equal across the entire tree. GLOOME provides the option "allow the root frequencies to differ from the stationary ones" to analyze the data using non-stationary models.
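For a two-state gain/loss process, the stationary frequencies and the transition probabilities along a branch have a simple closed form, sketched below. The rates used are arbitrary example values, and the function names are hypothetical; this illustrates the standard two-state continuous-time Markov chain, not GLOOME's source code.

```python
import math


def stationary_frequencies(gain, loss):
    """Return (freq_absent, freq_present) for a two-state gain/loss process."""
    total = gain + loss
    if total <= 0:
        raise ValueError("gain + loss must be positive")
    return loss / total, gain / total


def transition_probs(gain, loss, t):
    """Return P where P[i][j] = Pr(state j after time t | state i at time 0)."""
    freq_absent, freq_present = stationary_frequencies(gain, loss)
    decay = math.exp(-(gain + loss) * t)
    p_gain = freq_present * (1.0 - decay)  # 0 -> 1 over the branch
    p_loss = freq_absent * (1.0 - decay)   # 1 -> 0 over the branch
    return [[1.0 - p_gain, p_gain], [p_loss, 1.0 - p_loss]]


# Example: losses three times as frequent as gains.
freqs = stationary_frequencies(1.0, 3.0)
short_branch = transition_probs(1.0, 3.0, 0.1)
long_branch = transition_probs(1.0, 3.0, 100.0)
```

On a very long branch both rows of the matrix approach the stationary frequencies, which is why a stationary model implies the same character frequencies everywhere on the tree. A non-stationary analysis would instead place a separate frequency vector at the root.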

A column of only '0's (the character is absent in all taxa) is usually not observable in phyletic patterns. Maximum-likelihood analyses must be corrected for such unobservable data. We allow several such corrections under the menu "correction for un-observable data". For example, if zero is selected for "Minimum number of ones", the model allows sites with only zeros to appear (i.e., no correction for unobservable data is applied). Select more than one if singletons are also unobservable. If zero is selected for "Minimum number of zeros", the model allows sites with only ones to appear. Select one if variable sites are required (e.g., for indel data).
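One common way to apply such a correction is to condition each site's likelihood on the pattern being observable, that is, to divide it by one minus the model probability of producing an unobservable pattern. The sketch below assumes this conditional-likelihood form; the function name and the numeric values are hypothetical, and the exact correction used by GLOOME may differ in detail.

```python
import math


def corrected_log_likelihood(site_likelihoods, p_unobservable):
    """Sum of per-site log-likelihoods, each conditioned on being observable.

    site_likelihoods : uncorrected likelihood of each observed site pattern
    p_unobservable   : model probability of generating an unobservable pattern
                       (for example, the probability of an all-zero column)
    """
    if not 0.0 <= p_unobservable < 1.0:
        raise ValueError("p_unobservable must lie in [0, 1)")
    log_norm = math.log(1.0 - p_unobservable)
    return sum(math.log(lik) - log_norm for lik in site_likelihoods)


# Hypothetical per-site likelihoods for three observed columns.
site_likelihoods = [0.02, 0.10, 0.05]
uncorrected = corrected_log_likelihood(site_likelihoods, 0.0)
corrected = corrected_log_likelihood(site_likelihoods, 0.5)
```

Setting `p_unobservable` to zero reproduces the uncorrected log-likelihood, which corresponds to allowing all patterns to appear.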

__When to use each methodology and/or model?__

The inference of events was shown to be accurate across a wide range of parameters and models (Cohen and Pupko 2010); thus, for most users the default model and parameters may be used.

For reduced running time the user may change the following, for example: (1) under "Evolutionary model", set "Rate distribution" to "Equal"; (2) under "Advanced", set "Number of categories" to "2"; or (3) set "Optimization level" to "Low" or "Very Low".

For more accurate results the user may change the following, for example: (1) under "Evolutionary model", set "gain & loss rates" to "Variable gain/loss ratio (mixture)"; (2) under "Advanced", set "Number of categories" to "4" or more; or (3) set "Optimization level" to "High" or "Very High".

The selection of "Correction for un-observable data" is required in order to obtain accurate results when analyzing datasets in which some patterns do not exist in the data (i.e. un-observable).

For example, in the analysis of restriction sites and gene families a pattern of only zeros (absent in all taxa) is not observable (Felsenstein 1992). For indel datasets, a pattern of only ones is also unobservable, and therefore the user can select 1 under "Minimum number of zeros".

__Simulation of phyletic patterns__

The simulation software is given an underlying phylogeny and a set of assumptions regarding the evolutionary dynamics of gain and loss events, parameterized as a continuous-time Markov chain. During the simulations, all gain and loss events along each branch for each site are recorded.

In a subsequent step, the resulting phyletic pattern and the evolutionary tree are given as input to both the maximum parsimony and stochastic mapping methods, which infer gain and loss events for each site and for each branch.

The stochastic mapping method assumes an evolutionary model. The model parameters and branch lengths are unknown and are estimated using maximum likelihood from the data.

The simulations can be conducted under several evolutionary scenarios.

The simplest is a naïve scenario with equal gain and loss rates and no rate variability among sites ("Rate equal among sites").
The assumption that gain and loss rates are equal in all sites is relaxed by sampling the loss-gain rate ratio for each site from a uniform distribution.
We simulated several variants in which we progressively introduced biases in the loss-gain rate ratio. Specifically, the loss-gain rate ratio was sampled from a uniform distribution in the interval [0, 2 × ratio]. Thus, when ratio=1, the loss-gain ratio was sampled from the interval [0, 2], and the expectation of the ratio is 1 (the ratio is selectable under the "loss/gain ratio" menu).

Additional scenarios further relaxed the assumption that all sites evolve under the same total gain+loss rate. The rate variability among sites was implemented by sampling from a gamma distribution.

The rate variability may be considered a "second layer" of variability in our implementation. We thus sampled two variables for each site: the loss-gain rate ratio (as before) and the overall evolutionary rate. For all simulations, we set the shape parameter of the gamma distribution to 0.6, which is suited for the rate variability found in gene families across microbial species.

Once the loss-gain ratio and the gain+loss rate (i.e., the total rate) for a site were determined, the gain and loss rates at that site were determined according to the following equations: gain = total rate / (1 + ratio) and loss = total rate - gain.
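The per-site sampling described above can be sketched as follows. The uniform interval, the gamma shape of 0.6, and the two equations come from the text; the function name `sample_site_rates`, the `vary_rate` switch, and the choice of a gamma scale that gives a mean total rate of 1 are assumptions made for this illustration.

```python
import random


def sample_site_rates(ratio, shape=0.6, vary_rate=True, rng=None):
    """Sample (gain, loss) rates for one simulated site.

    ratio     : expected loss-gain rate ratio; the per-site ratio is drawn
                uniformly from [0, 2 * ratio]
    shape     : gamma shape parameter for among-site rate variability
    vary_rate : if False, every site gets a total rate of 1
    """
    rng = rng or random.Random()
    loss_gain_ratio = rng.uniform(0.0, 2.0 * ratio)
    if vary_rate:
        # Assumption: scale = 1 / shape, so the mean total rate equals 1.
        total_rate = rng.gammavariate(shape, 1.0 / shape)
    else:
        total_rate = 1.0
    gain = total_rate / (1.0 + loss_gain_ratio)
    loss = total_rate - gain
    return gain, loss


rng = random.Random(42)
# Five sites with a variable loss-gain ratio but a constant total rate.
fixed_total = [sample_site_rates(1.0, vary_rate=False, rng=rng) for _ in range(5)]
# Five sites with both layers of variability.
variable_total = [sample_site_rates(1.0, rng=rng) for _ in range(5)]
```

Because loss is defined as the remainder of the total rate, gain and loss always sum to the sampled total rate, and loss divided by gain recovers the sampled ratio.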

Additionally, the simulation dynamics can be derived from the input data instead of theoretical distributions. In this case, the gain and loss dynamics are based on the real phyletic pattern supplied by the user. The derived gain and loss dynamics are based either on the stochastic mapping evaluation of events in the input data ("Stochastic mapping estimation of simulated evolutionary rated based on input data") or on maximum parsimony with a cost matrix in which a gain costs twice as much as a loss ("Maximum parsimony estimation of simulated evolutionary rated based on input data").

**References**

- Cohen, O., and T. Pupko. 2010. **Inference and characterization of horizontally transferred gene families using stochastic mapping.** *Mol Biol Evol* 27:703-713.
- Csuros, M. 2006. **On the estimation of intron evolution.** *PLoS Comput Biol* 2:e84.
- Felsenstein, J. 1978. **Cases in which parsimony or compatibility methods will be positively misleading.** *Syst Biol* 27:401-410.
- Felsenstein, J. 1992. **Phylogenies from restriction sites: A maximum-likelihood approach.** *Evolution* 46:159-173.
- Hao, W., and G. B. Golding. 2006. **The fate of laterally transferred genes: life in the fast lane to adaptation or death.** *Genome Res* 16:636-643.
- Minin, V. N., and M. A. Suchard. 2008. **Counting labeled transitions in continuous-time Markov models of evolution.** *J Math Biol* 56:391-412.
- Mirkin, B. G., T. I. Fenner, M. Y. Galperin, and E. V. Koonin. 2003. **Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes.** *BMC Evol Biol* 3:2.
- Nielsen, R. 2002. **Mapping mutations on phylogenies.** *Syst Biol* 51:729-739.
- Pol, D., and M. E. Siddall. 2001. **Biases in maximum likelihood and parsimony: a simulation approach to a 10-taxon case.** *Cladistics* 17:266-281.
- Swofford, D. L., P. J. Waddell, J. P. Huelsenbeck, P. G. Foster, P. O. Lewis, and J. S. Rogers. 2001. **Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods.** *Syst Biol* 50:525-539.
- Yang, Z. 1996. **Phylogenetic analysis using parsimony and likelihood methods.** *J Mol Evol* 42:294-307.