Gain Loss Mapping Engine
This server provides evolutionary analysis of presence and absence profiles (phyletic patterns). It is assumed that the observed phyletic pattern is the result of gain and loss dynamics along a phylogenetic tree. Examples of characters represented by phyletic patterns include restriction sites, gene families, introns, and indels, to name a few. The main purpose of the GLOOME server is to accurately infer branch-specific and site-specific gain and loss events. The novel inference methodology is based on a stochastic mapping approach utilizing models that reliably capture the underlying evolutionary processes. A variety of features are available, including the ability to analyze the data with various evolutionary models, to infer gain and loss events using either stochastic mapping or maximum parsimony, and to estimate gain and loss rates for each character analyzed.
Following the development of realistic probabilistic models, the analysis of phyletic pattern data has progressed from parsimony (e.g. Mirkin et al. 2003) to models in which the dynamics of gain and loss are assumed to follow a continuous-time Markov process (Csuros 2006; Hao and Golding 2006). For the inference of branch-site specific events, the parsimony criterion is still the most commonly used methodology. To overcome possible biases of the parsimony paradigm (Felsenstein 1978; Yang 1996; Pol and Siddall 2001; Swofford et al. 2001) we have recently integrated stochastic mapping approaches (Nielsen 2002; Minin and Suchard 2008) to accurately map gain (0→1) and loss (1→0) events onto each branch of a phylogenetic tree.
Such a model-based approach allows accounting for realistic biological phenomena such as variability of gain and loss rates among characters. This approach was recently shown to be robust and accurate for the inference of gene family evolutionary dynamics (Cohen and Pupko 2010).
The input to the GLOOME server consists of:
What evolutionary models are available?
The available probabilistic models range from simple to more sophisticated ones that may capture the gain and loss dynamics more reliably. There are three options for gain and loss rates: (1) "gain=loss": the probability of a gain event is assumed to be equal to that of a loss event, (2) "fixed gain/loss ratio": gain and loss probabilities may be different but the gain/loss ratio is identical across all sites, (3) "variable gain/loss ratio (mixture)": gain/loss ratio varies among sites.
Simple models assume that a single evolutionary rate characterizes all sites. The more advanced models allow for among-site rate variation, assuming that the rate is either gamma distributed or gamma distributed with an additional invariant rate category.
In stationary processes the character frequencies are equal across the entire tree. GLOOME provides the option "allow the root frequencies to differ from the stationary ones" to analyze the data using non-stationary models.
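As an illustration of the stationary case, the following minimal sketch computes transition probabilities and stationary frequencies for a generic two-state (absent/present) continuous-time Markov process. The function names and the parameterization by a single gain rate and a single loss rate are assumptions made for this example; this is not GLOOME's own code.

```python
import math

def stationary_freqs(gain, loss):
    """Stationary frequencies (absence, presence) of a two-state process.

    Assumes a rate matrix Q = [[-gain, gain], [loss, -loss]] (illustrative).
    """
    total = gain + loss
    return (loss / total, gain / total)

def transition_probs(gain, loss, t):
    """Return the 2x2 matrix P(t), where P[i][j] = Pr(state j at time t | state i at 0)."""
    total = gain + loss
    decay = math.exp(-total * t)
    pi0, pi1 = stationary_freqs(gain, loss)
    p01 = pi1 * (1.0 - decay)  # probability of ending in 1 when starting in 0
    p10 = pi0 * (1.0 - decay)  # probability of ending in 0 when starting in 1
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

# Hypothetical rates: loss is three times as frequent as gain.
short_branch = transition_probs(0.5, 1.5, 0.1)
long_branch = transition_probs(0.5, 1.5, 100.0)
print(short_branch)
print(long_branch)  # rows approach the stationary frequencies on long branches
```

Under a stationary model the root frequencies equal the values returned by `stationary_freqs`; a non-stationary analysis would instead treat the root frequencies as separate parameters.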
A column of only '0's (the character is absent in all taxa) is usually not observable in phyletic patterns. Maximum-likelihood analyses must be corrected for such unobservable data. We allow several such corrections under the menu "correction for un-observable data". For example, if zero is selected for "Minimum number of ones", the model allows sites with only zeros to appear (that is, unobservable data are not accounted for). Select more than one if singletons are also unobservable. If zero is selected for "Minimum number of zeros", the model allows sites with only ones to appear. Select one if variable sites are required (e.g., for indel data).
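A common way to apply such a correction is to condition each pattern's probability on the pattern being observable. The sketch below illustrates the idea on a deliberately simplified star tree with equal branch lengths and stationary root frequencies. The tree shape, the function names, and the `min_ones` parameter (mirroring the "Minimum number of ones" option) are assumptions for illustration and do not reproduce GLOOME's implementation.

```python
import math
from itertools import product

def transition_probs(gain, loss, t):
    """2x2 transition matrix of a two-state gain/loss process (illustrative)."""
    total = gain + loss
    decay = math.exp(-total * t)
    p01 = (gain / total) * (1.0 - decay)
    p10 = (loss / total) * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

def pattern_prob(pattern, gain, loss, t):
    """Uncorrected probability of a 0/1 pattern on a star tree (assumed topology)."""
    total = gain + loss
    root_freqs = (loss / total, gain / total)  # stationary root frequencies
    p = transition_probs(gain, loss, t)
    prob = 0.0
    for root_state in (0, 1):
        term = root_freqs[root_state]
        for leaf_state in pattern:
            term *= p[root_state][leaf_state]
        prob += term
    return prob

def corrected_prob(pattern, gain, loss, t, min_ones=1):
    """Probability of a pattern, conditioned on it having at least min_ones ones."""
    if sum(pattern) < min_ones:
        raise ValueError("pattern is unobservable under this correction")
    n = len(pattern)
    p_unobservable = sum(
        pattern_prob(q, gain, loss, t)
        for q in product((0, 1), repeat=n)
        if sum(q) < min_ones
    )
    return pattern_prob(pattern, gain, loss, t) / (1.0 - p_unobservable)

# Hypothetical rates and branch length for a three-taxon pattern.
print(pattern_prob((1, 0, 0), 0.5, 1.5, 0.2))
print(corrected_prob((1, 0, 0), 0.5, 1.5, 0.2, min_ones=1))
```

With `min_ones=0` the denominator equals one and no correction is applied. Enumerating all patterns is only feasible for very few taxa; it is used here to keep the sketch short.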
When to use each methodology and/or model?
The inference of events was shown to be accurate across a wide range of parameters and models (Cohen and Pupko 2010); thus, for most users the default model and parameters may be used.
To reduce running time, the user may change the following, for example: (1) under "Evolutionary model", set "Rate distribution" to "Equal"; (2) under "Advanced", set "Number of categories" to "2"; (3) set "Optimization level" to "Low" or "Very Low".
For more accurate results, the user may change the following, for example: (1) under "Evolutionary model", set "gain & loss rates" to "Variable gain/loss ratio (mixture)"; (2) under "Advanced", set "Number of categories" to "4" or more; (3) set "Optimization level" to "High" or "Very High".
The selection of "Correction for un-observable data" is required in order to obtain accurate results when analyzing datasets in which some patterns do not exist in the data (i.e. un-observable).
For example, in the analysis of restriction sites and gene families, a pattern of only zeros (absent in all taxa) is not observable (Felsenstein 1992). For indel datasets, a pattern of only ones is also unobservable; for such data, select 1 under "Minimum number of zeros".
Simulation of phyletic patterns
The simulation software is given an underlying phylogeny and a set of assumptions regarding the evolutionary dynamics of gain and loss events, parameterized as a continuous-time Markov chain. During the simulations, all gain and loss events along each branch for each site are recorded.
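The recording of events along a single branch can be sketched as follows, using exponentially distributed waiting times between state changes. This is a minimal illustration for one site and one branch; the function name, the arguments, and the event representation are hypothetical and are not taken from the GLOOME simulator.

```python
import random

def simulate_branch(start_state, gain, loss, branch_length, rng):
    """Simulate a two-state gain/loss process along one branch.

    Returns the state at the end of the branch and a list of
    (time, event_type) tuples for every event that occurred.
    """
    state = start_state
    time = 0.0
    events = []
    while True:
        # From state 0 the only possible event is a gain; from state 1, a loss.
        rate = gain if state == 0 else loss
        if rate <= 0.0:
            break  # no further events are possible from this state
        time += rng.expovariate(rate)
        if time >= branch_length:
            break  # the next event would fall beyond the end of the branch
        state = 1 - state
        events.append((time, "gain" if state == 1 else "loss"))
    return state, events

rng = random.Random(42)  # fixed seed so that runs are reproducible
end_state, events = simulate_branch(0, gain=1.0, loss=2.0, branch_length=3.0, rng=rng)
print(end_state, events)
```

Repeating this for every branch of the tree, starting each branch from the end state of its parent branch, yields both a phyletic pattern at the leaves and a full record of the true events.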
In a subsequent step, the resulting phyletic pattern and the evolutionary tree are given as input to both the maximum parsimony and stochastic mapping methods, which infer gain and loss events for each site and for each branch.
The stochastic mapping method assumes an evolutionary model. The model parameters and branch lengths are unknown and are estimated using maximum likelihood from the data.
The simulations can be conducted under several evolutionary scenarios.
The starting point is a naïve scenario with equal gain and loss rates and no rate variability among different sites ("Rate equal among sites"). The assumption that gain and loss rates are equal in all sites is relaxed by sampling, for each site, the loss-gain rate ratio from a uniform distribution. We simulated several variants in which we progressively introduced biases in the loss-gain rate ratio. Specifically, the loss-gain rate ratio was sampled from a uniform distribution in the interval [0, 2 × ratio]. Thus, when ratio=1, the loss-gain ratio was sampled from the interval [0, 2], and the expectation of the ratio is 1 (ratio is selectable under the "loss/gain ratio" menu).
Additional scenarios further relaxed the assumption that all sites evolve under the same total gain+loss rate. The rate variability among sites was implemented by sampling from a gamma distribution.
The rate variability may be considered a "second layer" of variability in our implementation. We thus sampled two variables for each site: the loss-gain rate ratio (as before) and the overall evolutionary rate. For all simulations, we set the shape parameter of the gamma distribution to 0.6, which is suited for the rate variability found in gene families across microbial species.
Once the loss-gain ratio and the gain+loss rate (i.e., the total rate) for a site were determined, the gain and loss rates at that site were computed according to the following equations: gain = total rate / (1 + ratio) and loss = total rate - gain. These equations guarantee that gain + loss equals the total rate and that loss / gain equals the sampled ratio.
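The per-site sampling described above can be sketched as follows. The uniform interval, the gamma shape parameter of 0.6, and the two equations come from the text; the gamma scale (chosen here so that the mean total rate is 1), the function name, and the `vary_rate` switch are assumptions made for this example.

```python
import random

def sample_site_rates(ratio, rng, alpha=0.6, vary_rate=True):
    """Sample gain and loss rates for one site.

    ratio     -- expected loss/gain ratio; the per-site ratio is uniform on [0, 2 * ratio]
    alpha     -- shape of the gamma distribution for the total rate
    vary_rate -- if False, every site gets a total rate of 1 (no rate variability)
    """
    loss_gain_ratio = rng.uniform(0.0, 2.0 * ratio)
    if vary_rate:
        # Assumption: scale = 1 / alpha, so the mean total rate is 1.
        total_rate = rng.gammavariate(alpha, 1.0 / alpha)
    else:
        total_rate = 1.0
    gain = total_rate / (1.0 + loss_gain_ratio)
    loss = total_rate - gain
    return gain, loss

rng = random.Random(7)  # fixed seed so that runs are reproducible
for _ in range(3):
    gain, loss = sample_site_rates(ratio=1.0, rng=rng)
    print(f"gain={gain:.4f} loss={loss:.4f} loss/gain={loss / gain:.4f}")
```

Because the ratio and the total rate are drawn independently, the two layers of variability can be switched on or off separately, which matches the progression of scenarios described in the text.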
Additionally, the simulation dynamics can be derived from the input data instead of theoretical distributions. In this case, the gain and loss dynamics are based on real data: the phyletic pattern supplied by the user. The derived gain and loss dynamics are based either on the stochastic mapping evaluation of events in the input data ("Stochastic mapping estimation of simulated evolutionary rated based on input data") or on maximum parsimony with a cost matrix in which a gain costs twice as much as a loss ("Maximum parsimony estimation of simulated evolutionary rated based on input data").
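To illustrate what a parsimony cost matrix with gain costing twice as much as loss implies, the sketch below computes a weighted parsimony score for a single site with a generic Sankoff-style recursion. The tree encoding (nested tuples with leaf names as strings), the function names, and the absolute cost values of 2 and 1 are assumptions for this example; it is not the server's parsimony code.

```python
GAIN_COST = 2.0  # cost of a 0 -> 1 change (assumed absolute value)
LOSS_COST = 1.0  # cost of a 1 -> 0 change (assumed absolute value)
COST = [[0.0, GAIN_COST], [LOSS_COST, 0.0]]  # COST[parent_state][child_state]

def subtree_costs(node, pattern):
    """Return [minimal cost if node is in state 0, minimal cost if in state 1]."""
    if isinstance(node, str):
        # Leaf: the observed state is free, the other state is impossible.
        observed = pattern[node]
        return [0.0 if state == observed else float("inf") for state in (0, 1)]
    child_costs = [subtree_costs(child, pattern) for child in node]
    return [
        sum(min(COST[state][c] + costs[c] for c in (0, 1)) for costs in child_costs)
        for state in (0, 1)
    ]

def parsimony_score(tree, pattern):
    """Minimal total cost of gains and losses explaining the pattern on the tree."""
    return min(subtree_costs(tree, pattern))

# Hypothetical four-taxon tree and phyletic pattern for one site.
tree = (("A", "B"), ("C", "D"))
pattern = {"A": 1, "B": 1, "C": 0, "D": 0}
print(parsimony_score(tree, pattern))  # 1.0: a single loss is cheaper than a single gain
```

In this example the pattern can be explained either by one gain on the branch leading to A and B (cost 2) or by one loss on the branch leading to C and D (cost 1), so the weighted criterion prefers the loss.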