GLOOME: Gain Loss Mapping Engine

GLOOME Overview


This server provides evolutionary analysis of presence and absence profiles (phyletic patterns). It is assumed that the observed phyletic pattern is the result of gain and loss dynamics along a phylogenetic tree. Examples of characters represented by phyletic patterns include restriction sites, gene families, introns, and indels, to name a few. The main purpose of the GLOOME server is to accurately infer branch-specific and site-specific gain and loss events. The novel inference methodology is based on a stochastic mapping approach utilizing models that reliably capture the underlying evolutionary processes. A variety of features are available, including the ability to analyze the data with various evolutionary models, to infer gain and loss events using either stochastic mapping or maximum parsimony, and to estimate gain and loss rates for each character analyzed.

Numerous biological characteristics are coded using binary characters to denote presence ('1') versus absence ('0'). The 0/1 matrix is termed a phylogenetic profile of presence-absence, or phyletic pattern, and is equivalent to a multiple sequence alignment (MSA) in which rows correspond to species and columns correspond to binary characters.

Following the development of realistic probabilistic models, the analysis of phyletic pattern data has progressed from parsimony (e.g. Mirkin et al. 2003) to models in which the dynamics of gain and loss are assumed to follow a continuous-time Markov process (Csuros 2006; Hao and Golding 2006).

For the inference of branch-site specific events, the parsimony criterion is still the most commonly used methodology. To overcome possible biases of the parsimony paradigm (Felsenstein 1978; Yang 1996; Pol and Siddall 2001; Swofford et al. 2001), we have recently integrated stochastic mapping approaches (Nielsen 2002; Minin and Suchard 2008) to accurately map gain (0→1) and loss (1→0) events onto each branch of a phylogenetic tree.

Such a model-based approach allows accounting for realistic biological phenomena such as variability of gain and loss rates among characters. This approach was recently shown to be robust and accurate for the inference of gene family evolutionary dynamics (Cohen and Pupko 2010).

The input to the GLOOME server consists of:

  1. Minimal: 0/1 sequences (phyletic pattern). The sequences should be in FASTA format only. Other sequence file formats such as Clustal and Phylip may be converted to FASTA using software such as READSEQ.
  2. Optional: A phylogenetic tree in NEWICK format. If the tree is not provided by the user, it will be estimated from the phyletic pattern using model based distance estimation and neighbor joining.
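
As an illustration of the minimal input, the sketch below reads a 0/1 FASTA text into one row per species and then views the same data column by column (one column per character). The function name `parse_phyletic_fasta` and the species names are hypothetical, and the sketch assumes that only the characters 0 and 1 appear; it is not the server's own parser.

```python
def parse_phyletic_fasta(text):
    """Parse 0/1 FASTA text into {species_name: [0, 1, ...]}."""
    profiles = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new species row.
            name = line[1:].strip()
            profiles[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            # Assumption of this sketch: only '0' and '1' are allowed.
            if set(line) - {"0", "1"}:
                raise ValueError(f"non-binary character in row {name!r}")
            profiles[name].extend(int(c) for c in line)
    # Like an MSA, every row must have the same number of columns.
    if len({len(row) for row in profiles.values()}) > 1:
        raise ValueError("all rows must have the same number of characters")
    return profiles


example = """>speciesA
1011
>speciesB
0011
>speciesC
1000
"""

profiles = parse_phyletic_fasta(example)
# Transpose rows (species) into columns (characters).
columns = list(zip(*profiles.values()))
```

Here `columns[1]` is a character that is absent in all three species, the kind of pattern discussed below under the correction for un-observable data.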

GLOOME directs you to a web page called "GLOOME Job Status Page". This web page is automatically updated every 30 seconds, showing messages regarding the different stages of the server activity. When the calculation finishes, several links appear. For an example output page click here.

  • MSA Colored according to the probability of events
    This is the main GLOOME output: a projection of the stochastic mapping scores of each site onto the MSA, using a color scale. Shades indicate the probability of an event occurring at the specific site along the branch leading to the specific taxon. Separate color-coded MSAs are used for gain events (Figure 1A) and loss events (Figure 1B). In addition, the expected number of events summed over all branches is plotted below the alignment for each site. This information is also provided as text, with additional files giving the sums over branches or over sites.

      FIGURE 1 Stochastic mapping inference of events. Each character in the phyletic pattern is color coded according to the probability of an event at this site along the branch leading to this species. The size of the bar below each site indicates the sum of expected events over all branches. (A) Projection of gain events. (B) Projection of loss events.

  • Parsimony detection of events
    GLOOME also allows the inference of gain and loss events under the parsimony criterion. The relative costs of gain and loss events can be set by the user. For example, select cost of gain=2 if gain events are to be twice as costly as loss events.

  • Rate per site
    The posterior estimate of the relative rate of each site (overall events), and separate estimates of the gain and loss rates for each site (mixture model only).

  • Tree
    The tree and its associated branch lengths estimated from the phyletic pattern. A Java applet is available for tree visualization and manipulation (Figure 2).

      FIGURE 2 The tree visualized using the Java applet.
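
To illustrate the parsimony option listed above, the following is a minimal sketch of weighted (Sankoff-style) parsimony for a single binary character with user-chosen gain and loss costs. The function name `sankoff_cost`, the nested-tuple tree encoding, and the leaf names are assumptions made for this example, the root state is left unconstrained, and the sketch is not GLOOME's own implementation.

```python
def sankoff_cost(tree, states, gain_cost=1.0, loss_cost=1.0):
    """Return the minimum total cost of explaining one binary character.

    tree   : nested tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states : dict mapping each leaf name to its observed state, 0 or 1
    """
    inf = float("inf")
    # Cost of a change from the parent state to the child state.
    change = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): gain_cost, (1, 0): loss_cost}

    def costs(node):
        # costs(node)[s] is the cheapest cost of the subtree below `node`
        # given that `node` itself is in state s.
        if isinstance(node, str):
            return [0.0 if states[node] == s else inf for s in (0, 1)]
        total = [0.0, 0.0]
        for child in node:
            child_costs = costs(child)
            for s in (0, 1):
                total[s] += min(change[(s, c)] + child_costs[c] for c in (0, 1))
        return total

    # The root state is unconstrained in this sketch.
    return min(costs(tree))


tree = (("A", "B"), ("C", "D"))
states = {"A": 1, "B": 0, "C": 0, "D": 0}

equal_costs = sankoff_cost(tree, states)                    # one gain on A
costly_gain = sankoff_cost(tree, states, gain_cost=3.0)     # two losses win
```

With equal costs the cheapest explanation is a single gain on the branch leading to A. Once a gain costs more than two losses, the cheapest explanation switches to presence at the root followed by two losses, which shows how the cost setting changes the inferred events.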

    What is stochastic mapping?
    The stochastic mapping approach (Nielsen 2002; Minin and Suchard 2008) is used to map gain and loss events onto each branch of a phylogenetic tree. The method is based on a probabilistic framework. Given the evolutionary model, the expectation and probability of events are computed for each branch, for each possible scenario of character states at the beginning and the end of the branch (Figure 3).

      FIGURE 3 Stochastic mapping: a toy example. Shown is the stochastic mapping computation of the posterior expectation of the number of gain events for the branch connecting nodes N1 and N2. The total expectation equals 0.53 and is computed as the weighted sum over four scenarios: N1=0 and N2=0, N1=0 and N2=1, N1=1 and N2=0, and N1=1 and N2=1.
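
The weighted sum described for Figure 3 can be written compactly as follows. The notation is introduced here for illustration only: D denotes the observed data at the site, t the length of the branch connecting N1 and N2, and N_gain the number of gain events on that branch.

```latex
\mathbb{E}[N_{\mathrm{gain}} \mid D]
  = \sum_{a \in \{0,1\}} \sum_{b \in \{0,1\}}
    P(N_1 = a,\, N_2 = b \mid D)\;
    \mathbb{E}[N_{\mathrm{gain}} \mid N_1 = a,\, N_2 = b,\, t]
```

Each of the four terms pairs the posterior probability of one scenario of end states with the expected number of gains along the branch under that scenario.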

    What evolutionary models are available?
    The available probabilistic models range from simple to more sophisticated ones that may capture the gain and loss dynamics more reliably. There are three options for gain and loss rates: (1) "gain=loss": the probability of a gain event is assumed to be equal to that of a loss event, (2) "fixed gain/loss ratio": gain and loss probabilities may be different but the gain/loss ratio is identical across all sites, (3) "variable gain/loss ratio (mixture)": gain/loss ratio varies among sites.
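
For intuition about how gain and loss rates translate into probabilities along a branch, the sketch below uses the standard closed-form transition probabilities of a two-state continuous-time Markov chain (0 = absent, 1 = present). The function name `transition_probs` and the example rates are hypothetical; this is a textbook result shown under those assumptions, not a description of GLOOME's internal code.

```python
import math


def transition_probs(gain, loss, t):
    """Return the 2x2 matrix P(t) for a two-state gain/loss chain.

    P[i][j] is the probability of ending in state j after time t,
    given a start in state i (0 = absent, 1 = present).
    """
    total = gain + loss
    decay = math.exp(-total * t)
    pi1 = gain / total  # stationary frequency of presence
    pi0 = loss / total  # stationary frequency of absence
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],  # from 0: stay absent, gain
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],  # from 1: loss, stay present
    ]


# "gain=loss" style setting: the matrix is symmetric.
symmetric = transition_probs(0.5, 0.5, 1.0)

# Unequal rates: loss is faster than gain.
skewed = transition_probs(0.3, 0.7, 1.2)
```

With equal gain and loss rates the two off-diagonal entries coincide. With unequal rates, long branches approach the stationary frequencies gain/(gain+loss) and loss/(gain+loss).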

    Simple models assume that a single evolutionary rate characterizes all sites. The more advanced models allow for among site rate variation, assuming that the rate is either gamma distributed or gamma distributed with an additional invariant rate category.

    In stationary processes the character frequencies are equal across the entire tree. GLOOME provides the option "allow the root frequencies to differ from the stationary ones" to analyze the data using non-stationary models.

    A column of only '0's (the character is absent in all taxa) is usually not observable in phyletic patterns. Maximum-likelihood analyses must be corrected for such unobservable data. Several such corrections are available under the menu "correction for un-observable data". For example, if zero is selected for "Minimum number of ones", the model allows sites with only zeros to appear (i.e., no correction for unobservable data is applied). Select a value greater than one if singletons are also unobservable. If zero is selected for "Minimum number of zeros", the model allows sites with only ones to appear. Select one if variable sites are required (e.g., for indel data).
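
A common way to express such a correction is to condition the likelihood of each site on the pattern being observable. The formula below is this standard conditional-likelihood form, given as an assumption about the general approach rather than as GLOOME's exact computation; D_i denotes the pattern at site i and θ the model parameters, both introduced here for illustration.

```latex
L_{\mathrm{corrected}}(D_i \mid \theta)
  = \frac{P(D_i \mid \theta)}
         {1 - P(\text{unobservable patterns} \mid \theta)}
```

For instance, if the minimum number of ones is set to 1, the only unobservable pattern is the all-zeros column, so the denominator is one minus the probability of that single pattern.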

    When to use each methodology and/or model?
    The inference of events was shown to be accurate across a wide range of parameters and models (Cohen and Pupko 2010); thus, for most users, the default model and parameters may be used.

    To reduce running time, the user may change, for example: (1) under "Evolutionary model", "Rate distribution" to "Equal"; (2) under "Advanced", "Number of categories" to "2"; or (3) "Optimization level" to "Low" or "Very Low".

    For more accurate results, the user may change, for example: (1) under "Evolutionary model", "gain & loss rates" to "Variable gain/loss ratio (mixture)"; (2) under "Advanced", "Number of categories" to "4" or more; or (3) "Optimization level" to "High" or "Very High".

    The selection of "Correction for un-observable data" is required in order to obtain accurate results when analyzing datasets in which some patterns do not exist in the data (i.e. un-observable).

    For example, in the analysis of restriction sites and gene families, a pattern of only zeros (absent in all taxa) is not observable (Felsenstein 1992). For indel datasets, a pattern of only ones is also unobservable; for such data, "Minimum number of zeros" can be set to 1.

    Simulation of phyletic patterns

    The simulation software is given an underlying phylogeny and a set of assumptions regarding the evolutionary dynamics of gain and loss events, parameterized as a continuous-time Markov chain. During the simulations, all gain and loss events along each branch for each site are recorded.
    In a subsequent step, the resulting phyletic pattern and the evolutionary tree are given as input to both the maximum parsimony and the stochastic mapping methods, which infer gain and loss events for each site and for each branch.
    The stochastic mapping method assumes an evolutionary model. The model parameters and branch lengths are unknown and are estimated using maximum likelihood from the data.
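
The recording of events during simulation can be pictured with the sketch below, which simulates one site along one branch of a continuous-time Markov chain by drawing exponential waiting times and counting every gain and loss. The function name `simulate_branch`, its parameters, and the example rates are hypothetical; this is an illustrative sketch and not the simulation software itself.

```python
import random


def simulate_branch(start_state, gain, loss, length, rng):
    """Simulate one site along one branch.

    Returns (end_state, number_of_gains, number_of_losses).
    State 0 means absent, state 1 means present.
    """
    state, elapsed, gains, losses = start_state, 0.0, 0, 0
    while True:
        # Rate of leaving the current state: gain from 0, loss from 1.
        rate = gain if state == 0 else loss
        if rate <= 0:
            break  # no further event is possible from this state
        elapsed += rng.expovariate(rate)  # waiting time to the next event
        if elapsed > length:
            break  # the next event would fall beyond the end of the branch
        if state == 0:
            state, gains = 1, gains + 1
        else:
            state, losses = 0, losses + 1
    return state, gains, losses


rng = random.Random(1)  # fixed seed so the run is reproducible
end_state, n_gains, n_losses = simulate_branch(0, 1.5, 0.7, 2.0, rng)
```

Because every event is counted, the true number of gains and losses on each branch is known and can later be compared with the events inferred by parsimony or stochastic mapping.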

    The simulations can be conducted under several evolutionary scenarios.
    The starting point is a naïve scenario with equal gain and loss rates and no rate variability among sites ("Rate equal among sites"). The assumption that gain and loss rates are equal in all sites is relaxed by sampling the loss-gain rate ratio for each site from a uniform distribution. We simulated several variants, in which we progressively introduced biases in the loss-gain rate ratio. Specifically, the loss-gain rate ratio was sampled from a uniform distribution in the interval [0, 2 × ratio]. Thus, when ratio=1, the loss-gain ratio is sampled from the interval [0, 2], and the expectation of the ratio is 1 (ratio is selectable under the "loss/gain ratio" menu).

    Additional scenarios further relaxed the assumption that all sites evolve under the same total gain+loss rate. The rate variability among sites was implemented by sampling from a gamma distribution.

    The rate variability may be considered a "second layer" of variability in our implementation. We thus sampled two variables for each site: the loss-gain rate ratio (as before) and the overall evolutionary rate. For all simulations, we set the shape parameter of the gamma distribution to 0.6, which is suited for the rate variability found in gene families across microbial species.

    Once the loss-gain ratio and the gain+loss rate (i.e., the total rate) for a site were determined, the gain and loss rates at that site were computed according to the following equations: gain = total rate / (1 + ratio) and loss = total rate - gain.
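
The per-site sampling described above can be sketched as follows: draw the loss-gain ratio from a uniform distribution on [0, 2 × ratio], draw the total rate from a gamma distribution with shape 0.6, and split the total rate into gain and loss using the two equations. The function names are hypothetical, and the gamma scale (chosen here so that the mean rate is 1) is an assumption of this sketch, since only the shape parameter is stated in the text.

```python
import random


def split_rate(total_rate, loss_gain_ratio):
    """Split a total rate into (gain, loss) so that loss / gain == ratio."""
    gain = total_rate / (1.0 + loss_gain_ratio)
    loss = total_rate - gain
    return gain, loss


def sample_site_rates(ratio, rng, shape=0.6):
    """Sample one site's rates; returns (gain, loss, loss_gain_ratio, total)."""
    # First layer: the loss-gain ratio, uniform on [0, 2 * ratio].
    loss_gain_ratio = rng.uniform(0.0, 2.0 * ratio)
    # Second layer: the total rate, gamma distributed.
    # Assumption: scale = 1 / shape, which gives a mean rate of 1.
    total_rate = rng.gammavariate(shape, 1.0 / shape)
    gain, loss = split_rate(total_rate, loss_gain_ratio)
    return gain, loss, loss_gain_ratio, total_rate


rng = random.Random(42)  # fixed seed so the run is reproducible
sites = [sample_site_rates(1.0, rng) for _ in range(5)]
```

As a quick check of the equations, a total rate of 3 with a loss-gain ratio of 2 gives gain = 1 and loss = 2.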

    Additionally, the simulation dynamics can be derived from the input data instead of from theoretical distributions. In this case, the gain and loss dynamics are based on real data: the phyletic pattern supplied by the user. The derived gain and loss dynamics are based either on the stochastic mapping evaluation of events in the input data ("Stochastic mapping estimation of simulated evolutionary rated based on input data") or on maximum parsimony with a cost matrix in which a gain costs twice as much as a loss ("Maximum parsimony estimation of simulated evolutionary rated based on input data").

