HFV Ebola sequence database

Hypermut Explanation

What is hypermutation?

A retroviral provirus is dubbed a "hypermutant" if it undergoes an inordinate number of identical transitions (usually Guanine --> Adenine). Hypermutation usually results in the production of replication-incompetent virus due to the introduction of new stop codons. Identifying hypermutants in a patient's viral population can be critical when reconstructing viral phylogenies (to assess the effects of drug therapy, immune surveillance, etc.). The apparent rate of viral evolution can be dramatically exaggerated by hypermutant sequences, when in actuality these viruses are evolutionary dead ends; their profound divergence is an artifact of a single aberrant round of replication.

For an older review of this phenomenon, please read Simon Wain-Hobson's article, Retroviral G --> A Hypermutation.

Hypermutation is thought to be caused by a host cellular defense mechanism that induces mutations in reverse transcribed nascent retroviral DNA. Host lymphocytes express 2 proteins of the apolipoprotein B mRNA editing complex family, APOBEC3F and APOBEC3G. These enzymes have slightly different substrate specificities, but both produce G->A transitions. The HIV Vif protein blocks this process. When Vif is defective, hypermutation is the result. For review, see:

Tip: For a simple "quick and dirty" scan for hypermutants, we recommend the Quality Control Tool. The QC tool includes a simplified version of Hypermut that uses the M group consensus sequence as the reference, so you don't need to provide a reference. The results may be somewhat less accurate, but the tool is easy to run and may help you detect other common problems, in addition to hypermutation.

Box 1: Sequence Input

The Hypermut Program takes a nucleotide alignment and documents the nature and context of nucleotide substitutions in a sequence population relative to a reference sequence.

To run the program successfully (1) select the alignment format you are using to present your sequences. (2) Input your sequence alignment file. The program designates the first sequence in the file as the reference sequence and considers all other sequences as queries to compare to the reference. Please choose the reference sequence carefully. For example, for an intrapatient set, the reference should probably represent the most common form in the first sampled time point, and for a set of unrelated sequences it is best to use a consensus sequence for the appropriate subtype. (3) Then choose whether to view the complete sequence (leave boxes blank) or a subregion. If you choose to view a subregion, enter the range of the desired subregion in the boxes.
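The input convention described above (first sequence is the reference, all others are queries, optional subregion) can be sketched in Python as follows. This is only an illustration and not the Hypermut source code: the function names `read_fasta` and `split_alignment` are hypothetical, FASTA is assumed as the alignment format, and the subregion is assumed to be given as 1-based inclusive coordinates.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples, in file order."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def split_alignment(records, start=None, end=None):
    """Return (reference, queries), optionally trimmed to a subregion.

    Assumption: start and end are 1-based and inclusive; leaving both as
    None keeps the complete sequence.
    """
    if len(records) < 2:
        raise ValueError("need a reference and at least one query sequence")
    if len({len(seq) for _, seq in records}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    lo = 0 if start is None else start - 1
    trimmed = [(name, seq[lo:end]) for name, seq in records]
    return trimmed[0], trimmed[1:]
```

For example, `split_alignment(read_fasta(text), 2, 4)` would keep alignment columns 2 through 4 of every sequence and return the first record as the reference.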

Box 2: Hypermut 2.0 Customized Options

Hypermut performs 2 different analyses: Original and 2.0. The original Hypermut calculations tally G->A transitions according to the base that follows the G. Hypermut 1.0 was written prior to the discovery of APOBEC; at that time it was known that G->A hypermutation tended to occur when a G was followed by either a G or an A, but it was not known why. Hypermut 2.0 is designed to look specifically at APOBEC-induced transitions. The options in this box apply only to the Hypermut 2.0 analysis, and have no effect on the Original Hypermut output. For typical analyses of APOBEC-induced hypermutation, these options should be left at their default settings. For more detailed analysis, you can edit the mutation pattern in the provided boxes to search for any desired pattern. Read Hypermut 2.0 Details for more information about specific context patterns and what they mean.

As in regular expressions, the symbol "|" means "OR". Thus GGT|GAA matches GGT or GAA. Parentheses "()" can be used for grouping (i.e., one could also write G(GT|AA)). All of the IUPAC codes are supported (e.g., R means G or A, while D means not C).

For technical reasons, the upstream context pattern must always match a fixed number of nucleotides. For example, A|(TC) is not allowed as an upstream pattern because it could have length 1 or 2. The same requirement holds for the mutation pattern, which is normally just one character anyway, but fixed length patterns (of reasonable length) should work fine.
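The pattern syntax and the fixed-length requirement can be illustrated with a small Python sketch. It is not the Hypermut implementation: `to_regex` and `match_lengths` are hypothetical helper names, and the sketch assumes patterns contain only IUPAC nucleotide codes, "|" and parentheses.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def to_regex(pattern):
    """Translate an IUPAC pattern with | and () into a regex string."""
    parts = []
    for ch in pattern.upper():
        if ch in IUPAC:
            parts.append(IUPAC[ch])
        elif ch in "|()":
            parts.append(ch)
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return "".join(parts)


def match_lengths(pattern):
    """Return the set of lengths (in nucleotides) the pattern can match."""
    pos = 0

    def alternation():
        nonlocal pos
        lengths = concatenation()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            lengths = lengths | concatenation()
        return lengths

    def concatenation():
        nonlocal pos
        lengths = {0}
        while pos < len(pattern) and pattern[pos] not in "|)":
            if pattern[pos] == "(":
                pos += 1
                inner = alternation()
                if pos >= len(pattern) or pattern[pos] != ")":
                    raise ValueError("unbalanced parentheses")
                pos += 1
            elif pattern[pos].upper() in IUPAC:
                inner = {1}
                pos += 1
            else:
                raise ValueError(f"unsupported character: {pattern[pos]!r}")
            lengths = {a + b for a in lengths for b in inner}
        return lengths

    result = alternation()
    if pos != len(pattern):
        raise ValueError("unbalanced parentheses")
    return result
```

Under these assumptions, a pattern is acceptable as an upstream context when `len(match_lengths(pattern)) == 1`. GGT|GAA and G(GT|AA) both pass, while A|(TC) does not because it can match either 1 or 2 nucleotides.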

Box 3: Output options

You can obtain the analyses provided by the Original Hypermut tool by clicking "Original". This analysis only considers G->A transitions, without regard for context. For more specific analysis of APOBEC-induced hypermutation, with additional output options such as p-value, choose "Hypermut 2.0".

Hypermut-Original output

Hypermut Original output contains 4 parts: 1) a data sheet summarizing the hypermutations, 2) an xy plot providing an illustrated overview of all the sequences and their nucleotide changes, 3) a graph depicting ALL mutations in a selected sequence, and 4) a table allowing quick analysis of stop codon mutations. The program has two options to allow either an overview of the complete sequence, or a detailed view of a subregion. The hypermutational changes are color coded: red=GG>AG, cyan=GA>AA, green=GC>AC, magenta=GT>AT; all non-G-to-A transitions are indicated in black, and gaps are yellow.

The first page presented after submitting the sequence is the data sheet. This file contains a summary of the analysis. It displays the number of sequences in the input alignment, the name, overall length (or selected region length) and base composition of the reference sequence (below are footnotes for the 8 columns). From here you have 3 options to continue with the analysis. In front of the sequences are two sets of radio buttons. From left to right, the first set will let you choose a sequence of interest and display all mutations occurring in a sequence compared to the reference sequence. The second set of buttons lets you again choose a sequence of interest, however this time it displays a table with a tally of all stop codons within a certain frame (this added option to the Hypermut interface was suggested by Jerry Learn, University of Washington, Seattle).

The purpose of the framereader option is to highlight mutations that have converted amino acid codons into stop codons, e.g., tryptophan codons (UGG) to stops (UAG, UAA or UGA). It is assumed that a stop codon conversion in the normal protein reading frame will create a nonfunctional gene. The program tallies all stop codons in a specified reading frame or all three frames and highlights them in the sequence. The first sequence, designated the reference sequence, provides for a comparison to see where stop codon mutations have occurred. All guanine nucleotides are indicated in red. The program presents a table with the number of counted stop codons in the specified frame. Below the reference sequence, the sequence of interest is printed to the screen in its respective frame to allow a comparison of hypermutations and stop codons. The stop codons are highlighted in bold and shaded in red.
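The stop codon tally can be sketched in Python along these lines. This is an illustration only, not the framereader code: the function names are hypothetical, frames are numbered 0 to 2, and the sequences are assumed to be ungapped and of equal length so that codon positions line up between reference and query.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_positions(seq, frame):
    """Return 0-based start positions of stop codons in the given frame (0-2)."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    return [i for i in range(frame, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS]


def new_stops(reference, query, frame):
    """Return stop codon positions in the query that are not stops in the reference."""
    reference_stops = set(stop_positions(reference, frame))
    return [i for i in stop_positions(query, frame)
            if i not in reference_stops]


def tally_all_frames(seq):
    """Count stop codons in each of the three forward reading frames."""
    return {frame: len(stop_positions(seq, frame)) for frame in range(3)}
```

With a reference of ATGTGGTGGAAA and a query of ATGTAGTGAAAA, `new_stops(reference, query, 0)` reports positions 3 and 6, where two tryptophan codons (TGG) have become TAG and TGA through G to A changes.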

The 3rd option in the Hypermut-Original output allows you to view a graphical output of all sequences compared to the reference sequence. It graphically displays the relative physical location of each mutation, represented by hash marks in the sequence. It will provide 3 different viewing perspectives: the first will show you only the G -> A mutations, the second all non-G -> A mutations, and the third is a combination of all mutations. To view the graph click on the xyplot option. The title in the plot indicates the reference sequence to which all other sequences are compared. The table contains the sequences in question represented as straight lines with their names at the end. The colored hashmarks represent the differences in that sequence compared to the reference sequence. The x-axis represents the sequence length enabling a reference point for a more detailed view over an area of interest, if so desired.

If you prefer to create your own plots, you can download the data. Select "Save As..." from the File menu and choose a destination. The text file saved from the main page contains the data needed to create your own plot and can be imported into any word processing program.

Data sheet for Hypermut-Original

The datasheet obtained contains 8 columns, described as follows:

  1. Sequences_names:

    The locus names of the sequences as they appear in the input alignment.

  2. Ratio:

    This represents the ratio #G->A / #A->G. This value was recommended by Jean Carr as a useful value for hypermutation recognition.

  3. perc_Gs:

    The percentage of Guanines in the reference sequence that have undergone Guanine -->Adenine transitions.

  4. #diffs:

    The total number of positions in which the given sequence differs from the reference sequence.

  5. #G->A:

    The total number of these substitutions that involve a Guanine in the reference sequence being substituted by an Adenine in the given sequence.

  6. #A->G:

    The total number of these substitutions that involve an Adenine in the reference sequence being substituted by a Guanine in the given sequence.

  7. Dinuc Context:

    A tally of the dinucleotide contexts of the Guanine --> Adenine transitions. This represents two contiguous bases in the reference strain and summarizes the context of the G --> A changes.

  8. Observed changes:

    A documentation of all substitutions between the reference and the given sequence. The first letter in each pair represents the character state in the reference sequence; the second letter represents the state in the given sequence (e.g., GA GA are two Guanine --> Adenine substitutions in a row). Hypermutation can be regional, and a concentration of GAs embedded in a longer sequence may indicate hypermutation.
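As a rough illustration of how the 8 columns relate to each other, the following Python sketch computes them for one reference/query pair. It is not the Hypermut source: the function name `datasheet_row` and the dictionary keys are hypothetical, and two choices are assumptions made here, namely that positions with a gap ("-") in either sequence are skipped and that the ratio is reported as None when there are no A->G changes.

```python
from collections import Counter


def datasheet_row(reference, query):
    """Summarize substitutions in an aligned query relative to the reference."""
    reference, query = reference.upper(), query.upper()
    changes = []          # pairs such as "GA": reference base, then query base
    contexts = Counter()  # dinucleotide context of each G->A, from the reference
    for i, (ref_base, qry_base) in enumerate(zip(reference, query)):
        # Assumption: identical positions and gapped positions are not counted.
        if ref_base == qry_base or "-" in (ref_base, qry_base):
            continue
        changes.append(ref_base + qry_base)
        if ref_base == "G" and qry_base == "A" and i + 1 < len(reference):
            contexts["G" + reference[i + 1]] += 1
    g_to_a = changes.count("GA")
    a_to_g = changes.count("AG")
    ref_gs = reference.count("G")
    return {
        "ratio": g_to_a / a_to_g if a_to_g else None,
        "perc_Gs": 100.0 * g_to_a / ref_gs if ref_gs else 0.0,
        "#diffs": len(changes),
        "#G->A": g_to_a,
        "#A->G": a_to_g,
        "dinuc_context": dict(contexts),
        "observed_changes": " ".join(changes),
    }
```

For a reference of GGAGCATGA and a query of AGAACGTAA, this gives 4 differences, of which 3 are G->A and 1 is A->G, so the ratio is 3.0 and perc_Gs is 75.0 (3 of the 4 reference Guanines changed).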

Hypermut 2.0 output

The output from Hypermut 2.0 tells you the pattern definitions that were used, then lists the sequences and their statistics. The Fisher Exact P-value may be a useful way to determine if a specific sequence is a hypermutant. A graphical output allows you to view the locations of the mutations detected, or the cumulative number of mutations across the length of the sequences. The latter option usually provides a more intuitive view of the results.
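For readers unfamiliar with the Fisher exact test, here is a minimal Python sketch of a one-sided test on a 2x2 table. The table layout is an assumption made for illustration (mutated versus unmutated sites in the pattern of interest, compared with mutated versus unmutated sites in a control pattern); see Hypermut 2.0 Details for how the tool actually defines its counts. The function name `fisher_right_tail` is hypothetical.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (not taken from the Hypermut documentation):
      a = pattern sites that are mutated,   b = pattern sites not mutated
      c = control sites that are mutated,   d = control sites not mutated
    Returns the probability of seeing a or more mutated pattern sites
    when all row and column totals are held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k)
               for k in range(a, upper + 1))
    return tail / total
```

A small P-value under this layout would mean that mutations are enriched at pattern sites relative to control sites, which is the kind of evidence one would look for in a candidate hypermutant.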

See additional information about Hypermut 2.0: Hypermut 2.0 Details

Questions or comments? Contact us at hfv-info@lanl.gov