Gene Cutter is useful for extracting protein sequences from viral DNA sequences. It uses a reference sequence to decide where to cut the sequence into genes (details below). It appropriately handles introns and overlapping genes.
Gene Cutter is also useful for annotating coding regions, as required when depositing DNA sequences in GenBank. For HIV-1, HIV-2, and SIV, we provide a GenBank Entry Generation tool, which incorporates Gene Cutter results into a Sequin file ready for deposit.
Gene Cutter does not align non-coding regions or LTRs. You may need to use other alignment tools to correctly handle these regions. Our HCVAlign tool provides similar functions to Gene Cutter, and its MAFFT option will align untranslated regions appropriately.
Regions to align and extract
Gene Cutter can give you just one gene or region, or all genes that your input touches.
Codon align the region
This option will insert gaps into your alignment so that it stays in the correct reading frame, even if your sequences contain frameshifts.
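As a rough illustration of what "staying in the correct reading frame" means for an aligned sequence, the following Python sketch lists codons that mix gaps and bases. It is not part of Gene Cutter; the function name out_of_frame_codons is hypothetical, and the sketch assumes the aligned region starts at the first position of a codon.

```python
def out_of_frame_codons(aligned_seq):
    """Return 0-based indices of codons that break the reading frame.

    A codon is acceptable if it is three bases or three gaps ("---").
    A codon that mixes gaps and bases, or a trailing partial codon,
    is reported.
    """
    problems = []
    for i in range(0, len(aligned_seq), 3):
        codon = aligned_seq[i:i + 3]
        gaps = codon.count("-")
        if len(codon) < 3 or 0 < gaps < 3:
            problems.append(i // 3)
    return problems


# Gaps in a group of 3 between codons keep the frame intact.
print(out_of_frame_codons("ATG---GCC"))
# A single gap puts the following codons out of frame.
print(out_of_frame_codons("ATG-GCC"))
```

An empty list means every gap sits between codons in a group of 3, which is the layout the codon-align option aims to produce.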
Output format and translation options
If you want your results as nucleotides, choose "Do not translate". If you want your output as amino acids, you have 3 choices, described below. If your sequences contain no IUPAC codes, you may select any of the 3 translation options and your results will be identical.
If you request translated output and your sequences contain IUPAC (ambiguity) codes, they can be translated in 3 possible ways:
Note: regardless of which translation option is selected, the presence of IUPAC characters may result in a translation that cannot be read by sequence editors and analysis programs!
Symbols in output
Translations are in the standard 1-letter amino acid alphabet.
# = frame shift or partial codon
$ = stop codon (in nucleotide output)
^ = stop codon (in amino acid output)
Note: codons containing "-" are always translated to either "-" (gap) or "#" (partial codon)
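The gap and partial-codon conventions above can be illustrated with a minimal Python sketch. This is not Gene Cutter's code: the function name translate_codon_aligned is hypothetical, writing "X" for a codon that contains IUPAC codes is an assumption made only for this example, and the stop symbol is passed in as a parameter rather than fixed.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_codon_aligned(nt, stop_symbol):
    """Translate a codon-aligned nucleotide string, one codon at a time."""
    out = []
    for i in range(0, len(nt), 3):
        codon = nt[i:i + 3].upper().replace("U", "T")
        if codon == "---":
            out.append("-")              # whole-codon gap
        elif "-" in codon or len(codon) < 3:
            out.append("#")              # partial codon or frameshift
        else:
            # Assumption: any codon not in the table (IUPAC codes) becomes "X".
            aa = CODON_TABLE.get(codon, "X")
            out.append(stop_symbol if aa == "*" else aa)
    return "".join(out)


print(translate_codon_aligned("ATGGCC---A-GTAA", stop_symbol="$"))  # MA-#$
```

Here the codons ATG, GCC, ---, A-G, and TAA become M, A, a gap, a partial-codon mark, and the chosen stop symbol.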
Gene Cutter has no limit on the number of input sequences, but please observe these suggestions!
How it Works
How Gene Cutter aligns the sequences
Because it contains an internal reference sequence, Gene Cutter frequently gives a better multiple alignment than purely computational alignment programs. (Gene Cutter uses HMMER v2.3.2, trained on the full-length genome alignment.)
NOTE: Mis-alignments at the ends of a coding region may result in a few amino acids/bases not appearing in the output.
How Gene Cutter finds the genes and proteins
Gene Cutter clips the coding regions from a nucleotide alignment and (optionally) codon aligns the sequences. To define the boundaries of genes or domains of interest, and to codon-align the sequences, Gene Cutter uses the coordinates from the HCV reference sequence H77.
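Clipping by reference coordinates can be sketched as follows. This is a simplified illustration, not Gene Cutter's implementation: the function names and the toy alignment are hypothetical, and the coordinates used are not real H77 gene boundaries. The idea is to convert 1-based positions in the ungapped reference into alignment columns, then cut the same columns from every sequence.

```python
def reference_columns(aligned_ref, start, end):
    """Map 1-based, inclusive reference coordinates to a column slice.

    Gaps ("-") in the aligned reference do not count as positions.
    Returns (first_column, last_column_exclusive).
    """
    pos = 0
    first = last = None
    for col, ch in enumerate(aligned_ref):
        if ch == "-":
            continue
        pos += 1
        if pos == start:
            first = col
        if pos == end:
            last = col
            break
    if first is None or last is None:
        raise ValueError("coordinates fall outside the reference")
    return first, last + 1


def clip_region(alignment, ref_name, start, end):
    """Cut the region start..end (reference numbering) from every sequence."""
    lo, hi = reference_columns(alignment[ref_name], start, end)
    return {name: seq[lo:hi] for name, seq in alignment.items()}


# Toy alignment; "ref" stands in for the reference sequence.
toy = {"ref": "AT-GCCTAA", "q1": "ATTGCC-AA"}
print(clip_region(toy, "ref", 2, 4))  # {'ref': 'T-GC', 'q1': 'TTGC'}
```

Because the cut is made on alignment columns, an insertion relative to the reference (the extra T in q1) is carried along inside the clipped region.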
How Gene Cutter codon-aligns
The sequences in the alignment are internally aligned to the H77 reference sequence (provided by the program). This reference sequence is annotated with the correct reading frame for all genes, so the program knows where to start the translation. Gaps are inserted in groups of 3, or shifted to form groups of 3, and are placed only between codons, never in the middle of a codon. In some sequences, an insertion is compensated within a short distance by a deletion, or vice versa. Because these frameshifts may not inactivate the protein, if a compensating mutation is within 5 amino acids of an initial frameshift, Gene Cutter will shift it so that the reading frame is left intact. Otherwise, the frameshift is marked in the output with the hash symbol (#), and the translation is continued in the correct reading frame beyond that codon. Stop codons are marked by a dollar sign ($).
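The compensating-frameshift rule can be sketched in Python as a simple classifier over insertion and deletion events. This is one interpretation for illustration only, not Gene Cutter's algorithm: the function name classify_indels and the event format (reference position, length change) are hypothetical, and "within 5 amino acids" is read here as within 15 nucleotides of the initial event.

```python
def classify_indels(indels, window_codons=5):
    """Label each indel as 'in-frame', 'compensated', or 'frameshift'.

    indels: list of (position, delta) tuples, where position is a
    nucleotide coordinate and delta is +n for an insertion of n bases
    or -n for a deletion of n bases.
    """
    events = sorted(indels)
    window = window_codons * 3
    labels = []
    i = 0
    while i < len(events):
        pos, delta = events[i]
        if delta % 3 == 0:
            labels.append((pos, "in-frame"))
            i += 1
            continue
        # Look ahead for events that restore the frame within the window.
        offset = delta
        j = i + 1
        restored = False
        while j < len(events) and events[j][0] - pos <= window:
            offset += events[j][1]
            if offset % 3 == 0:
                restored = True
                break
            j += 1
        if restored:
            labels.extend((events[k][0], "compensated") for k in range(i, j + 1))
            i = j + 1
        else:
            labels.append((pos, "frameshift"))  # would be marked with '#'
            i += 1
    return labels


events = [(30, +1), (39, -1), (120, -2), (300, +3)]
for position, label in classify_indels(events):
    print(position, label)
```

In this example the 1-base insertion at 30 is offset by the 1-base deletion at 39, so both are labeled compensated, the 2-base deletion at 120 has no nearby partner and is labeled a frameshift, and the 3-base insertion at 300 never leaves the frame.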