Dotter: A dot-matrix program with interactive greyscale rendering
for genomic DNA and Protein sequence analysis
There is also an electronic article on Dotter published in Gene-Combis
Introduction
Dotter is a graphical dotplot program for detailed comparison of two
sequences. Here, every residue in one sequence is compared to every
residue in the other sequence. The first sequence runs along the
x-axis and the second sequence along the y-axis. In regions where the
two sequences are similar to each other, a row of high scores will run
diagonally across the dot matrix. If you're comparing a sequence
against itself to find internal repeats, you'll notice that the main
diagonal scores maximally, since it's the 100% perfect self-match.
To make the score matrix more intelligible, the pairwise scores are
averaged over a sliding window which runs diagonally. The averaged
score matrix forms a three-dimensional landscape, with the two
sequences in two dimensions and the height of the peaks in the third.
This landscape is projected onto two dimensions with the aid of
greyscales: the darker the grey of a peak, the higher it is.
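The averaging step can be illustrated with a small Python sketch. This is
not Dotter's own code: the function name dotplot, the simple
match/mismatch scoring (Dotter uses a score matrix such as Blosum62 for
proteins) and the way the window is shortened at the matrix edges are all
assumptions made for illustration.

```python
def dotplot(seq_x, seq_y, window=25, match=1.0, mismatch=0.0):
    """Return matrix[y][x] of pairwise scores averaged along the diagonal.

    Hypothetical helper: identity scoring and edge handling are assumptions.
    """
    nx, ny = len(seq_x), len(seq_y)
    half = window // 2
    # Raw pairwise scores: every residue of seq_x against every residue of seq_y.
    raw = [[match if seq_x[x] == seq_y[y] else mismatch for x in range(nx)]
           for y in range(ny)]
    averaged = [[0.0] * nx for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            total, count = 0.0, 0
            # Slide along the diagonal through (x, y), centred on that dot.
            for k in range(-half, half + 1):
                i, j = x + k, y + k
                if 0 <= i < nx and 0 <= j < ny:
                    total += raw[j][i]
                    count += 1
            averaged[y][x] = total / count
    return averaged
```

In a self-comparison, every dot on the main diagonal of the returned
matrix gets the maximal score, and an internal repeat shows up as a
second, parallel diagonal.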
Dotter provides a tool to explore the visual appearance of this
landscape, as well as a tool to examine the sequence alignment it
represents. These tools are explained below.
Running Dotter
Command syntax:
dotter [options] query_seq subject_seq [X options]
The sequences may be either protein or DNA but when Dottering DNA vs.
protein the query_seq must be DNA. The sequences should be in Fasta
format or just raw sequence. Fasta format looks like this:
>Name annotation of any sort
MYWTTTAFLYFWQKSTGA
LMKQYWNCYLLPSLYTAV
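As a rough illustration of the two accepted input forms, the following
Python sketch reads either a Fasta record or raw sequence from a string.
The function name read_sequence is hypothetical, and taking the first
word of the header line as the sequence name is an assumption, not a
description of how Dotter itself parses its input.

```python
def read_sequence(text):
    """Return (name, sequence) for the first record in text.

    Hypothetical helper: name is None when the input is raw sequence.
    """
    name = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None or parts:
                break  # a second record starts here; keep only the first
            words = line[1:].split()
            name = words[0] if words else ""
        else:
            parts.append(line)
    return name, "".join(parts)
```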
Options:
-b Batch mode, write dotplot to <file>
-l Load dotplot from <file>
-m Memory usage limit in Mb (default 0.5)
-z Set zoom (compression) factor
-p Set pixel factor manually (ratio pixelvalue/score)
-W Set sliding window size. (K => Karlin/Altschul estimate)
-M Read in score matrix from <file> (Blast format; Default: Blosum62).
-f Read feature segments from <file>
-i Do NOT use installed private colormap, but share with other apps
-r Reverse and complement horizontal_sequence (DNA vs Protein)
-D Don't display mirror image in self comparisons
-w For DNA: horizontal_sequence top strand only (Watson)
-c For DNA: horizontal_sequence bottom strand only (Crick)
-q Horizontal_sequence offset
-s Vertical_sequence offset
The most important X options:
-acefont <font> Main font.
-font <font> Menu font.
(Any standard X option can also be used, such as -bg green -fg red.)
Dottering large DNA sequences, such as cosmids vs. cosmids, will take at
least 15 minutes even on the fastest workstation. In such cases, use
the -b (batch mode) option and run Dotter niced in the background.
Once it's finished, you read in the precalculated Dotter file with the
-l option. The only drawback of Dottering large sequences is that the
width of the sliding window size over which the averaging is done
cannot be changed quickly, since no pre-averaged matrix is stored.
However, extensive testing has shown that changing the sliding window
size from the default of 25 residues has no or only marginal positive
effects.
Dotter runs in linear space, so there is no practical limit on the
length of the sequences - the run time simply grows as n^2. If the
matrix becomes
too big, Dotter automatically zooms out to fit it inside a 707x707
pixel window. The user can choose to use more memory with the -m
option - if it doesn't fit the screen Dotter will provide
scrollbars.
Normally, the identical mirror image of self-comparisons is not
displayed. Use the -D option to force it on. For DNA, both top
strands and the reverse complement of query_seq vs the top strand of
subject_seq will be calculated. Use the -w and -c options if you want
to see only one of these.
The Greyramp tool
To improve visualization, little peaks (noise) can be nullified by a
min cutoff. Similarly, significant peaks above a certain score can be
saturated by a max cutoff. Peaks between min and max use the
greyscales to show their strength. Since the cutoffs for the min and
max scores depend on the nature of the sequences at hand, it is
impossible to know a priori what they should be. The main
novelty of Dotter is that the user can 'play' with the min and max
cutoffs until he/she achieves the optimal separation between noise and
signal. This is not cheating, but a necessary visual aid.
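The effect of the two cutoffs can be sketched in Python as follows. The
function name grey_level, the linear ramp between the cutoffs and the
256 grey levels are assumptions chosen for illustration; they are not
taken from Dotter's source.

```python
def grey_level(score, min_cutoff, max_cutoff, levels=256):
    """Map a score to a grey level: 0 = blank (noise), levels-1 = darkest.

    Hypothetical helper assuming a linear ramp between the two cutoffs.
    """
    if max_cutoff <= min_cutoff:
        raise ValueError("max_cutoff must be greater than min_cutoff")
    if score <= min_cutoff:
        return 0                 # nullified: below the min cutoff
    if score >= max_cutoff:
        return levels - 1        # saturated: above the max cutoff
    fraction = (score - min_cutoff) / (max_cutoff - min_cutoff)
    return int(fraction * (levels - 1))
```

Moving min_cutoff up removes more of the weak background, while moving
max_cutoff down makes moderately strong peaks appear fully dark.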
The Alignment tool
To see the match that causes a given peak in the dotplot, move the
crosshair with the left mouse button to the peak and pop up the
alignment tool. Once in the proximity, use the cursor keys to move
the crosshair one residue at a time. See HELP for key
movements.
Note that dragging the crosshair with the alignment tool active is
very slow - it's best to quit it if you want to drag a lot.
Zooming in
Zoom in to a region by dragging with the middle mouse button. Dotter
will then start up a new independent Dotter job for that region.
Set width of the sliding window
The default width of 25 residues over which the pairwise scores are
averaged has proven very robust. There's normally no need to change
this and I don't expect any other window size to improve the plot much.
Remember that the whole matrix has to be recalculated, so if it took a
long time to calculate it the first time, stay away from this
menu item!
Displaying multiple dotplots simultaneously
When looking for overlaps between many sequences, for instance when
assembling contigs, it can be useful to make a dotplot of all
sequences vs. each other. This way any overlap will show up as a
diagonal in the corner of a subsequence dotplot. Dotter has a
built-in mechanism for this. To run Dotter on many sequences at once,
concatenate the sequence files (in Fasta format, see above). Then
run dotter on the concatenated sequence file against itself, and green
partitioning lines will appear between the sequences. At each
partitioning line, the name of the following sequence is printed.
These lines can be turned on and off with the button "Draw lines at
segment ends" in the "Feature series selection tool", which is
launched from the main menu.
dotter -f foo seq seq &
Dot matrix file format
Since Dotter allows saving and loading of dot-matrices, it can also be
used for displaying dot-matrices generated by other programs. The
dot-matrix is simply stored as a stream of bytes, one byte per pixel.
All rows of pixels (bytes) are concatenated to each other in a
wrap-around manner. To specify the size and other aspects of the
dot-matrix, a header precedes the pixel values. There are presently
two header formats supported by Dotter: a simple (old) and a more
complex, which Dotter saves its own dot-plots in. If you want to use
Dotter to display some arbitrary dot-matrix, you may not care about
things such as score matrices or window length. In that case you
should specify format 1 and omit the format 2 features (everything
after vertical_len).
The header consists of the following fields:
VARIABLE                  TYPE (bytes)                      RANGE  USED_BY_FORMAT
--------                  ------------                      -----  --------------
fileformat                unsigned char (1)                 1-2    1, 2
zoomfactor                int (4)                                  1, 2
horizontal_len            int (4)                                  1, 2
vertical_len              int (4)                                  1, 2
pixel_factor              int (4)                                  2
window_length             int (4)                                  2
score_matrix_name_length  int (4)                                  2
score_matrix_name         char (score_matrix_name_length)          2
score_matrix[24][24]      int (4)*576                              2
Fileformat is simply a version number for backwards compatibility, and
is currently 2. The zoomfactor (compression factor) equals 1, 2,
3... for the number of dots (residue pairs) compressed into one pixel
(zoomfactor 2 => 4 dots/pixel).
The most important thing to keep in mind is that, for technical
reasons, horizontal_len and vertical_len have to be the smallest
multiple of 4 greater than or equal to the actual sequence length.
For example, if the horizontal sequence is 197 residues long,
horizontal_len must be set to 200, and the pixel map must contain this
number of pixels in that direction. If your matrix was made from two
sequences of length 197 and 199, the pixelmap must contain 200x200
pixels.
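The rounding rule is easy to get wrong by one, so here is a minimal
Python sketch of it. The function name padded_len is hypothetical.

```python
def padded_len(seq_len):
    """Smallest multiple of 4 that is greater than or equal to seq_len."""
    return (seq_len + 3) // 4 * 4
```

For the lengths used in the example, padded_len(197) and padded_len(199)
both give 200, while a sequence of exactly 200 residues needs no padding.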
The pixel_factor is the scaling factor between the real score of a dot
and the pixel value, which was used to generate the dot-matrix. The
value doesn't affect the display of the dot-matrix, only its meaning
in absolute values.
The window_length is the length of the sliding window used to generate
the dot-matrix.
The score_matrix fields define the pairwise residue score matrix that
was used to generate the dot-matrix. The order of residues is:
ARNDCQEGHILKMFPSTWYVBZX*
Note that all integers are stored with the most significant byte
first! This is the default for fwrite on Irix and Sun, but the
reverse of DEC Alpha and Linux.
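As an illustration of the byte layout, the following Python sketch
builds a format 1 dot-matrix (header plus pixel bytes) with the integers
stored most significant byte first. The function name format1_dotmatrix
is hypothetical. The sketch assumes a zoomfactor of 1, so that the pixel
map has exactly horizontal_len x vertical_len pixels, and it assumes
that each row of pixels runs along the horizontal sequence; check both
against a file saved by Dotter before relying on them.

```python
import struct


def format1_dotmatrix(rows, zoomfactor=1):
    """Return the bytes of a format 1 dot-matrix: header, then pixels.

    rows: equal-length rows of pixel values (0-255), already padded so
    that both dimensions are multiples of 4 (assumed layout, see above).
    """
    vertical_len = len(rows)
    horizontal_len = len(rows[0]) if rows else 0
    if horizontal_len % 4 or vertical_len % 4:
        raise ValueError("both dimensions must be multiples of 4")
    if any(len(row) != horizontal_len for row in rows):
        raise ValueError("all rows must have the same length")
    # ">" selects big-endian byte order with no padding between fields:
    # fileformat (1 byte), zoomfactor, horizontal_len, vertical_len (4 each).
    header = struct.pack(">Biii", 1, zoomfactor, horizontal_len, vertical_len)
    # One byte per pixel, all rows concatenated one after another.
    pixels = b"".join(bytes(row) for row in rows)
    return header + pixels
```

Writing the returned bytes to a file opened in binary mode gives the
same result on any machine, since the byte order is stated explicitly
instead of being left to the platform default.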
Limitations
Note that the old problems related to colormaps and the inability of
Dotter to work on 16 and 24-bit displays have been resolved since
version 3.0, thanks to Simon Kelley's introduction of the GTK graphics
library. Note however that Dotter runs slower on 16 and 24-bit
displays than on an 8-bit display.
Reference
If you use this program for your work, please reference:
"A dot-matrix program with dynamic threshold control suited for genomic
DNA and protein sequence analysis"
Erik L.L. Sonnhammer and Richard Durbin
Gene 167:GC1-10 (1995)