The "Stockholm" format is a system for marking up features in a multiple alignment. These mark-up annotations are preceded by a 'magic' label, of which there are four types. The Stockholm format is used by HMMER, Pfam, and Belvu.
The complete specification of the Stockholm format, version 1.0, is:
The first line in the file must contain a format and version identifier, currently:
# STOCKHOLM 1.0
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
.
.
.
//
<seqname> stands for "sequence name", typically in the form "name/start-end" or just "name".
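The "name/start-end" convention can be split mechanically. The following is a minimal Python sketch, not part of the specification: the function name parse_seqname, the SeqName tuple, and the regular expression are illustrative assumptions, and the sketch assumes that start and end are plain non-negative integers.

```python
import re
from typing import NamedTuple, Optional


class SeqName(NamedTuple):
    """Hypothetical container for the parts of a <seqname>."""
    name: str
    start: Optional[int]
    end: Optional[int]


# Assumption: the range, when present, is "/<digits>-<digits>" at the very end.
_NAME_RANGE = re.compile(r"^(?P<name>.+)/(?P<start>\d+)-(?P<end>\d+)$")


def parse_seqname(seqname: str) -> SeqName:
    """Split "name/start-end" into its parts; a bare "name" has no range."""
    match = _NAME_RANGE.match(seqname)
    if match is None:
        return SeqName(seqname, None, None)
    return SeqName(match["name"], int(match["start"]), int(match["end"]))


print(parse_seqname("O83071/192-246"))
print(parse_seqname("O31698"))
```

A name that does not end in a numeric range is returned unchanged, so both forms mentioned above go through the same function.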
The "//" line indicates the end of the alignment.
Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
Wrap-around alignments are allowed in principle, mainly for historical reasons, but some databases, such as Pfam, do not use them. Wrapped alignments are discouraged because they are much harder to parse.
Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.
#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Sequence AND per-Column markup, exactly 1 char per column>
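To illustrate how the four mark-up line types and the sequence lines can be told apart, here is a minimal Python sketch of a reader. It is an assumption-laden illustration rather than a reference implementation: the function name parse_stockholm and the returned dictionary layout are invented here, lines are assumed to be well formed, free text has its internal whitespace collapsed, and repeated sequence or markup lines are concatenated so that wrap-around alignments are tolerated. The sample input is a shortened excerpt of a CBS-domain alignment.

```python
def parse_stockholm(text):
    """Parse one Stockholm 1.0 alignment into plain dictionaries (sketch)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "# STOCKHOLM 1.0":
        raise ValueError("first line must be '# STOCKHOLM 1.0'")

    gf, gc, gs, gr, seqs = {}, {}, {}, {}, {}
    for raw in lines[1:]:
        line = raw.strip()
        if not line:
            continue
        if line == "//":  # end of the alignment
            break
        fields = line.split()
        tag = fields[0]
        if tag == "#=GF":  # per-file, free text
            gf.setdefault(fields[1], []).append(" ".join(fields[2:]))
        elif tag == "#=GS":  # per-sequence, free text
            per_seq = gs.setdefault(fields[1], {})
            per_seq.setdefault(fields[2], []).append(" ".join(fields[3:]))
        elif tag == "#=GC":  # per-column, one character per column
            gc[fields[1]] = gc.get(fields[1], "") + fields[2]
        elif tag == "#=GR":  # per-sequence and per-column
            per_seq = gr.setdefault(fields[1], {})
            per_seq[fields[2]] = per_seq.get(fields[2], "") + fields[3]
        elif tag.startswith("#"):
            continue  # assumption: any other "#" line is ignored
        else:  # <seqname> <aligned sequence>
            seqs[tag] = seqs.get(tag, "") + fields[1]
    return {"GF": gf, "GC": gc, "GS": gs, "GR": gr, "seqs": seqs}


sample = """# STOCKHOLM 1.0
#=GF ID CBS
#=GS O31698/18-71 AC O31698
O31698/18-71 MIEADKVAHVQVGNNLEH
#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHH
#=GC SS_cons CCCCCHHHHHHHHHHHHH
//
"""
alignment = parse_stockholm(sample)
print(alignment["seqs"])
print(alignment["GR"])
```

Splitting on whitespace is sufficient here because sequence letters and mark-up may not contain whitespace; only the free-text fields of #=GF and #=GS need to be rejoined.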
Magic or recommended features:
#=GF
For embedding trees:
#=GF NH <tree in New Hampshire eXtended format>
#=GF TN <Unique identifier for the next tree>
#=GS
Feature                Description
---------------------  -----------
AC <accession>         ACcession number
DE <freetext>          DEscription
DR <db>; <accession>;  Database Reference
OS <organism>          OrganiSm (species)
OC <clade>             Organism Classification (clade, etc.)
LO <look>              Look (Color, etc.)
#=GR
Feature  Description            Markup letters
-------  -----------            --------------
SS       Secondary Structure    [HGIEBTSCX]
SA       Surface Accessibility  [0-9X] (0=0%-10%; ...; 9=90%-100%)
TM       TransMembrane          [Mio]
PP       Posterior Probability  [0-9*] (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
LI       LIgand binding         [*]
AS       Active Site            [*]
pAS      AS - Pfam predicted    [*]
sAS      AS - from SwissProt    [*]
IN       INtron (in or after)   [0-2]
Note: Do not use multiple lines with the same #=GR label. Only one unique feature assignment can be made for each sequence.
"X" in SA and SS means "residue with unknown structure".
In SS the letters are taken from DSSP: H=alpha-helix, G=3/10-helix, I=pi-helix, E=extended strand, B=residue in isolated beta-bridge, T=turn, S=bend, C=coil/loop.
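As an illustration of how per-column SS and SA markup can be decoded, the Python sketch below turns an SS string into labelled column ranges and an SA character into a percentage range. The names DSSP_NAMES, ss_segments, and sa_range are hypothetical, the Greek letters in the DSSP descriptions are spelled out, and columns holding anything other than an SS letter (for example gap characters) are assumed to carry no structure.

```python
from itertools import groupby

# SS letters as listed above, plus "X" for unknown structure.
DSSP_NAMES = {
    "H": "alpha-helix",
    "G": "3/10-helix",
    "I": "pi-helix",
    "E": "extended strand",
    "B": "residue in isolated beta-bridge",
    "T": "turn",
    "S": "bend",
    "C": "coil/loop",
    "X": "residue with unknown structure",
}


def ss_segments(markup):
    """Return (description, first_column, last_column) runs, 1-based columns."""
    segments = []
    column = 0
    for letter, run in groupby(markup):
        length = len(list(run))
        first, last = column + 1, column + length
        column = last
        if letter in DSSP_NAMES:  # skip gap columns and unknown characters
            segments.append((DSSP_NAMES[letter], first, last))
    return segments


def sa_range(char):
    """Map an SA character to (low %, high %); "X" (unknown) gives None."""
    if char == "X":
        return None
    digit = int(char)
    return (digit * 10, digit * 10 + 10)


print(ss_segments("CCCCCHHHHHHHHHHHHH..EEEEEEEE"))
print(sa_range("7"))
```

Column numbers refer to alignment columns, not residue positions, because the markup has exactly one character per column.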
Recommended placements:
#=GF Above the alignment
#=GC Below the alignment
#=GS Above the alignment or just below the corresponding sequence
#=GR Just below the corresponding sequence
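The recommended placements can be followed by a small writer. This Python sketch is only one possible arrangement: the function name format_stockholm and its argument layout are assumptions made for illustration, #=GS lines are placed above the alignment (the first of the two allowed positions), and the sequence name "s1" and the ID "demo" are made-up sample data.

```python
def format_stockholm(seqs, gf=None, gs=None, gr=None, gc=None):
    """Render an alignment with mark-up in the recommended positions (sketch)."""
    gf, gs, gr, gc = gf or {}, gs or {}, gr or {}, gc or {}

    lines = ["# STOCKHOLM 1.0"]
    for feature, values in gf.items():  # #=GF: above the alignment
        for value in values:
            lines.append(f"#=GF {feature} {value}")
    for seqname, features in gs.items():  # #=GS: above the alignment
        for feature, values in features.items():
            for value in values:
                lines.append(f"#=GS {seqname} {feature} {value}")

    # Pad every label to one width so that all columns line up.
    labels = list(seqs)
    labels += [f"#=GR {s} {f}" for s, feats in gr.items() for f in feats]
    labels += [f"#=GC {f}" for f in gc]
    width = max((len(label) for label in labels), default=0) + 2

    for seqname, residues in seqs.items():
        lines.append(f"{seqname:<{width}}{residues}")
        for feature, markup in gr.get(seqname, {}).items():
            label = f"#=GR {seqname} {feature}"  # just below its sequence
            lines.append(f"{label:<{width}}{markup}")
    for feature, markup in gc.items():  # #=GC: below the alignment
        label = f"#=GC {feature}"
        lines.append(f"{label:<{width}}{markup}")
    lines.append("//")
    return "\n".join(lines) + "\n"


print(format_stockholm(
    {"s1": "AC-G"},
    gf={"ID": ["demo"]},
    gr={"s1": {"SS": "HH.C"}},
    gc={"SS_cons": "HH.C"},
))
```

Padding the labels is a readability choice, not a requirement of the format; any whitespace between fields is acceptable.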
No size limits on any field.
However, a simple parser that uses fixed field sizes should work safely on Pfam alignments with these limits:
Line length: 10000.
<seqname>: 255.
<feature>: 255.
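A file can be checked against these limits before it is handed to a fixed-size parser. The Python sketch below is a hypothetical helper, not part of any existing tool: the name check_limits and the wording of the reported problems are assumptions, and lengths are counted in characters.

```python
MAX_LINE = 10000
MAX_SEQNAME = 255
MAX_FEATURE = 255


def check_limits(text):
    """Return (line_number, problem) pairs for fields over the limits (sketch)."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE:
            problems.append((number, "line too long"))
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag in ("#=GF", "#=GC"):  # <feature> is the second field
            seqname = None
            feature = fields[1] if len(fields) > 1 else None
        elif tag in ("#=GS", "#=GR"):  # <seqname> then <feature>
            seqname = fields[1] if len(fields) > 1 else None
            feature = fields[2] if len(fields) > 2 else None
        elif tag.startswith("#") or tag == "//":
            continue  # header line, other comments, end marker
        else:  # sequence line: first field is <seqname>
            seqname, feature = tag, None
        if seqname is not None and len(seqname) > MAX_SEQNAME:
            problems.append((number, "seqname too long"))
        if feature is not None and len(feature) > MAX_FEATURE:
            problems.append((number, "feature too long"))
    return problems


print(check_limits("# STOCKHOLM 1.0\ns1 ACGT\n//\n"))
```

An empty result means that every line, sequence name, and feature name fits within the sizes listed above.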
# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 67
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA  999887756453524252..55152525....36463774777
O83071/259-312          MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31699/88-139 AS   ________________*__________________________
#=GR O31699/88-139 IN   ____________1______________2__________0____
//