Statistical properties of short subsequences in microbial genomes and their link to pathogen identification and evolution

Meizhuo Zhang, Catherine Putonti, Sergei Chumakov, Adhish Gupta, George E. Fox, Dan Graur, Yuriy Fofanov

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation


Numerous sequencing projects have produced partial and complete microbial genomes. The volume of data far exceeds what can be analyzed manually and thus requires computational methods. A significant amount of work has focused on the diversity of statistical characteristics along microbial genomic sequences, e.g. codon bias, G+C content, and the frequencies of short subsequences (n-mers). Based upon the results of these studies, two observations were made: (1) regions whose statistical properties (e.g. codon bias) differ from those of the rest of the genomic sequence correlate with evolutionarily significant regions, e.g. regions of horizontal gene transfer; and (2) because no two microbial genomes look statistically identical, statistical properties can be used to distinguish between genomic sequences. Recently, we conducted an extensive analysis of the presence/absence of n-mers in many microbial genomes as well as several viral and eukaryotic genomes. This analysis revealed that the presence of n-mers in all genomes considered (in the range of n for which the condition M << 4^n holds, where M is the genome length) can be treated as a nearly random and independent process. Thus we hypothesize that relatively small sets of randomly picked n-mers may be used to differentiate between microorganisms. We also analyzed the frequency of appearance of all 8- to 12-mers present in each of the 200+ publicly available microbial genomes. For nearly all of the genomes under consideration, we observed that some n-mers are present much more frequently than expected: from 50 to over a thousand copies. Upon closer inspection of these sequences, we found several cases in which an overrepresented n-mer exhibits a bias toward either coding or non-coding regions.
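The random-presence observation and the search for overrepresented n-mers can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the uniform A/C/G/T background, and the `fold` threshold are assumptions made only for illustration. Under that assumed model, a sequence of length M contains M - n + 1 overlapping n-mers, each specific n-mer is expected to occur (M - n + 1) / 4^n times, and the expected fraction of all 4^n n-mers seen at least once is approximately 1 - (1 - 4^-n)^(M - n + 1).

```python
from collections import Counter


def count_nmers(seq: str, n: int) -> Counter:
    """Count every overlapping n-mer on one strand of seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def expected_fraction_present(genome_length: int, n: int) -> float:
    """Expected fraction of the 4**n possible n-mers seen at least once.

    Assumption: a uniformly random sequence whose positions are
    treated as independent trials.
    """
    positions = genome_length - n + 1
    return 1.0 - (1.0 - 4.0 ** -n) ** positions


def overrepresented(counts: Counter, genome_length: int, n: int, fold: float) -> dict:
    """Return n-mers whose count exceeds `fold` times the uniform expectation.

    `fold` is a hypothetical threshold, not a value taken from the abstract.
    """
    expected = (genome_length - n + 1) / 4.0 ** n
    return {kmer: c for kmer, c in counts.items() if c > fold * expected}
```

When M << 4^n the expected fraction is close to M / 4^n, so most n-mers are absent and the presence of any particular one carries information. A real genome has a non-uniform base composition, so a more careful expectation would account for that before calling an n-mer overrepresented.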
Although the evolutionary reason for the conservation of such sequences remains unclear, in some cases it is plausible that a clear bias toward non-coding regions reflects a role in the DNA uptake/recombination process, membership in insertion sequences, or function as transcription factor recognition sites. Our analysis of the frequency of appearance of 6-mers in each microbial genome revealed regions that display unusual statistical properties with respect to their own genome. After inspecting the genes contained within these regions, we believe that such regions are likely to have been acquired through horizontal gene transfer.
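One way to look for regions whose 6-mer usage differs from that of their own genome is to compare a sliding window's n-mer profile with the genome-wide profile. The sketch below is an illustration only: the abstract does not state the window size, step, or distance measure, so the `window` and `step` defaults and the L1 distance used here are assumptions, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def nmer_profile(seq: str, n: int = 6) -> dict:
    """Relative frequency of each n-mer over the A/C/G/T alphabet.

    n-mers containing any other symbol (e.g. N) are ignored.
    """
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    keys = ["".join(p) for p in product("ACGT", repeat=n)]
    total = sum(counts[k] for k in keys)
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def window_scores(genome: str, n: int = 6, window: int = 5000, step: int = 2500):
    """Yield (start, score) for each window along the genome.

    The score is the L1 distance between the window's n-mer profile and
    the whole-genome profile (an assumed measure); larger means more atypical.
    """
    background = nmer_profile(genome, n)
    for start in range(0, len(genome) - window + 1, step):
        local = nmer_profile(genome[start:start + window], n)
        yield start, sum(abs(local[k] - background[k]) for k in background)
```

Windows with the highest scores would be candidates for closer inspection of the genes they contain. Whether such a region reflects horizontal gene transfer still has to be judged from that inspection, as the abstract describes.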

Original language: English (US)
Pages (from-to): 13-18
Number of pages: 6
Journal: AIP Conference Proceedings
State: Published - 2006
Externally published: Yes
Event: 9th Mexican Symposium on Medical Physics - Guadalajara, Jalisco, Mexico
Duration: Mar 18 2006 to Mar 23 2006


Keywords

  • Pathogen identification
  • Short subsequences
  • Statistical properties

ASJC Scopus subject areas

  • General Physics and Astronomy

