BIOINFORMATICS: GENOME WIDE ANALYSIS: Genome mapping, assembly and comparison.
Author: Shailesh Kumar Shukla
Genomics is the study of genomes. Genomic studies are characterized by the simultaneous analysis of a large number of genes using automated data-gathering tools. The topics of genomics range from genome mapping, sequencing, and functional genomic analysis to comparative genomic analysis. The advent of genomics and the ensuing explosion of sequence information are the main driving forces behind the rapid development of bioinformatics today.
Genomic study can be tentatively divided into structural genomics and functional genomics. Structural genomics refers to the initial phase of genome analysis, which includes the construction of genetic and physical maps of a genome, identification of genes, annotation of gene features, and comparison of genome structures. The initiative to determine protein structures falls within the realm of structural proteomics and should not be mistaken for a subdiscipline of genomics.
The structural genomics discussed herein mainly deals with the structures of genome sequences. Functional genomics refers to the analysis of global gene expression and gene functions in a genome.
The first step in understanding genome structure is genome mapping, the process of identifying the relative locations of genes, mutations, or traits on a chromosome. A low-resolution approach to mapping genomes is to describe the order and relative distances of genetic markers on a chromosome. Genetic markers are identifiable portions of a chromosome whose inheritance patterns can be followed.
For many eukaryotes, genetic markers represent morphologic phenotypes. In addition to genetic linkage maps, there are other types of genome maps, such as physical maps and cytologic maps, which describe genomes at different levels of resolution. How each map relates to the DNA sequence on a chromosome is illustrated in Figure 1.
Figure 1. The maps represent different levels of resolution to describe a genome using genetic markers. Cytologic maps are obtained microscopically. Genetic maps (grey bar) are obtained through genetic crossing experiments in which chromosome recombinations are analyzed. Physical maps are obtained from overlapping clones identified by hybridizing the clone fragments (grey bars) with common probes (grey asterisks).
Genetic linkage maps, also called genetic maps, identify the relative positions of genetic markers on a chromosome and are based on how frequently the markers are inherited together. The distance between two genetic markers is measured in centiMorgans (cM), a unit derived from the frequency of recombination between the markers: 1 cM corresponds to a recombination frequency of 1%.
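The relation between recombination frequency and map distance can be sketched in a few lines of Python. The function names and the offspring counts are illustrative assumptions, not taken from any particular mapping package, and the simple proportional conversion is only a reasonable approximation for closely linked markers, where multiple crossovers between them are rare.

```python
def recombination_frequency(recombinants: int, total_offspring: int) -> float:
    """Fraction of offspring in which the two markers were separated by recombination."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return recombinants / total_offspring


def map_distance_cm(recombinants: int, total_offspring: int) -> float:
    """Approximate genetic distance in centiMorgans (1 cM = 1% recombination)."""
    return 100.0 * recombinants / total_offspring


# Hypothetical cross: 18 recombinant offspring out of 200 scored.
distance = map_distance_cm(18, 200)
print(f"Estimated distance between markers: {distance:.1f} cM")
```

For markers that lie far apart, the observed recombination frequency underestimates the true distance, which is why mapping software typically applies a correction (a mapping function) rather than this direct conversion.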
Physical maps show the locations of identifiable landmarks on genomic DNA regardless of inheritance patterns. The distance between genetic markers is measured directly in kilobases (kb) or megabases (Mb). Because the distance is expressed in physical units, it is more accurate and reliable than the centiMorgans used in genetic maps.
Cytologic maps refer to banding patterns seen on stained chromosomes, which can be directly observed under a microscope. The observable light and dark bands are the visually distinct markers on a chromosome. A genetic marker can be associated with a specific chromosomal band or region. The banding patterns, however, are not always constant and change with the extent of chromosomal contraction. Thus, cytologic maps can be considered very low-resolution, and hence somewhat inaccurate, physical maps. The distance between two bands is expressed in relative units.
GENOME SEQUENCE ASSEMBLY
Initial DNA sequencing reactions generate short sequence reads from DNA clones. The average length of the reads is about 500 bases. To assemble a whole-genome sequence, these short fragments are joined to form larger fragments after removing overlaps. These longer, merged sequences are termed contigs, which are usually 5,000 to 10,000 bases long. A number of overlapping contigs can be further merged to form scaffolds (30,000–50,000 bases, also called supercontigs), which are unidirectionally oriented along a physical map of a chromosome (Fig. 2). Overlapping scaffolds are then connected to create the final, highest-resolution map of the genome.
Figure 2. Schematic diagram showing three different levels of sequence assembly. Contigs are formed by combining raw sequence reads of various orientations after removing overlaps. Scaffolds are assembled from contigs and oriented unidirectionally on a chromosome. Because sequence fragments generated can be in either of the DNA strands, arrows are used to represent directionality of the sequences written in 5′ → 3′ orientation.
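The idea of joining reads into contigs by removing overlaps can be illustrated with a toy greedy merger. This is a minimal sketch under strong assumptions: reads are error-free, all lie on the same strand, and none is contained inside another. The function names and the `min_overlap` parameter are hypothetical, and production assemblers use far more elaborate methods.

```python
from typing import List


def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(a, b, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        # Join the pair, writing the shared overlap only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up reads that tile one short region.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap are returned as separate contigs, which mirrors the gaps that scaffolding later has to bridge.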
A commonly used constraint to avoid errors caused by sequence repeats is the so-called forward–reverse constraint. When sequences are generated from both ends of a single clone, the distance between the two opposing fragments is fixed within a certain range, because they are always separated by a distance defined by the clone length (normally 1,000 to 9,000 bases). When the constraint is applied, a fragment cannot be placed at a location outside that range, even if it matches a repetitive element there perfectly, so the repeat cannot cause a misassembly. An example of assembly with and without the forward–reverse constraint is shown in Figure 3.
Figure 3. Example of sequence assembly with or without applying the forward–reverse constraint, which fixes the sequence distance from both ends of a subclone.
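The forward–reverse constraint amounts to a simple distance and orientation check on the two end reads of a clone. The sketch below is illustrative only: the `Placement` type, the function names, and the coordinates are assumptions made for this example, with the 1,000 to 9,000 base range taken from the clone lengths mentioned above.

```python
from typing import List, NamedTuple


class Placement(NamedTuple):
    start: int   # leftmost coordinate of the read on the assembly
    strand: str  # "+" for forward, "-" for reverse


def satisfies_forward_reverse(forward: Placement, reverse: Placement,
                              read_length: int,
                              min_insert: int = 1000,
                              max_insert: int = 9000) -> bool:
    """True if the two end reads face each other and span a plausible clone length."""
    if forward.strand != "+" or reverse.strand != "-":
        return False
    span = reverse.start + read_length - forward.start
    return min_insert <= span <= max_insert


def filter_mate_placements(forward: Placement, candidates: List[Placement],
                           read_length: int) -> List[Placement]:
    """Keep only the candidate placements of the reverse read that obey the constraint."""
    return [c for c in candidates
            if satisfies_forward_reverse(forward, c, read_length)]


# Hypothetical case: the reverse read matches two places equally well,
# one near its mate and one at a distant copy of a repeat.
forward_read = Placement(start=10_000, strand="+")
candidates = [Placement(start=13_500, strand="-"),
              Placement(start=250_000, strand="-")]
print(filter_mate_placements(forward_read, candidates, read_length=500))
```

Only the placement about 4,000 bases from the forward read survives; the distant repeat copy is rejected even though its sequence match is equally good.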
COMPARATIVE GENOMICS
Comparative genomics is the comparison of whole genomes from different organisms, including their gene number, gene location, and gene content. The comparison reveals the extent of conservation among genomes, which provides insights into the mechanisms of genome evolution and gene transfer among genomes. It helps in understanding the pattern of acquisition of foreign genes through lateral gene transfer. It also helps to reveal the core set of genes common to different genomes, which should correspond to the genes that are crucial for survival.
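Once equivalent genes have been identified across genomes, finding the core gene set reduces to a set intersection. The following sketch assumes that step has already been done and that each genome is represented by a set of shared gene identifiers. The genome names and gene lists are made up for illustration.

```python
from typing import Dict, Set, Tuple


def core_and_accessory(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split genes into those present in every genome and those present in only some."""
    if not genomes:
        raise ValueError("at least one genome is required")
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)      # genes shared by all genomes
    all_genes = set.union(*gene_sets)        # every gene seen in any genome
    return core, all_genes - core


# Hypothetical gene content of three genomes.
genomes = {
    "genome_A": {"dnaA", "gyrB", "recA", "blaTEM"},
    "genome_B": {"dnaA", "gyrB", "recA"},
    "genome_C": {"dnaA", "gyrB", "recA", "tetM"},
}
core, accessory = core_and_accessory(genomes)
print("Core genes:", sorted(core))
print("Accessory genes:", sorted(accessory))
```

Genes that fall outside the core set, such as those found in a single genome, are natural candidates to examine for lateral gene transfer.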