Lesson 27. BIOINFORMATICS AND GENOMICS, TRANSCRIPTOMICS AND PROTEOMICS

Module 6. Bioinformatics and ‘Omics’ revolution

27.1 Definition of Bioinformatics

As per the Oxford English Dictionary, bioinformatics is defined as conceptualizing biology in terms of molecules (in the sense of physical chemistry) and applying "informatics techniques" (derived from disciplines such as applied mathematics, computer science and statistics) to understand and organize the information associated with these molecules on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

The term was coined by Paulien Hogeweg and Ben Hesper in 1970. Bioinformatics involves the development and application of computer hardware and software for the acquisition, storage and analysis of large volumes of biological data. These data pertain to nucleotide sequences (genomics), protein sequences (proteomics) and many other outputs of ‘omics’ technologies, all of which need to be analysed with bioinformatic tools, as shown in Fig. 27.1.

Fig. 27.1 Data analysis obtained from various ‘Omics’ technologies by bioinformatic tools


Bioinformatics makes extensive use of mathematical and statistical models, which underlie its computer science component. The mathematics involved deals mainly with algorithms, that is, step-wise methods for solving a problem. Many of these techniques and algorithms were developed specifically for the analysis of biological data.

27.2 Aims of Bioinformatics

Bioinformatics has three main aims.

1. To organize data in a way that allows researchers to access existing information and to submit new entries, particularly in nucleotide and protein databases, e.g. GenBank and the Protein Data Bank.

2. To develop the tools and resources required for the analysis of data, e.g. software tools that compare nucleotide and protein sequences and align them with existing sequences stored in databases.

3. To apply these tools to analyze the data and interpret the results in a biologically meaningful manner.

27.3 Databases

A database is an organized collection of data for one or more uses, typically in digital form, that can be easily accessed, managed and updated. The earliest biological databases were created in the USA and the UK. A database includes the associated software tools necessary for accessing, updating, inserting and deleting information. Databases of the following types are available for global access.

• Sequences (DNA / Nucleotide, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolism
• Bibliography
• Expression (Microarrays)
• PubMed
• SNP (single nucleotide polymorphism)
• Specialized i.e. Expressed Sequence Tags (EST), Sequence tagged sites (STS) etc.

27.4 National Center for Biotechnology Information (NCBI) Resource

The National Center for Biotechnology Information (NCBI) was established in 1988 in the USA. It is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The NCBI houses genome sequencing data in GenBank and an index of biomedical research articles in PubMed (with full-text articles archived in PubMed Central), as well as other information relevant to biotechnology. All these databases are available online through the Entrez search engine.

27.4.1 Nucleotide sequence databases

The International Nucleotide Sequence Database Collaboration (INSDC) is a joint effort to collect and disseminate databases containing DNA and RNA sequences. It involves three major databases: GenBank (www.ncbi.nlm.nih.gov/GenBank), the European Molecular Biology Laboratory database (EMBL, www.ebi.ac.uk/embl) and the DNA Data Bank of Japan (DDBJ, www.ddbj.nig.ac.jp).

27.4.1.1 GenBank database

The GenBank database was created in 1982 with funding from the National Institutes of Health (NIH), the National Science Foundation, the Department of Energy and the Department of Defense, USA. Since 1992, the NCBI has been responsible for making the GenBank DNA sequence database available to the entire scientific community. GenBank is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations. Sequences can be submitted to GenBank using BankIt, a web-based form, or Sequin, a stand-alone submission program. GenBank also coordinates with individual laboratories and with other sequence databases such as EMBL and DDBJ.

27.4.1.2 EMBL nucleotide sequence database

The EMBL database constitutes Europe’s primary nucleotide sequence resource. It includes direct submissions from researchers, genome sequencing projects and patents.

27.4.1.3 DNA Data Bank of Japan (DDBJ)

DDBJ began its activities in 1986 at the National Institute of Genetics (NIG) with the endorsement of the Ministry of Education, Science, Sports and Culture of Japan. It collects nucleotide sequences from researchers and issues internationally recognized accession numbers to submitters.

The DDBJ, EMBL and GenBank databases are kept synchronized according to guidelines published by the International Advisory Board.


27.4.2 Protein databases

Some of the frequently used protein databases include the Protein Data Bank (PDB, www.rcsb.org/pdb), Swiss-Prot (www.expasy.ch/sprot/sprot-top.html), the Protein Information Resource (PIR, www.mips.biochem.mpg.de/proj/protsedb) and UniProt (www.uniprot.org).

27.4.3 Genome database

The Genome database contains sequences of complete and in-progress genomes and sequence maps with contigs, as well as draft genome assemblies. It is organized into six main groups: Archaea, Bacteria, Eukaryota, Viruses, Viroids and Plasmids.

27.5 Software Tools For Data Analysis

The NCBI provides software tools that are available through the web or by FTP. These tools are applied in three major areas: sequence, structure and function analysis.

27.5.1 Sequence analysis

27.5.1.1 BLAST (Basic Local Alignment Search Tool)

BLAST is a sequence similarity searching program. It can compare a nucleotide or protein query sequence against the GenBank DNA or protein databases, and results are typically returned quickly, often within seconds for a short query. Sequence alignments are either pairwise or multiple. A pairwise alignment considers one pair of sequences at a time, whereas a multiple alignment considers more than two sequences at a time. Multiple alignments are commonly built progressively from pairwise alignments, and they provide information about conserved sequences in a group of closely or distantly related organisms. BLAST can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.
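The idea of a pairwise local alignment can be illustrated with a short sketch. BLAST itself uses a faster seed-and-extend heuristic, so the code below is not BLAST; it uses the exhaustive Smith-Waterman dynamic programming method instead. The function name and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Two made-up sequences that share the segment ACGTACG.
    print(smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG"))
```

The exhaustive method compares every position of one sequence with every position of the other, which is why database-scale searches rely on heuristics instead.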

27.5.1.2 BLAST microbial genomes

It performs a BLAST search for similar sequences from selected complete eukaryotic and prokaryotic genomes.

27.5.1.3 Open reading frame finder (ORF finder)

ORF analysis tool finds all open reading frames in a user's sequence or in a sequence already in the database. Sixteen different genetic codes can be used. The deduced amino acid sequence can be saved in various formats and searched against protein databases using BLAST.
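A minimal sketch of how an open reading frame search might work is given below. It is not the NCBI implementation: it assumes only the standard genetic code (the tool described above offers several), treats an ORF as a stretch from ATG to the next in-frame stop codon, and uses hypothetical function names and a hypothetical minimum length parameter.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_length=3):
    """Return (strand, frame, protein) for each ATG...stop ORF in six frames.

    min_length is the minimum number of residues (including the initial M).
    ORFs that run off the end of the sequence without a stop are ignored.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = None  # None means we are not currently inside an ORF
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                aa = CODON_TABLE.get(codon, "X")  # X for ambiguous codons
                if protein is None:
                    if codon == "ATG":
                        protein = "M"
                elif aa == "*":
                    if len(protein) >= min_length:
                        orfs.append((strand, frame, protein))
                    protein = None
                else:
                    protein += aa
    return orfs


if __name__ == "__main__":
    # Made-up sequence containing ATG GCT AAA TGA in the third forward frame.
    print(find_orfs("CCATGGCTAAATGACC"))
```

The deduced protein strings returned here correspond to the amino acid sequences that could then be searched against protein databases.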

27.5.1.4 Primer-BLAST

The Primer-BLAST tool uses Primer3 to design PCR primers for a sequence template. The candidate primers are then automatically analyzed with a BLAST search against user-specified databases to check their specificity.
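Two quantities commonly inspected when judging a candidate primer are its GC content and its approximate melting temperature. The sketch below uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rough approximation for short oligonucleotides; it is an assumption made for illustration and is not the thermodynamic model used by Primer3. The function names and the example primer are hypothetical.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGCATGCATGCATGCATGC"  # made-up 20-mer
    print(gc_content(primer), wallace_tm(primer))
```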

27.5.1.5 VecScreen

VecScreen identifies segments of a nucleic acid sequence that may be of vector origin. It searches a query sequence for segments that match any sequence in a specialized non-redundant vector database (UniVec).
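The screening idea can be sketched as follows. VecScreen itself relies on a BLAST search against UniVec; the simplified code below instead flags regions of the query that share exact k-mers with a list of vector sequences. The function name, the value of k and the sequences in the usage example are all assumptions made for illustration.

```python
def vector_segments(query, vectors, k=12):
    """Return merged (start, end) intervals of query covered by vector k-mers.

    Intervals are half-open, i.e. query[start:end] is the flagged segment.
    """
    # Index every k-mer that occurs in any vector sequence.
    kmers = {v[i:i + k] for v in vectors for i in range(len(v) - k + 1)}

    segments = []
    for i in range(len(query) - k + 1):
        if query[i:i + k] in kmers:
            if segments and i <= segments[-1][1]:
                # Overlaps or touches the previous hit: extend that segment.
                segments[-1] = (segments[-1][0], i + k)
            else:
                segments.append((i, i + k))
    return segments


if __name__ == "__main__":
    vector = "GAATTCGAGCTCGGTACCCGGG"            # made-up vector fragment
    query = "ATATATAT" + vector[:16] + "AGAGAGAG"  # insert flanked by other DNA
    print(vector_segments(query, [vector], k=8))
```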

27.5.2 Structural analysis

Structural analysis involves tools to compare and analyse new protein structures with the known protein structures available in the databases.

27.5.3 Functional analysis

Functional analysis tools are used for gene expression profiling, prediction of protein-protein interactions, etc.

27.6 Bioinformatics for Next Generation Sequence Analysis

Next generation sequencing technologies such as Solexa (Illumina), pyrosequencing (454) and SOLiD are currently used for genome sequencing and generate enormous amounts of data. Hence, a variety of software tools have been made available for analyzing these data, chiefly de novo sequence assembly programs and associated tools, which provide a platform for analysis of whole genome sequences. Algorithms have been developed specifically for the assembly of very short reads. Genome annotation tools, comprising structural and functional analysis, have also been developed to assign biological functions to genes.
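The principle of de novo assembly, joining reads wherever the end of one overlaps the start of another, can be sketched with a toy greedy algorithm. Production assemblers for short reads typically use de Bruijn graphs or overlap-layout-consensus methods and must cope with sequencing errors and repeats, none of which is handled here. The function names, the minimum overlap and the example reads are assumptions for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; return the contigs found so far
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Three error-free reads drawn from the made-up sequence ATGCGTACGTTAGC.
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```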

27.7 Applications of Bioinformatics

  • With the large surge of data, computational tools have become indispensable in all branches of the life sciences.
  • The drug design process has become much faster, and its cost has decreased. The new field of pharmacogenomics allows scientists to use bioinformatics tools to design and prescribe personalized medication for individuals.
  • Drug targets in infectious organisms can be revealed by whole genome comparisons of infectious and non-infectious organisms.
  • Bioinformatics can be immensely useful in clinical diagnostics, as it helps to diagnose genetic disorders and other health problems.
  • In silico screening can identify small molecule ligands for development as potential drugs against infectious agents and cancer.

27.8 Genomics

With the advent of the ‘Omics’ era (Fig. 27.2), the structure and function of genes can be deciphered. The basic principle of ‘Omics’ technologies is shown in Fig. 27.3.

Fig. 27.3 Principle of ‘Omics’ technologies from genes to proteins


‘Omics’ sciences generate massive amounts of data, which require powerful statistical tools for analysis. These sciences play an enormous role in understanding normal biological function, disease and personalized health care. Genomics, one of the first ‘omics’ sciences, is the study of the genomes of organisms, or the comprehensive study of the genetic information of a cell or an organism. It can also be described as the ‘omics’ study of the genes of individual organisms, populations and/or species. A genome is the complete genetic material of an individual organism, including all of its genes. The term ‘genomics’ was coined in 1986 by Dr. Tom Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, Maine), at a meeting held in Maryland on the mapping of the human genome. Genomics integrates intensive efforts to determine the entire DNA sequence of organisms with fine-scale genetic mapping.

The first genomes to be sequenced were those of viruses and a mitochondrion, by Frederick Sanger and his group between 1970 and 1980. Sanger sequenced the first DNA-based genome, that of bacteriophage ΦX174 (5,386 bp), in 1977. The first free-living organism to have its genome sequenced was Haemophilus influenzae (1.8 Mb) in 1995, and since then genomes have been sequenced at a rapid pace.

The importance of deciphering the entire human genome sequence was recognized more than two decades ago, and it was the first major step in ushering in the field of genomics. The Human Genome Project was initiated in the United States by the National Institutes of Health (NIH) and the Department of Energy, together with an international consortium. In parallel, Dr. J. Craig Venter of The Institute for Genomic Research (TIGR) and Celera Genomics also took the initiative to sequence the human genome. The NIH-led consortium sequenced the genome using a tiling path of large-insert clones. The Venter group employed a random shotgun sequencing method followed by computer analysis to assemble the overlapping sequences obtained. On June 26, 2000, it was announced that a working draft of the human genome had been completed. The draft sequence was published in 2001, and the project was declared complete in 2003, ahead of schedule. This achievement opened thousands of new doors for scientific research, offering vast opportunities for better health, longer lives and richer human understanding.

With the rapid advances in second (next) generation sequencing technologies, an entire human genome can now be sequenced in less than a month, and the cost of genome sequencing is falling so low that personal genomics is becoming a reality. The cost of sequencing an entire human genome is projected to be approximately $1000 in the near future, and a single prokaryotic genome sequence is projected to cost only around $1. The 1000 Genomes Project (www.1000genomes.org), an international public-private consortium and the largest effort to date to build a detailed map of human genetic variation, announced the completion of three pilot projects and the deposition of the resulting data in freely available public databases for use by research groups the world over. The three industrial participants were 454 Life Sciences (Roche), Applied Biosystems (CA, USA) and Illumina Inc. (San Diego). Work has also begun on the full-scale effort to build a public database containing information from the genomes of 2,500 people from 27 populations around the world.

Recently, the Council of Scientific and Industrial Research (CSIR), India, completed the first human genome sequencing in India. Sridhar Sivasubbu and Vinod Scaria of the Institute of Genomics and Integrative Biology (IGIB), New Delhi, sequenced the genome of a 52-year-old resident of Jharkhand, comprising about 310 crore (3.1 billion) base pairs. The analysis revealed marked genetic variations suggesting a vulnerability of Indians to cardiovascular diseases (CVD), colorectal cancer and schizophrenia. However, sequencing needs to be performed on a large number of people across regions before any conclusion can be drawn. IGIB plans to sequence the DNA of ten Indians from different states. With the completion of this sequence, India joined the select league of countries, including the US, China, Canada, the UK and Korea, that have demonstrated the capability to sequence and assemble complete human genomes. CSIR completed the human genome in 45 days by adopting next generation sequencing technology, which yielded over 13x coverage of the human genome, and by integrating complex computational tools with high-throughput analytical capabilities on a supercomputer. The sequencing of the first human genome in India, in conjunction with the Indian Genome Variation programme, opens new possibilities in disease diagnostics, treatment and low-cost drugs for healthcare.

Many microbial sequencing projects have also been completed or are under way. A total of 1525 microbial genomes and 1112 eukaryotic genomes have been sequenced. Entrez Genomes currently contains 3805 reference sequences for 2621 viral genomes and 41 reference sequences for viroids. A number of comparative genome studies are under way to link genotype and phenotype at the genomic level.
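The 13x coverage figure mentioned above refers to sequencing depth, the average number of times each base of the genome is read. A common way to estimate it is coverage = (number of reads × read length) / genome size. The sketch below applies this relation; the read length of 100 bases is an assumption chosen for illustration and is not stated in the lesson, and the function names are hypothetical.

```python
import math


def coverage(n_reads, read_length, genome_size):
    """Average sequencing depth: total bases sequenced divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    genome_size = 3_100_000_000  # about 310 crore base pairs
    read_length = 100            # assumed read length, for illustration only
    n = reads_needed(13, read_length, genome_size)
    print(n, coverage(n, read_length, genome_size))
```

Under these assumptions, roughly 400 million reads would be needed, which gives a sense of why high-throughput computation is required for assembly.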

Genome projects and sequences are available at:
  • DOE Joint Genome Institute -- Human, plant, animal and microbial sequencing.
  • GOLD -- The Genomes Online Database provides comprehensive access to information on complete and ongoing genome projects around the world.
  • Comprehensive Microbial Resource -- A tool that allows the researcher to access all of the bacterial genome sequences completed to date.
  • Entrez Genome Project -- A resource from the National Center for Biotechnology Information (NCBI) for accessing information about completed and in-progress genomes.

27.8.1 Applications

  1. Analysis of genes at the functional level (functional genomics), which is one of the main uses of genomics.
  2. Study of evolutionary relationships between populations, species and genera.
  3. Molecular diagnostics.
  4. Improving the quality and efficiency of next generation technologies in terms of simplicity, time and cost.

27.9 Comparative Genomics

Comparative genomics is the analysis and comparison of genomes from different species. The aim is to gain a better understanding of how species have evolved and to determine the function of genes and non-coding regions of the genome. Comparative genomics involves the use of computer programs that can align multiple sequences and look for regions of similarity among them. Commonly used tools include BLAST, a set of programs accessible from the National Center for Biotechnology Information (NCBI) and designed to perform similarity searches on all available sequence data, and the multiple alignment program ClustalW.

27.10 Metagenomics

Metagenomics is the study of metagenomes, the genetic material recovered directly from environmental samples, i.e. community analysis. The underlying strategy for a metagenomic study is given in Fig. 27.4. Advances in sequencing technologies such as pyrosequencing and Illumina sequencing can be exploited to obtain information about the genes of all members of the sampled communities. The power of genomic analysis is thus applied to entire communities of microorganisms to discover new microorganisms or their functional properties, bypassing the need to isolate and culture individual microbial species. Metagenomics finds application in almost all frontier areas of science, including agriculture, environment, energy and human health.

Fig. 27.4 Metagenomic library preparation from different environmental samples


So far we have very little information about the bacterial diversity present on earth, since less than 1% of bacteria have been cultured. This is due to a lack of knowledge of the physiological and cultural conditions required to cultivate such unknown organisms on laboratory media. The term ‘metagenomics’ was coined by Jo Handelsman and others at the University of Wisconsin and first appeared in a publication in 1998; early studies relied on 16S rRNA sequences. The gold standard for molecular identification of microbial species is the phylogenetic analysis of small-subunit rRNA genes (SSU rDNA), which are present in all cellular organisms and can be analysed using DNA extracted from composite environmental samples. Recent advances in sequencing technologies have simplified the task to a great extent and enable researchers to sample all genes from all members of the sampled communities to study their diversity. Metagenomics thus provides information on both the types of organisms present and their metabolic processes. Powerful molecular methods for microbiota analysis, including 16S rRNA sequencing through a massively parallel barcoded pyrosequencing approach, make it possible to analyze the microbiota in environmental samples comprehensively and efficiently.
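Once 16S rRNA reads have been assigned to taxa, the diversity of a sample is often summarized with a single number. One widely used summary, chosen here as an assumption since the lesson does not name a specific measure, is the Shannon index H = -Σ p_i ln(p_i), where p_i is the fraction of reads assigned to taxon i. The function name and the taxon counts in the usage example are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity index from a mapping of taxon name to read count."""
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log(n / total)
        for n in counts.values()
        if n > 0  # taxa with zero reads contribute nothing
    )


if __name__ == "__main__":
    # Made-up read counts per taxon for two samples.
    even_sample = {"taxonA": 25, "taxonB": 25, "taxonC": 25, "taxonD": 25}
    skewed_sample = {"taxonA": 97, "taxonB": 1, "taxonC": 1, "taxonD": 1}
    print(shannon_index(even_sample), shannon_index(skewed_sample))
```

A sample dominated by one taxon gives a lower index than a sample in which the same taxa are evenly represented.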

27.11 Functional Genomics

Functional genomics is the study of function-related aspects of the genome. It encompasses transcriptomics, proteomics and metabolomics, which are discussed in the following sections. Although the complete sequence of the human genome is available, the function of many genes is still to be worked out in detail. With the development of several high-throughput molecular technologies, in addition to mutagenesis and gene knockout techniques, it is now possible to decipher gene function at a faster pace. Only about 60% of the genes in E. coli and humans have been annotated for their functions. The remaining 40% include genes that are unique to the organism.

27.12 Transcriptomics

After completion of the sequencing of the human genome, efforts are now being made to determine the function of the genes located therein. With advances in bioinformatics, researchers predicted that the entire human genome encodes only about 20,000 to 25,000 protein-coding genes, and not all of these genes are expressed in a given cell at a given time. Transcriptomics is the study of the transcripts encoded by the genome. The transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and non-coding RNA, produced in one cell or a population of cells. Transcriptomics is also referred to as expression profiling: it examines the expression level of mRNAs in a given cell population under a given set of conditions, since the expression level of genes may vary under different conditions. The genome is largely static, but the transcriptome is highly dynamic and changes with varying patterns of gene expression. Within an organism, the transcriptomes of different cell types are not identical.

Currently, the most widely used methods for transcriptome analysis are DNA microarrays and RNA-Seq using next generation sequencing technologies. DNA microarray technology is a powerful tool for obtaining a transcriptome and helps in studying the genes that are turned ‘on’ in a particular environment. The mRNA from the cells under study is extracted, labeled with a fluorescent dye and applied to a DNA array slide spotted with a large number of DNA probes (as discussed under DNA microarray, Chapter 3.7). The labeled mRNA hybridizes to its complementary DNA on the microarray and gives a fluorescent signal. This approach can be used to identify genes that are expressed under normal and diseased conditions. Transcriptomics is important for identifying the set of genes that are differentially expressed in distinct cell populations or subtypes, and for obtaining data on the proteins likely to be found in a particular cell.
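Differential expression between two conditions is commonly summarized as a log2 fold change of the measured signal. The sketch below illustrates the calculation only; the gene names, intensity values, pseudocount and threshold are all hypothetical, and a real analysis would also require replicates, normalization and statistical testing.

```python
import math


def log2_fold_change(normal, diseased, pseudocount=1.0):
    """log2(diseased / normal) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((diseased[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in normal
        if gene in diseased
    }


def differentially_expressed(normal, diseased, threshold=1.0):
    """Genes whose absolute log2 fold change is at least the threshold."""
    changes = log2_fold_change(normal, diseased)
    return {gene: value for gene, value in changes.items() if abs(value) >= threshold}


if __name__ == "__main__":
    # Made-up expression values for three hypothetical genes.
    normal = {"geneA": 99, "geneB": 49, "geneC": 199}
    diseased = {"geneA": 399, "geneB": 49, "geneC": 49}
    print(differentially_expressed(normal, diseased))
```

A positive value indicates higher expression in the diseased sample and a negative value indicates lower expression; a threshold of 1 corresponds to a two-fold change.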

Global transcriptome analysis is an emerging area for investigating the role of genetic variants in diseases such as cancer, metabolic diseases like diabetes, and CVD. A recent study conducted in India reported that Indians are more susceptible to diabetes because of genetic variants. According to this study, Asian Indians are more prone to obesity and diabetes because variants found in the FTO gene and near MC4R are associated with a 2 cm increase in waist circumference and with insulin resistance, which leads to the development of Type 2 diabetes.

27.12.1 Tools

  • Global analysis: high-density DNA microarrays (please see Lesson 15 for DNA microarrays)
  • Real-time PCR (RT-qPCR, molecular beacons)
  • RNA-Seq: uses deep sequencing (next generation sequencing technologies) to study the transcriptome at the nucleotide level. RNA-Seq provides a far more precise measurement of the levels of transcripts and their isoforms than other methods.
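In RNA-Seq, transcript levels are estimated from the number of reads mapped to each transcript, corrected for transcript length and sequencing depth. One common normalization, used here as an assumption since the lesson does not specify a unit, is transcripts per million (TPM). The function name, transcript names, counts and lengths below are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    counts: mapping of transcript name to mapped read count.
    lengths_bp: mapping of transcript name to transcript length in bases.
    """
    # Step 1: reads per kilobase of transcript (corrects for length).
    rpk = {t: counts[t] / (lengths_bp[t] / 1000.0) for t in counts}
    # Step 2: scale so that the values of all transcripts sum to one million.
    per_million = sum(rpk.values()) / 1_000_000
    return {t: value / per_million for t, value in rpk.items()}


if __name__ == "__main__":
    # Two made-up transcripts with equal read counts but different lengths.
    counts = {"transcript1": 100, "transcript2": 100}
    lengths = {"transcript1": 1000, "transcript2": 3000}
    print(tpm(counts, lengths))
```

With equal read counts, the shorter transcript receives the higher TPM value because the same number of reads is spread over fewer bases.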

27.12.2 Applications

  1. Transcriptomics plays a role in exploratory studies that identify genes which are differentially expressed (normal versus diseased status), co-expressed or interacting, thereby enhancing knowledge of gene function, regulation and interaction.
  2. Prognostic studies that decipher the effect of a drug in order to find the best treatment regimen.

27.13 Proteomics

Proteomics is the study of proteins, particularly their structures and functions. Protein structure is described at four levels: primary, secondary, tertiary and quaternary. The tertiary structure of a protein is important for its functionality. Fig. 27.5 shows these four levels of protein structure.

Fig. 27.5 Structure of proteins
(a) Primary structure: amino acid sequence of the polypeptide chain, linked by peptide bonds
(b) Secondary structure: alpha helices, beta sheets and loops
(c) Tertiary structure: arrangement of secondary structure elements in 3D space
(d) Quaternary structure: packing of several polypeptide chains

The proteome is the entire complement of proteins, including the modifications made to a particular set of proteins, produced by an organism or system. A proteome differs from cell to cell, changes constantly through its biochemical interactions with the genome and the environment, and reflects the gene expression repertoire (Fig. 27.6). The word "proteome" is a blend of "protein" and "genome" and was coined by Marc Wilkins in 1994. The mRNA level does not always correlate with protein content, since the amount of protein produced depends not only on the gene it is transcribed from but also on how efficiently the mRNA is translated, and post-translational modifications may change the function of the protein. For example, an mRNA produced in abundance may be degraded rapidly or translated inefficiently, resulting in a small amount of protein. Around 25,000 to 30,000 genes code for at least 100,000 proteins in human cells, mainly due to a variety of post-translational modifications such as phosphorylation, ubiquitination, methylation, acetylation, glycosylation, oxidation and nitrosylation. Thus, the set of proteins present in any cell at any given time can vary to a large extent, and the utility of transcriptomics alone becomes limited. Proteomic research aims to develop markers of disease expression and to find therapeutic solutions. The study of proteomics is considerably more complicated than genomics because the proteome varies with time and environment in the same cell.

Proteomics is rapidly becoming an essential part of biological research. In conjunction with advances in bioinformatics, it will have a major impact on our understanding of the phenotypes of both normal and diseased cells. Initially, 2D PAGE (two-dimensional polyacrylamide gel electrophoresis) was used to construct protein maps. More recently, mass spectrometry has been incorporated to enhance sensitivity and specificity and to provide results in a high-throughput format. The strategy used for identification of an unknown protein is given in Fig. 27.7. Proteomic analysis is generally costly, laborious and time consuming. It yields a huge amount of data that is difficult to interpret, so experiments need to be designed carefully and kept simple.
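One widely used mass spectrometry strategy for identifying an unknown protein is peptide mass fingerprinting: the protein is digested with trypsin, the peptide masses are measured, and they are compared with masses predicted from database sequences. Whether this is the strategy shown in the figure is an assumption. The sketch below performs the in silico side of that comparison, using the usual rule that trypsin cleaves after K or R unless the next residue is P; the function names and the example sequence are hypothetical.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # added once per peptide for the free termini


def tryptic_digest(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER_MASS


if __name__ == "__main__":
    protein = "MKWVTFRPLLKGG"  # made-up sequence for illustration
    for peptide in tryptic_digest(protein):
        print(peptide, round(peptide_mass(peptide), 4))
```

Missed cleavages and post-translational modifications, which shift the observed masses, are not modelled in this sketch.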

27.13.1 Tools

  • Western blot
  • Immunohistochemical staining
  • Enzyme-linked immunosorbent assay (ELISA)
  • Mass spectrometry
  • 2D gel electrophoresis
  • MALDI-TOF/TOF
  • MALDI-MS
  • LC-MS/MS, LC/LC-MS/MS
  • Protein microarrays (or biochips spotted with antibodies or proteins and probed with a complex protein mixture)
  • Bioinformatic tools (ExPASy Proteomics Tools)

27.13.2 Applications

  1. Identification of potential new drugs for the treatment of diseases. The proteins associated with a disease can be identified using information from the genome and proteome, and their 3D structures can provide the information needed to design drugs that interfere with the action of the protein.
  2. Identification of specific protein biomarkers to diagnose disease.
  3. Study of protein-protein interactions using protein microarrays.

Last modified: Friday, 2 November 2012, 6:27 AM