Introduction to Proteomics and Its Applications

Proteomics

Proteomics is the comprehensive scientific study of all the proteins expressed by an organism at a given time, collectively known as the proteome. Proteins are key macromolecules present in all living organisms, and proteomics is a rapidly growing area of molecular biology concerned with their systematic, large-scale analysis. It is based on the concept of the proteome as the complete set of proteins produced by a given cell or organism under a defined set of conditions. Proteins are involved in almost every biological function, so a comprehensive analysis of the proteins in the cell provides a unique global perspective on how these molecules interact and cooperate to create and maintain a working biological system. The cell responds to internal and external changes by regulating the level and activity of its proteins, so changes in the proteome, whether qualitative or quantitative, provide a snapshot of the cell in action.

The proteome is a complex and dynamic entity that can be defined in terms of the sequence, structure, abundance, localization, modification, interaction and biochemical function of its components, providing a rich and varied source of data. The analysis of these various properties requires an equally diverse range of technologies. We begin by tracing the origins of proteomics to the genomics revolution of the 1990s and by following its evolution from a concept to a mainstream technology with a current market value of over $1.5 billion.
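To make the idea of the proteome as a data source more concrete, the following minimal sketch models a single protein entry carrying the kinds of properties listed above (sequence, abundance, localization, modifications, interactions). All field names and values here are illustrative assumptions, not a standard schema or database format.

```python
# A minimal, hypothetical model of one protein's entry in a proteome dataset,
# mirroring the properties discussed above. Not a real schema.
from dataclasses import dataclass, field

@dataclass
class ProteomeEntry:
    name: str                                           # protein identifier
    sequence: str                                       # amino acid sequence
    abundance: float                                    # copies per cell (hypothetical units)
    localization: str                                   # subcellular compartment
    modifications: list = field(default_factory=list)   # e.g. phosphorylation sites
    interactions: list = field(default_factory=list)    # binding partners

entry = ProteomeEntry(
    name="ExampleKinase1",          # hypothetical protein
    sequence="MSTNPKPQRK",          # truncated for illustration
    abundance=5200.0,
    localization="cytosol",
    modifications=["phospho-Ser10"],
    interactions=["ExamplePartnerA"],
)
print(entry.name, entry.localization, entry.abundance)
```

Even this toy structure hints at why proteome analysis needs a diverse toolkit: each field corresponds to a different experimental technology.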

The overall goal of molecular biology research is to determine the functions of genes and their products, allowing them to be linked into pathways and networks, and ultimately providing a detailed understanding of how biological systems work. For most of the last 50 years, research in molecular biology focused on the isolation and characterization of individual genes and proteins because neither the information nor the technology was available for larger-scale investigations. The only way to study biological systems was to break them down into their components, look at these individually, and attempt to reassemble each system from the bottom up. This approach is known as reductionism, and it dominated the molecular life sciences until the early 1990s.

The face of biological research began to change in the 1990s as technological breakthroughs made it possible to carry out large-scale DNA sequencing. Until this point, the sequences of individual genes and proteins had accumulated slowly and steadily as researchers cataloged their new discoveries, as can be seen from the steady growth of the GenBank sequence database from 1980 to 1990 (Figure 1.1). The 1990s saw the advent of factory-style automated DNA sequencing, resulting in a massive explosion of sequence data (Figure 1.1). In the early 1990s, much of the new sequence data was represented by expressed sequence tags (ESTs), short fragments of DNA obtained by the random sequencing of cDNA libraries. In 1995, the first complete cellular genome sequence was published, that of the bacterium Haemophilus influenzae. Over the next few years, more than 100 further genome sequences were completed, including our own human genome, which was essentially finished in 2003.

The large-scale sequencing projects ushered in the genomics era, which effectively removed the information bottleneck and brought about the realization that biological systems, while large and very complex, were ultimately finite. The idea began to emerge that it might be possible to study biological systems in a holistic manner, simply by cataloging and enumerating their components, if sufficient amounts of data could be collected and analyzed. Unfortunately, while the technology for genome sequencing had advanced rapidly, the technology for studying the functions of the newly discovered genes lagged far behind. The sequence databases became clogged with anonymous sequences and gene fragments, and the problem was exacerbated by the unexpectedly large number of new genes found even in well-characterized organisms.

 



As an example, consider the baker's yeast Saccharomyces cerevisiae, which was thought to be one of the best-characterized model organisms prior to the completion of its genome-sequencing project in 1996. Over 2000 genes had been characterized in traditional experiments, and it was thought that genome sequencing would identify at most a few hundred more. Scientists therefore got a shock when the yeast genome turned out to contain over 6000 genes, nearly a third of which were unrelated to any previously identified sequence. Such genes were described as orphans because they could not be assigned to any classical gene family (Figure 1.2).

The availability of masses of anonymous sequence data for hundreds of different organisms has precipitated a number of fundamental changes in the way research is conducted in the molecular life sciences. Traditionally, gene function had been studied by moving from phenotype to gene, an approach sometimes called forward genetics. An observed mutant phenotype (or purified protein) was used as the starting point to map and identify the corresponding gene, and this led to the functional analysis of that gene and its product. The opposite approach, sometimes termed reverse genetics, is to take an uncharacterized gene sequence and modify it to see the effect on phenotype. As more uncharacterized sequences have accumulated in databases, the focus of research has shifted from forward to reverse genetics. Similarly, most research prior to 1995 was hypothesis driven: the researcher put forward a hypothesis to explain a given observation, then designed experiments to prove or disprove it. The genomics revolution instigated a progressive change towards discovery-driven research, in which the components of the system under investigation are collected irrespective of any hypothesis about how they might work.

The final paradigm shift concerns the sheer volume of data generated in today's experiments. Whereas in the past researchers focused on individual gene products and generated rather small amounts of data, the trend now is towards the analysis of many genes and their products and the generation of enormous datasets that must be mined for salient information using computers. Advances in genomics have thus forced parallel advances in bioinformatics: the computer-aided handling, analysis, extraction, storage and presentation of biological data.
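As a concrete, if toy, illustration of such computer-aided data handling, the sketch below parses a FASTA-format sequence file and summarizes its contents in plain Python. The file name is hypothetical, and real analyses would typically use an established library such as Biopython; this is only a minimal sketch of the idea.

```python
# Minimal illustration of computer-aided handling of sequence data:
# parse a FASTA file and summarize it. The file name is hypothetical.

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA-format file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

records = list(read_fasta("yeast_proteins.fasta"))  # hypothetical file
lengths = [len(seq) for _, seq in records]
print(f"{len(records)} sequences, mean length {sum(lengths)/len(lengths):.0f} residues")
```

The point is less the code itself than the change of scale it represents: the same few lines process six sequences or six thousand, which is what made discovery-driven cataloging feasible.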

 

The need for proteomics

Transcriptome analysis, genome-wide mutagenesis and RNA interference rose quickly to dominance among functional genomics technologies because they are all based on high-throughput clone generation and sequencing, two of the technology platforms that saw rapid development in the genome-sequencing era. But what do they really tell us about the working of biological systems? Nucleic acids, while undoubtedly important molecules in the cell, are only information carriers. The analysis of genes (by mutation) or of mRNA (by RNA interference or transcriptomics) can therefore only tell us about protein function indirectly. Proteins are the actual functional molecules of the cell. In this sense, they are the most relevant components of biological systems, and a true understanding of such systems can only come from the direct study of proteins.

The importance of proteomics in systems biology can be summarized as follows:

· The function of a protein depends on its structure and interactions, neither of which can be predicted accurately from sequence information alone. Only by looking at the structure and interactions of the protein directly can definitive functional information be obtained.

· Mutations and RNA interference are coarse tools for large-scale functional analysis. If the structure and function of a protein are already understood in fairly good detail, very precise mutations can be introduced to investigate its function further. However, for the large-scale analysis of gene function, the typical strategy is to inactivate each gene completely (resulting in the absence of the protein) or to overexpress it (resulting in overabundance or ectopic activity). In either case, the resulting phenotype may not be informative. For example, the loss of many proteins is lethal, and while this tells us that a protein is essential, it does not tell us what the protein actually does. Random mutagenesis can produce informative mutations serendipitously, but there is no systematic way to achieve this. Some proteins have multiple functions at different times and/or places, or have multiple domains with different functions, and these cannot be separated by blanket mutagenesis approaches.

· The abundance of a given transcript may not reflect the abundance of the corresponding protein. Transcriptome analysis tells us the relative abundance of different transcripts in the cell, and from this we infer the abundance of the corresponding proteins. However, the two may not be related because of post-transcriptional gene regulation. Not all the mRNAs in the cell are translated, so the transcriptome may include gene products that are not found in the proteome. Similarly, rates of protein synthesis and protein turnover differ among transcripts, so the abundance of a transcript does not necessarily correspond to the abundance of the encoded protein. The transcriptome may therefore not accurately represent the proteome either qualitatively or quantitatively (see the sketch after this list).

· Protein diversity is generated post-transcriptionally. Many genes, particularly in eukaryotic systems, give rise to multiple transcripts by alternative splicing, and these transcripts often produce proteins with different functions. Mutations, acting at the gene level, may therefore abolish the functions of several proteins at once. Splice variants are represented by different transcripts, so it should be possible to distinguish them by RNA interference and transcriptome analysis, but some transcripts give rise to multiple proteins whose individual functions can only be studied at the protein level.

· Protein activity often depends on post-translational modifications, which cannot be predicted from the level of the corresponding transcript. Many proteins are present in the cell as inert molecules that need to be activated by processes such as proteolytic cleavage or phosphorylation. Where it is the abundance of a specific post-translationally modified variant that matters, only proteomics can provide the information required to establish the function of a particular protein.

· The function of a protein often depends on its localization. Although there are some examples of mRNA localization in the cell, particularly in early development, most trafficking of gene products occurs at the protein level. The activity of a protein often depends on its location, and many proteins are shuttled between compartments (e.g. the cytosol and the nucleus) as a form of regulation. The abundance of a given protein in the cell as a whole may therefore tell only part of the story; in some cases, it is the distribution of a protein rather than its absolute abundance that is important.
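To make the transcript-versus-protein point above concrete, the sketch below computes a Spearman rank correlation between paired mRNA and protein abundance measurements. The numbers are invented purely for illustration, not real measurements; real paired datasets come from matched transcriptomic and proteomic experiments. A weak or negative correlation of this kind is exactly what post-transcriptional regulation can produce.

```python
# Toy illustration (invented numbers, not real data) of why transcript
# abundance can be a poor proxy for protein abundance.

mrna    = [120, 15, 340, 80, 5, 60]    # hypothetical transcript levels
protein = [30, 400, 90, 85, 150, 10]   # hypothetical protein copy numbers

def ranks(values):
    """Rank values from 1 (smallest) upward; ties not handled, for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation for tie-free data: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(f"rank correlation: {spearman(mrna, protein):.2f}")  # -0.43 for these toy values
```

With these invented values the correlation is actually negative, an exaggerated but useful reminder that inferring the proteome from the transcriptome alone can mislead.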




