Introduction to Proteomics and Its Applications
Proteomics
Proteomics is the comprehensive scientific study of all the proteins
expressed by an organism, its entire proteome, at a given time. Proteins
are key macromolecules present in all living organisms. Proteomics is a
rapidly growing area of molecular biology that is concerned with the
systematic, large-scale analysis of proteins. It is based on the concept of the
proteome as a complete set of proteins produced by a given cell or organism
under a defined set of conditions. Proteins are involved in almost every
biological function, so a comprehensive analysis of the proteins in the cell
provides a unique global perspective on how these molecules interact and
cooperate to create and maintain a working biological system. The cell responds
to internal and external changes by regulating the level and activity of its
proteins, so changes in the proteome, either qualitative or quantitative,
provide a snapshot of the cell in action. The proteome is a complex and dynamic
entity that can be defined in terms of the sequence, structure, abundance,
localization, modification, interaction and biochemical function of its
components, providing a rich and varied source of data. The analysis of these
various properties of the proteome requires an equally diverse range of
technologies. We begin by tracing the origins of proteomics in the genomics
revolution of the 1990s and following its evolution from a concept to a
mainstream technology with a current market value of over $1.5 billion.
The overall goal of molecular biology research is to
determine the functions of genes and their products, allowing them to be linked
into pathways and networks, and ultimately providing a detailed understanding
of how biological systems work. For most of the last 50 years, research in
molecular biology has focused on the isolation and characterization of
individual genes and proteins because there was neither the information nor the
technology available for larger scale investigations. The only way to study
biological systems was to break them down into their components, look at these
individually, and attempt to reassemble each system from the bottom up. This
approach is known as reductionism, and it dominated the molecular life sciences
until the early 1990s. The face of biological research began to change in the
1990s as technological breakthroughs made it possible to carry out large-scale
DNA sequencing. Until this point, the sequences of individual genes and
proteins had accumulated slowly and steadily as researchers cataloged their new
discoveries. This can be seen from the steady growth in the GenBank sequence
database from 1980–1990 (Figure 1.1). The 1990s saw the advent of factory-style
automated DNA sequencing, resulting in a massive explosion of sequence data
(Figure 1.1). In the early 1990s, much of the new sequence data was represented
by expressed sequence tags (ESTs), short fragments of DNA obtained by the
random sequencing of cDNA libraries. In 1995, the first complete cellular
genome sequence was published, that of the bacterium Haemophilus influenzae. In
the next few years, over 100 further genome sequences were completed, including
our own human genome which was essentially finished in 2003. The large-scale
sequencing projects ushered in the genomics era, which effectively removed the
information bottleneck and brought about the realization that biological
systems, while large and very complex, were ultimately finite. The idea began
to emerge that it might be possible to study biological systems in a holistic
manner simply by cataloging and enumerating the components if sufficient
amounts of data could be collected and analyzed. Unfortunately, while the
technology for genome sequencing had advanced rapidly, the technology for
studying the functions of the newly discovered genes lagged far behind. The
sequence databases became clogged with anonymous sequences and gene fragments,
and the problem was exacerbated by the unexpectedly large number of new genes
found even in well-characterized organisms.
As an example, consider the baker’s yeast Saccharomyces
cerevisiae, which was thought to be one of the best-characterized model
organisms prior to the completion of the genome-sequencing project in 1996.
Over 2000 genes had been characterized in traditional experiments and it was
thought that genome sequencing would identify at most a few hundred more.
Scientists got a shock when they found the yeast genome contained over 6000
genes, nearly a third of which were unrelated to any previously identified
sequence. Such genes were described as orphans because they could not be
assigned to any classical gene family (Figure 1.2). The availability of masses
of anonymous sequence data for hundreds of different organisms has precipitated
a number of fundamental changes in the way research is conducted in the
molecular life sciences. Traditionally gene function had been studied by moving
from phenotype to gene, an approach sometimes called forward genetics. An
observed mutant phenotype (or purified protein) was used as the starting point
to map and identify the corresponding gene, and this led to the functional
analysis of that gene and its product. The opposite approach, sometimes termed
reverse genetics, is to take an uncharacterized gene sequence and modify it to
see the effect on phenotype. As more uncharacterized sequences have accumulated
in databases, the focus of research has shifted from forward to reverse
genetics. Similarly, most research prior to 1995 was hypothesis driven, in that
the researcher put forward a hypothesis to explain a given observation, and
then designed experiments to prove or disprove it. The genomics revolution
instigated a progressive change towards discovery-driven research, in which the
components of the system under investigation are collected irrespective of any
hypothesis about how they might work. The final paradigm shift concerns the
sheer volume of data generated in today’s experiments. Whereas in the past
researchers focused on individual gene products and generated rather small
amounts of data, today’s large-scale experiments produce enormous datasets
that must be stored, managed, and interpreted computationally.
The need for proteomics
Transcriptome analysis, genome-wide mutagenesis and RNA
interference have risen quickly to dominate functional genomics technologies
because they are all based on high-throughput clone generation and sequencing,
two of the technology platforms that saw rapid development in the
genome-sequencing era. But what do they really tell us about the working of biological
systems? Nucleic acids, while undoubtedly important molecules in the cell, are
only information-carriers. Therefore, the analysis of genes (by mutation) or of
mRNA (by RNA interference or transcriptomics) can only tell us about protein
function indirectly. Proteins are the actual functional molecules of the cell.
The
importance of proteomics in systems biology can be summarized as follows:
· The function of a protein depends on its structure and
interactions, neither of which can be predicted accurately based on sequence
information alone. Only by looking at the structure and interactions of the
protein directly can definitive functional information be obtained.
· Mutations and RNA interference are coarse tools for large-scale
functional analysis. If the
structure and function of a protein are already understood in fairly good
detail, very precise mutations can be introduced to investigate its function
further. However, for the large-scale analysis of gene function, the typical
strategy is to completely inactivate each gene (resulting in the absence of the
protein) or to overexpress it (resulting in overabundance or ectopic activity).
In each case, the resulting phenotype may not be informative. For example, the
loss of many proteins is lethal, and while this tells us the protein is
essential it does not tell us what the protein actually does. Random
mutagenesis can produce informative mutations serendipitously, but there is no
systematic way to achieve this. Some proteins have multiple functions in
different times and/or places, or have multiple domains with different
functions, and these cannot be separated by blanket mutagenesis approaches.
· The abundance of a given transcript may not reflect the abundance of
the corresponding protein.
Transcriptome analysis tells us the relative abundance of different transcripts
in the cell, and from this we infer the abundance of the corresponding protein.
However, the two may not be related because of post-transcriptional gene
regulation. Not all the mRNAs in the cell are translated, so the transcriptome
may include gene products that are not found in the proteome. Similarly, rates
of protein synthesis and protein turnover differ among transcripts, therefore
the abundance of a transcript does not necessarily correspond to the abundance
of the encoded protein. The transcriptome may not accurately represent the
proteome either qualitatively or quantitatively.
· Protein diversity is generated post-transcriptionally. Many genes, particularly in
eukaryotic systems, give rise to multiple transcripts by alternative splicing.
These transcripts often produce proteins with different functions. Mutations,
acting at the gene level, may therefore abolish the functions of several
proteins at once. Splice variants are represented by different transcripts so
it should be possible to distinguish them by RNA interference and transcriptome
analysis, but some transcripts give rise to multiple proteins whose individual
functions cannot be studied other than at the protein level.
· Protein activity often depends on post-translational modifications, which are
not predictable from the level of the corresponding transcript. Many proteins
are present in the cell as inert molecules, which need to be activated by
processes such as proteolytic cleavage or phosphorylation. Where variations in
the abundance of a specific post-translational variant are significant, only
proteomics can provide the information required to establish the function of a
particular protein.
· The function of a protein often depends on its localization. While
there are some examples of
mRNA localization in the cell, particularly in early development, most
trafficking of gene products occurs at the protein level. The activity of a
protein often depends on its location, and many proteins are shuttled between
compartments (e.g. the cytosol and the nucleus) as a form of regulation. The
abundance of a given protein in the cell as a whole may therefore tell only
part of the story. In some cases, it is the distribution of a protein rather
than its absolute abundance that is important.