Genomics is the study of biological problems using genetic information derived from whole genomic sequences rather than from a small number of genes (as is done in conventional genetics). All of the information that determines physical characteristics, disease susceptibility, and certain behavioral traits resides in the nucleotide sequence making up the genome of an organism. Through the use of automated DNA sequence analysis, it has been possible over the past 10 years to determine the nearly complete genomic sequences of a number of microbial, animal, and plant species, including humans. Two of the most remarkable discoveries of the genomics era are that even complex organisms have relatively few protein-coding genes (20,000-30,000) and that many of these genes encode products that are structurally and functionally conserved among all organisms. Genomic information has allowed investigators to begin studying groups of gene products that work together to determine particular biochemical pathways and phenotypes. This field is called functional genomics or proteomics and will be a major focus of scientific endeavor for at least the next decade. Since techniques exist that allow investigators to add or subtract specific genes in model organisms such as yeast, Arabidopsis, fruit fly, zebrafish, and mouse, rapid progress is being made toward understanding the function of all genes and the interactions between their protein products.
Bioinformatics is an emerging discipline that uses computational techniques to present and analyze genomic and proteomic information. The vast amount of information contained in a complex eukaryotic genome (~3 × 10^9 bp) is too great for the human mind to comprehend without the assistance of a computer. Computer databases are used to store the large amounts of sequence information that are now available and to construct linear genomes from the overlapping short sequences produced by automated sequencing of individual cloned DNA fragments. Programs exist that allow anyone to rapidly compare an unknown DNA sequence with all of the available information contained in a large government-supported database known as GenBank. For example, the sequence of a PCR fragment derived from a mutant form of a human gene that is associated with a particular genetic disease can be compared with GenBank sequences to determine the wild-type gene. GenBank files contain specific contiguous DNA sequences with information detailing open reading frames that putatively code for proteins. If the sequence is known to code for specific mRNA molecules, the start and stop points and introns are noted and the open reading frames are virtually translated into protein sequences using the genetic code. GenBank files also often contain information about the function of the gene product(s) and links to literature sources detailing the original studies that have been done on the gene. Separate databases exist that catalog all known protein sequences and structures. It has been learned that many different proteins share common structural features that are the result of evolutionary conservation of functional elements. For example, nucleotide-binding domains can often be identified within proteins encoded by newly discovered genes through similarity with other proteins that are known to bind nucleotides. Similarly, much can be “guessed” about the structure of a protein encoded by a newly discovered gene by comparing the computed sequence with that of other proteins whose structures have been solved by X-ray crystallography. Proteomics tools exist that can rapidly detect putative sites for protein modifications such as phosphorylation, glycosylation, and proteolytic cleavage. Signal peptides, membrane-spanning domains, and overall protein stability can also be predicted using only primary DNA sequence information as a starting point.
In this exercise, you will use on-line genomics and proteomics tools to analyze a random sequence of yeast DNA that was cloned in Exercise #5. Most programs used for genetic analysis are proprietary and quite expensive. However, some individual programs exist in the public domain. Curagen, a for-profit bioinformatics company, has compiled many of these free programs into a suite called GeneScape. You can use the GeneScape programs simply by registering a username with the company and receiving a password by e-mail. We will practice using the GeneScape suite during the class period using example DNA sequences. The yeast DNA sequence that you will analyze for your lab report will be sent to you by e-mail so that you can work on your own time. All of the sequences that will be sent out for this exercise have been pre-screened by your instructors and are known to contain at least part of a protein-coding region. You will determine the identity of the unknown DNA fragment by comparison to the yeast genome. You will generate a virtual restriction enzyme cleavage map of the DNA sequence. The entire protein-coding region of the gene represented by your cloned fragment will be collected and translated to identify all open reading frames. After you select the open reading frame that is most likely to code for the protein of interest, the protein will be analyzed for known structural features and modification sites. Finally, the protein sequence will be compared with the SwissProt protein database. Aligning the sequence with the most similar proteins in the database will reveal the relationship of the yeast protein to similar proteins from other organisms.
Before you begin, it might be helpful to copy the DNA sequence for the unknown gene from the e-mail message and paste it into a text editor or word processing program. That way you can always have the sequence available for pasting into other programs. Do the same thing with the derived protein sequence when you reach that step.
Step I. Log onto the GeneScape Portal and Add Your Unknown Sequence
1) Go to GeneScape (Note: don’t type “www”). Log in or register. Choose a username and you will immediately be sent a password. Log in using your password. If you like, you can change the password to something easier to remember.
2) Accept the agreement; the Curatools Analysis Page should then open.
3) Click “Add a Sequence to Analyze”.
4) Enter a name for your sequence (we suggest using the name that you were given in the e-mail, such as M1, T2, F2, etc.). Paste a copy of your unknown sequence into the sequence box and click “Add”.
Step II. Conduct a BLAST Similarity Search to Identify the Cloned Fragment
5) Under the heading DNA Analysis, click “Blast, *” (Note: * means any text).
6) Under DNA Analysis, Similarity Search, click “DNA Curablastntm Search” and click “Run”.
7) The results of your search will load after a few minutes, depending on how busy the server is that day.
You will be presented with a number of sequences from GenBank that match your unknown sequence most closely, listed in descending order of similarity. Below the list of sequences are the homologous regions of each GenBank sequence aligned with your sequence.
8) Click and view the most similar sequences. You will notice that the GenBank-formatted sequence contains much information about the gene or genes contained in the sequence. Add the 4 most similar sequences to the analysis list. Their sequence names should follow the name of your own sequence. Any of these sequences can be selected for analysis at any time.
9) Print out the list of similar sequences and the first 4 sequence alignments for your lab report.
Read as much information regarding BLAST searches as you can (also check out the NCBI home page and look up BLAST: http://www.ncbi.nlm.nih.gov/BLAST/) and then answer the following:
Questions
1) What is the E-value for the "hit" with the highest degree of similarity to the query sequence? What does this value mean?
2) Is your cloned sequence IDENTICAL to any sequences in GenBank? What are the differences between your sequence and the yeast sequences in GenBank? What are possible sources of discrepancies between your sequence and the yeast sequences contained in GenBank?
3) What genes does your cloned sequence represent?
4) What is the putative function of the gene, if known?
5) Does your cloned sequence share significant similarity with any non-yeast genes? Are their functions similar?
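As background for Question 1, the E-value can be read as the number of alignments scoring at least as well as the hit that would be expected purely by chance in a search of that size. The following is a minimal sketch, assuming the standard relation E = m × n × 2^(-S') between a bit score S', the query length m, and the total database length n. The function name and the example numbers are illustrative assumptions, not output from GeneScape, and real BLAST programs use corrected "effective" lengths that this sketch ignores.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance alignments with at least this bit score.

    Uses E = m * n * 2**(-S'), ignoring the edge corrections
    (effective lengths) that real BLAST programs apply.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: a 500 bp query searched against roughly 12 million bp.
for score in (20, 40, 80):
    print(score, expected_hits(score, 500, 12_000_000))
```

Each additional bit of score halves the expected number of chance hits, and doubling the database size doubles it, which is why the same alignment can receive different E-values in different databases.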
Step III. Restriction Mapping
10) On the DNA Analysis page, select the yeast gene sequence that most closely matches your cloned sequence. If you checked the GenBank file when you viewed it, the sequence should be on your list. Alternatively, you can copy the nucleotide sequence (with no additional text) from the GenBank file and paste it into the DNA analysis box using the “Add” function on the DNA Analysis page.
11) Under Sequence Manipulation, check and run the Restriction Analysis program.
12) Print the linear map indicating the positions cleaved by all enzymes that recognize 6 bp sequences. Does Eco RI or Hind III cut the sequence? How many times?
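A restriction map is, at bottom, a search for short recognition sequences. The sketch below is one way to check the program's answer by hand for Eco RI (GAATTC) and Hind III (AAGCTT). The function name and the example sequence are made up for illustration, and the positions reported are where the recognition site begins, not the exact bond that is cut.

```python
# Recognition sequences for two common 6 bp cutters. Both sites are
# palindromic, so scanning one strand finds every site on both strands.
SITES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT"}


def find_sites(dna, sites=SITES):
    """Return {enzyme: [1-based start positions of each recognition site]}."""
    dna = dna.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start + 1)  # convert to 1-based numbering
            start = dna.find(site, start + 1)
        hits[enzyme] = positions
    return hits


example = "ccGAATTCttaagcttGGGAATTCa"  # made-up sequence, not yeast DNA
print(find_sites(example))
```

The number of fragments produced from a linear molecule is one more than the number of sites found for that enzyme.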
Step IV. Protein Analysis
With the yeast gene sequence that most closely matches your cloned sequence still selected, select and run the Robot DNA Translation program from the Analysis page.
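The translation program reads the DNA in every possible frame on both strands. A minimal sketch of the same idea, assuming the standard genetic code, is shown here; the function names are illustrative and are not part of GeneScape. Stop codons are written as "*", and any codon containing a character other than A, C, G, or T is written as "X".

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    frames = {}
    for label, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch that starts at M and runs to the next stop (or the end)."""
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


for name, protein in six_frames("ccATGGCCAAATTTTAAgg").items():  # made-up DNA
    print(name, protein, longest_orf(protein))
```

A frame that carries a real coding region usually shows one long run from a methionine to a stop, while the other frames are broken up by frequent stop codons.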
13) Print out all 6 possible protein reading frames for your report.
14) Click on the correct reading frame link to display the translated protein sequences.
15) Click on the first methionine residue of the longest open reading frame. This will cause the protein sequence to be displayed as a SWISS-PROT (a database format) virtual translation product.
16) Scan the sequence with the Prosite program for protein recognition pattern predictions. Print out the predicted recognition sequences for your report.
17) Use the “Back” control to return to the Prosite sequence. Run the Prot Param program to determine the predicted parameters: number of amino acid residues, molecular weight (MW), isoelectric point (pI), amino acid composition, etc. Print this information out for your report.
18) Use the “Back” control to return to the Prosite sequence. Select the NCBI BLAST program and run BLAST P to compare your translated sequence to all of the known protein sequences. Print out the list of all related protein sequences for your report.
19) Click the accession numbers of the proteins with the greatest similarity to your protein and read as much information as possible about each of the related proteins in order to learn about possible functions of the protein. You might check out the literature links as well from both the GenBank and SWISS-PROT pages. Include this information in the answer to question #4.
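Several of the Prot Param values in step 17 follow directly from the amino acid sequence. The sketch below estimates length, composition, and molecular weight, assuming commonly tabulated average residue masses (in daltons); the program itself may use slightly different values, and the isoelectric point is not attempted here. The function name and the example peptide are illustrative.

```python
from collections import Counter

# Average residue masses in daltons (free amino acid minus one water).
# These are commonly tabulated values, assumed here for illustration.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one water is added back for the free N- and C-termini


def protein_summary(seq):
    """Length, composition, and approximate MW for a one-letter sequence.

    Expects only the 20 standard amino acid letters; a trailing '*'
    (stop) is removed first.
    """
    seq = seq.upper().rstrip("*")
    counts = Counter(seq)
    mass = sum(RESIDUE_MASS[aa] * n for aa, n in counts.items()) + WATER
    return {"length": len(seq), "mw": mass, "composition": dict(counts)}


print(protein_summary("MKKLAG*"))  # made-up peptide
```

A quick sanity check on any reported MW is that an average residue contributes roughly 110 daltons, so a 300-residue protein should come out near 33,000.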
6) What is an open reading frame (ORF)?
7) For any given DNA sequence, how many possible reading frames are there? Explain.
8) Which reading frame most likely contains the open reading frame for the unknown protein? How can you tell (think about start and stop codons)?
9) Based on the information you have gathered from your analysis, what is the most likely function of the protein represented by the gene fragment you have cloned?
10) Based on what you have learned about the function and subcellular localization of the protein, are any of the protein localization or modification sites determined by Prosite likely NOT to be used?
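When thinking about the last question, it may help to see how simple a Prosite-style pattern is. The sketch below converts two widely cited patterns, N-glycosylation (N-{P}-[ST]-{P}) and protein kinase C phosphorylation ([ST]-x-[RK]), into regular expressions and reports every match, including overlapping ones. The function name and the test peptide are illustrative, and this is not the Prosite program itself.

```python
import re

# Prosite-style patterns rewritten as regular expressions:
#   {P} (anything except P) becomes [^P]; x (any residue) becomes "."
PATTERNS = {
    "N-glycosylation": "N[^P][ST][^P]",   # N-{P}-[ST]-{P}
    "PKC phosphorylation": "[ST].[RK]",   # [ST]-x-[RK]
}


def scan(protein, patterns=PATTERNS):
    """Return (pattern name, 1-based start, matched text), sorted by position."""
    protein = protein.upper()
    hits = []
    for name, regex in patterns.items():
        # The lookahead lets overlapping matches be reported as well.
        for m in re.finditer(f"(?=({regex}))", protein):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda h: h[1])


print(scan("MNGSAKTPRNPSA"))  # made-up peptide
```

Because the patterns are only three or four residues long, they match many sequences by chance, so a predicted site is a candidate that still has to make sense for where the protein is found in the cell.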