Using DSP to look at Genes

1. Looking for globins.

Hemoglobin is the protein in your blood that carries oxygen.  As with other proteins, the code for making hemoglobin is in your DNA.  (DNA is a long molecule composed of four types of nucleotides: adenine, cytosine, guanine and thymine, or 'a', 'c', 'g' and 't'.)  The file humhbb.txt contains the GenBank record for the region of the human genome that codes for hemoglobin.  It is shown below; scan towards the bottom to see the DNA sequence itself.

Several types of globin are represented in this sequence (go to the part of the file labeled "FEATURES" and you will see where the coding regions for the various globins are located).


Your task is simple.  The files seqa.txt and seqb.txt hold the beta-globin sequences from a frog and from a mouse; the problem is, you don't know which is which.  Determine which one is from the frog (Xenopus tropicalis, a carnivorous African frog with claws on its toes) and which is from the mouse (Mus musculus, the house mouse).




You may find the following code useful (ReadGenBank.m).  It opens a GenBank file, finds the location of the DNA sequence, and then creates a vector with a "1" at every position where the DNA sequence has an "a".  You can do likewise for "c", "g" and "t".
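The steps just described (find the sequence, then build a 0/1 vector per base) can be sketched as follows.  This is not ReadGenBank.m itself; it is a minimal Python illustration that assumes the standard GenBank flat-file layout, in which the sequence sits between a line starting with "ORIGIN" and a line starting with "//".  The function names read_genbank_sequence and indicator, and the tiny demo record, are made up for this example.

```python
def read_genbank_sequence(text):
    """Return the DNA sequence of a GenBank record as one lowercase string.

    Assumes the flat-file layout: sequence lines follow "ORIGIN" and
    the record ends with "//".
    """
    bases = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if line.startswith("//"):
            break
        if in_sequence:
            # Sequence lines look like "       61 acgtacgtac gtacgtacgt";
            # keep only the letters, dropping position numbers and spaces.
            bases.extend(ch for ch in line.lower() if ch.isalpha())
    return "".join(bases)


def indicator(sequence, base):
    """Vector with 1 where `sequence` has `base`, 0 everywhere else."""
    return [1 if ch == base else 0 for ch in sequence]


# Tiny made-up record (hypothetical, for illustration only).
record = """LOCUS       DEMO
FEATURES             Location/Qualifiers
ORIGIN
        1 acgtta
//
"""

seq = read_genbank_sequence(record)
a_vec = indicator(seq, "a")
print(seq)    # acgtta
print(a_vec)  # [1, 0, 0, 0, 0, 1]
```

For a real file, the text could be obtained with something like open("humhbb.txt").read() before calling read_genbank_sequence.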


When you present your results:

2. Looking at AIDS.

The file hivmn.txt holds the GenBank record for HIV.  The files seq1.txt and seq2.txt hold the sequences for another strain of HIV and for one strain of SIV (Simian Immunodeficiency Virus, the monkey counterpart of HIV).  Which is which?
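One possible DSP approach to a "which is which" question, offered here as an assumption rather than the intended solution, is to cross-correlate the per-base indicator vectors of the reference with those of each unknown and compare the peak heights.  Summed over the four bases, the correlation at a given lag counts how many positions match at that alignment.  The helper names below are hypothetical, and the short strings stand in for real sequences.

```python
def indicator(sequence, base):
    """Vector with 1 where `sequence` has `base`, 0 everywhere else."""
    return [1 if ch == base else 0 for ch in sequence]


def cross_correlate(x, y):
    """Full cross-correlation of x and y, one value per relative shift."""
    n, m = len(x), len(y)
    result = []
    for lag in range(-(m - 1), n):
        total = 0
        for j in range(m):
            i = j + lag
            if 0 <= i < n:
                total += x[i] * y[j]
        result.append(total)
    return result


def match_score(reference, unknown):
    """Largest number of matching bases over all alignments."""
    total = None
    for base in "acgt":
        r = cross_correlate(indicator(reference, base),
                            indicator(unknown, base))
        total = r if total is None else [p + q for p, q in zip(total, r)]
    return max(total)


# Made-up toy sequences, not real viral data.
reference = "acgtacgtta"
close = "cgtacgt"   # an exact stretch of the reference
far = "ggggggg"     # shares little with the reference

print(match_score(reference, close))  # 7
print(match_score(reference, far))    # 2
```

The direct double loop is fine for toy inputs but slow for sequences thousands of bases long; an FFT-based correlation, or a library routine such as MATLAB's xcorr, would be the usual choice at that size.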


When you present your results:

3. To think about and possibly implement.

Can you think of ways to reduce the "noise" without greatly decreasing the "signal" (i.e., so that the signal-to-noise ratio (SNR) increases)?  Show that it works (or doesn't).  What limits how effective your technique is?  (10 points)
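To show whether a technique works, it helps to put a number on the SNR before and after.  The sketch below uses one possible definition, chosen here as an assumption: the height of the correlation peak above the mean of the remaining values, divided by the standard deviation of those remaining values.  The function name peak_snr and the sample numbers are made up for illustration.

```python
import statistics


def peak_snr(correlation):
    """Peak height above the background mean, in background standard deviations.

    "Background" here means every correlation value except the peak itself
    (an assumed definition; other choices are possible).
    """
    peak_index = max(range(len(correlation)), key=lambda k: correlation[k])
    background = correlation[:peak_index] + correlation[peak_index + 1:]
    background_mean = statistics.mean(background)
    background_std = statistics.pstdev(background)
    if background_std == 0:
        return float("inf")  # perfectly flat background
    return (correlation[peak_index] - background_mean) / background_std


# Made-up correlation values: a peak of 10 over a background near 2.
before = [1, 3, 2, 10, 2, 3, 1]
print(round(peak_snr(before), 3))  # 9.798
```

Computing this figure on the correlation output with and without a candidate noise-reduction step gives a direct comparison.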