Spring 2019

The questions below are due on Sunday February 24, 2019; 11:59:00 PM.

State Machines in my RNA

We're going to start working on the server this week, and the language of choice for that is Python. To get some prep work in, let's write a state machine in Python.

1) The Setup

In this exercise, we will model a biological process using state machines. In all biological cells, mRNA (messenger RNA) is translated into protein. mRNA can be modeled as strings of bases, each having one of four values—A, C, G, or U.

Molecular machines called ribosomes are responsible for creating proteins from mRNA. Ribosomes start by reading in the sequence of bases on the mRNA and looking for Open Reading Frames (ORFs). ORFs are sequences within the mRNA string that start with the start triplet, 'AUG' (inclusive) and end with one of three stop triplets: 'UAA', 'UAG', or 'UGA' (exclusive).
Within the ORF, the ribosome reads the bases in sequential triplets (called codons). Each codon (there are $4^{3} = 64$ possible) corresponds to one of 20 amino acids or one of the three stop codons mentioned above. The ribosome uses molecules called tRNAs (which act like a molecular Python dictionary) to map codons to amino acids. The ribosome continues to read triplets and output amino acids until it reads a STOP codon, after which it continues reading but does not translate codons into amino acids. Regions of the mRNA both before and after the ORF are not translated.
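
To make the ORF and codon definitions concrete, here is a minimal sketch that enumerates all 64 possible codons and pulls out the codons of the first ORF in a string. The names BASES, STOP_CODONS, all_codons, and first_orf_codons are hypothetical and not part of the exercise. The helper looks at the whole string at once, so it illustrates the definitions only and is not the one-base-at-a-time state machine the exercise asks for.

```python
from itertools import product

BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Every ordered triplet of the four bases: 4**3 == 64 codons.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]


def first_orf_codons(mrna):
    """Hypothetical helper: codons of the first ORF in mrna.

    The start codon is included and the stop codon is excluded.
    Returns [] when there is no start codon. If no stop codon follows,
    reading simply runs to the end and ignores a trailing partial triplet.
    """
    start = mrna.find("AUG")          # first start triplet, in any frame
    if start == -1:
        return []
    codons = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break                     # stop codon ends the ORF (exclusive)
        codons.append(codon)
    return codons


print(len(all_codons))                         # 64
print(first_orf_codons("CCAUGGUUAAAUAACC"))    # ['AUG', 'GUU', 'AAA']
```

In the made-up sequence above, the leading CC and the trailing CC fall outside the ORF, and UAA closes it without being included.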

Figure: The RNA State Machine

2) RibosomeSM

Define a new Python function[1] ribosomeSM to model the process of protein synthesis. This function will take in one argument, a Python string representing a piece of mRNA, with each letter representing a separate base ('A' for Adenine, 'C' for Cytosine, 'G' for Guanine, or 'U' for Uracil).

The function should return a Python list whose entries are either None, when there is no output, or strings representing amino acids. When the state machine begins, every new nucleotide read in should result in an appended None. Once a start sequence is encountered, the system should append only amino acids, using the lookup table below (note that this means that inside an ORF the list is extended one-third as often as outside it: once for every third base instead of once for every base). Once a stop codon is reached, the state machine should continue scanning, adding None for every new base. The function should return the list it generates when completed.

If a full triplet has not been read in (recall that we are reading in bases one at a time), the state machine should output None.
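
One way to picture the behavior described above is the following sketch of a three-state machine. It is an illustration under stated assumptions, not the reference solution: the names ribosome_sketch and DEMO_TABLE and the state labels are hypothetical, and DEMO_TABLE is a deliberately abbreviated stand-in for the full lookup table. It also assumes that nothing is appended for the first two bases of a codon inside the ORF, that the stop codon itself contributes a single None, and that nothing is translated after the stop codon. The text leaves each of these points open to interpretation.

```python
# Abbreviated stand-in for the full codon table (illustration only).
DEMO_TABLE = {
    "AUG": "Methionine", "GUU": "Valine", "AAA": "Lysine",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}


def ribosome_sketch(mrna, table=DEMO_TABLE):
    """Hypothetical sketch: SCAN -> TRANSLATE -> DONE, one base at a time."""
    state = "SCAN"   # looking for the start triplet
    window = ""      # bases collected so far for the current triplet
    output = []
    for base in mrna:
        if state == "SCAN":
            window = (window + base)[-3:]    # keep only the last three bases
            if window == "AUG":
                output.append(table[window])  # start codon is translated
                state = "TRANSLATE"
                window = ""
            else:
                output.append(None)
        elif state == "TRANSLATE":
            window += base
            if len(window) == 3:
                amino = table[window]         # full table covers all 64 codons
                if amino == "STOP":
                    output.append(None)       # assumption: one None for the stop codon
                    state = "DONE"
                else:
                    output.append(amino)
                window = ""
            # assumption: nothing is appended for a partial codon in the ORF
        else:  # DONE: past the ORF, nothing more is translated
            output.append(None)
    return output


result = ribosome_sketch("CCAUGGUUAAAUAACC")
print([i for i in result if i is not None])   # ['Methionine', 'Valine', 'Lysine']
```

For this made-up 16-base input the returned list has 10 entries, because the nine bases after the start codon that form complete triplets contribute only three entries. If a second ORF after the stop codon should also be translated, the DONE branch could instead reset the state to "SCAN".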

See the example output below:

>>> result = ribosomeSM(testSeq)
>>> print([i for i in result if i is not None])
['Methionine', 'Valine', 'Serine', 'Lysine', 'Glycine', 'Glutamic Acid',
'Glutamic Acid', 'Leucine']
We will use the following lookup table to map codon triplets to amino acids:
rnaToAminoAcid = {
    "UUU": "Phenylalanine", "UUC": "Phenylalanine", "UUA": "Leucine", "UUG": "Leucine",
    "UCU": "Serine", "UCC": "Serine", "UCA": "Serine", "UCG": "Serine",
    "UAU": "Tyrosine", "UAC": "Tyrosine", "UAA": "STOP", "UAG": "STOP",
    "UGU": "Cysteine", "UGC": "Cysteine", "UGA": "STOP", "UGG": "Tryptophan",
    "CUU": "Leucine", "CUC": "Leucine", "CUA": "Leucine", "CUG": "Leucine",
    "CCU": "Proline", "CCC": "Proline", "CCA": "Proline", "CCG": "Proline",
    "CAU": "Histidine", "CAC": "Histidine", "CAA": "Glutamine", "CAG": "Glutamine",
    "CGU": "Arginine", "CGC": "Arginine", "CGA": "Arginine", "CGG": "Arginine",
    "AUU": "Isoleucine", "AUC": "Isoleucine", "AUA": "Isoleucine", "AUG": "Methionine",
    "ACU": "Threonine", "ACC": "Threonine", "ACA": "Threonine", "ACG": "Threonine",
    "AAU": "Asparagine", "AAC": "Asparagine", "AAA": "Lysine", "AAG": "Lysine",
    "AGU": "Serine", "AGC": "Serine", "AGA": "Arginine", "AGG": "Arginine",
    "GUU": "Valine", "GUC": "Valine", "GUA": "Valine", "GUG": "Valine",
    "GCU": "Alanine", "GCC": "Alanine", "GCA": "Alanine", "GCG": "Alanine",
    "GAU": "Aspartic Acid", "GAC": "Aspartic Acid", "GAA": "Glutamic Acid", "GAG": "Glutamic Acid",
    "GGU": "Glycine", "GGC": "Glycine", "GGA": "Glycine", "GGG": "Glycine",
}
Note that the stop codons map to the string 'STOP' in this dictionary, which may simplify looking them up. While you don't have to code this as a state machine, it is a productive way to approach this problem and we strongly encourage it.
While this might seem a very non-EECS thing to code up, the parsing mechanism that ribosomes carry out on RNA is very similar to the parsing we'll need to do on incoming data streams later in the semester (or really anytime you parse structured data), so it is a good exercise to mess with.
Enter your code for the ribosomeSM function below:
def ribosomeSM(input):
    pass

[1] Finally, some Python in 6.08!

This page was last updated on Tuesday February 19, 2019 at 03:09:56 PM (revision 029e103).