a coding theorem and genetics

Here is a nice result from an information theory book, the noiseless coding theorem: suppose a source emits M messages x_1, ..., x_M with probabilities p_1, ..., p_M, and suppose a uniquely decipherable code over a D-letter alphabet assigns to each message x_i a code word of length n_i. Then the probability-weighted average code-word length obeys

sum_i p_i*n_i >= H/log2(D), where H = -sum_i p_i*log2(p_i) is the entropy of the source.

(A code is uniquely decipherable if every finite sequence of code characters corresponds to at most one message. If no code word is a prefix of another code word, then the code is also called instantaneous. Although a uniquely decipherable code is not necessarily instantaneous, the existence of such a code implies the existence of an instantaneous one.)

Let's try to apply this theorem to genetics...

The genetic code alphabet consists of only four letters: A, C, G, U. Therefore we have D=4 in the above formula.

We need to assign a code word to each of the 20 standard amino acids: Alanine, Arginine, Asparagine, Aspartic Acid, Cysteine, Glutamic Acid, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine. Therefore we have M=20 in the above formula.

In vertebrates the relative observed frequencies of the above amino acids are respectively 7.4, 4.2, 4.4, 5.9, 3.3, 5.8, 3.7, 7.4, 2.9, 3.8, 7.6, 7.2, 1.8, 4.0, 5.0, 8.1, 6.2, 1.3, 3.3 and 6.8 percent. (Note that these numbers sum to 100, up to rounding.) In other words, 7.4 percent of the amino acids needed for a typical protein synthesis will be Alanine, 4.2% of them will be Arginine, 4.4% will be Asparagine, etc. This vector of percentages will be our probability variables "p_i". (e.g. x_2 = Arginine and p_2 = 0.042)

Let's enforce the additional requirement that the length of each code word be equal. In other words, for all "i" we set "n_i" equal to some common "n". (There may be some structural justifications for this extra assumption, but I will not speculate, since my knowledge of molecular biology is close to nil.)

After inserting the numbers into their appropriate places, the theorem reveals that the lower bound for the probability-weighted average of code-word lengths is 2.1. (The entropy of the frequency vector above is H ≈ 4.2 bits, and log2(4) = 2, so the bound is roughly 4.2/2 = 2.1.) Since we set all "n_i" equal to "n", the probability-weighted average is simply "n". Note that the code-word length "n" has to be an integer. Hence what the noiseless coding theorem tells us is that the minimum value "n" can assume is 3.
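As a sanity check, here is a minimal Python sketch of this computation (my own illustration, not from the book), using the frequency vector quoted above:

```python
import math

# Observed amino-acid frequencies in vertebrates (percent), in the order
# listed above (Alanine, Arginine, ..., Valine).
freq_percent = [7.4, 4.2, 4.4, 5.9, 3.3, 5.8, 3.7, 7.4, 2.9, 3.8,
                7.6, 7.2, 1.8, 4.0, 5.0, 8.1, 6.2, 1.3, 3.3, 6.8]
p = [f / 100 for f in freq_percent]

D = 4  # size of the code alphabet: A, C, G, U

# Entropy of the source in bits: H = -sum_i p_i * log2(p_i).
H = -sum(x * math.log2(x) for x in p)

# Noiseless coding theorem: average code-word length >= H / log2(D).
bound = H / math.log2(D)
print(f"H = {H:.2f} bits, lower bound = {bound:.2f}")  # ~2.10
print("minimum integer n:", math.ceil(bound))          # 3
```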

Guess the actual word length that occurs in genetics! It is 3. If there were just one more letter in the code alphabet, then "n" could be 2, since the bound would drop to 4.2/log2(5) ≈ 1.8 and a five-letter alphabet offers 5^2 = 25 ≥ 20 two-letter words. (Due to the anti-parallel structure of DNA the alphabet size has to remain even, so I should probably have written "If there were two more letters...".) However, with only four letters, the word length cannot theoretically be pushed below 3.

Given the size of the alphabet, nature is as efficient as it can theoretically get. It even opportunistically exploits the difference between 2.1 and 3 by assigning more than a single codon (namely a three-letter code word) to some of the amino acids. The number of code representations belonging to each amino acid is respectively 4, 6, 2, 2, 2, 2, 2, 4, 2, 3, 6, 2, 1, 2, 4, 6, 4, 1, 2 and 4.

There are some suggestions that the genetic code has evolved in an error-minimizing fashion. When a single-nucleotide mutation transforms UUU into UUC, the resulting codon still codes for Phenylalanine. Hence, in some sense, a greater number of representations entails less sensitivity to mutations and to operational errors during the decoding process.

When you compare the number of representations against the observed frequency of occurrence, the following pattern emerges:

[Graph: number of codon representations vs. observed frequency for the 20 amino acids]

Note that if you exclude the outlier Arginine, then the correlation becomes 0.78. (A quick sanity check of this number appears right after the list below.) Here are two possible explanations of this high correlation:

1) Assuming that the observed frequency distribution is a reliable indicator of the relative importance of each amino acid, the high correlation ensures that the formation of important amino acids is less affected by mutations and decoding mistakes.

2) Amino acids whose production is more resistant to random shocks (e.g. mutations and decoding mistakes) will sooner or later outnumber the weaker ones.
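Here is the promised sanity check, a small Python sketch of my own (using the frequencies and codon counts quoted above) that reproduces the 0.78 figure:

```python
import math

# Frequencies (percent) and codon counts, in the amino-acid order used above.
freq   = [7.4, 4.2, 4.4, 5.9, 3.3, 5.8, 3.7, 7.4, 2.9, 3.8,
          7.6, 7.2, 1.8, 4.0, 5.0, 8.1, 6.2, 1.3, 3.3, 6.8]
codons = [4, 6, 2, 2, 2, 2, 2, 4, 2, 3, 6, 2, 1, 2, 4, 6, 4, 1, 2, 4]

# Drop Arginine, the second entry of both lists.
x = freq[:1] + freq[2:]
y = codons[:1] + codons[2:]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

print(f"correlation without Arginine: {pearson(x, y):.2f}")  # ~0.78
```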

Notice the striking similarity with the following graph:

[Graph: expected frequency vs. observed frequency for the 20 amino acids]

Here "expected frequency" depicts the relative frequency of amino acids that one would get from a random, serially independent juxtaposition of the available bases (A, C, G, U) in the DNA. For example, since Glutamine is coded by CAA and CAG, its expected frequency in the metabolism is W*[(C%*A%*A%)+(C%*A%*G%)], where X% is the frequency of base X in the DNA and W (larger than 1) is a factor that corrects the sum for the presence of stop codons in the DNA.
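To make the computation concrete, here is a minimal sketch. The uniform base frequencies are illustrative placeholders rather than measured values, and I take W = 1/(1 - P(stop)), i.e. a simple renormalization over the non-stop codons, as one natural reading of the correction factor:

```python
# Expected amino-acid frequency under random, serially independent
# juxtaposition of bases. Base frequencies are illustrative placeholders.
base_freq = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}

def codon_prob(codon):
    """Probability of a codon under independent base draws."""
    p = 1.0
    for base in codon:
        p *= base_freq[base]
    return p

# Correction factor W for the presence of stop codons: renormalize over
# the codons that actually code for amino acids (one possible reading).
STOP_CODONS = ("UAA", "UAG", "UGA")
W = 1.0 / (1.0 - sum(codon_prob(c) for c in STOP_CODONS))

# Glutamine is coded by CAA and CAG.
expected_gln = W * (codon_prob("CAA") + codon_prob("CAG"))
print(f"expected Glutamine frequency: {expected_gln:.4f}")  # ~0.0328 here
```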

If you exclude the outlier Arginine, then correlation in the above graph becomes 0.89. In other words, the observed frequency is remarkably in line with what would have happened if the transcription mechanism lacked any structure and acted randomly on DNA.

Two cases come to my mind:

1) What happens if the number of codons coding amino acid X increases (decreases) while none of A%, C%, G% and U% change? Each of the 64 (= 4*4*4) mathematically possible codons is assigned either to an amino acid or to a "stop" signal. Therefore, as the number of codons that code X increases (decreases), the number of codons assigned to some other amino acids needs to decrease (increase). The expected frequency of X will increase (decrease) while that of the others will decrease (increase).

2) What happens if A%, C%, G% or U% change due to mass mutation while the code remains the same? Then the whole expected-frequency distribution will shift accordingly, as the sketch below illustrates.
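As a toy illustration of the second case, here is a sketch that recomputes Glutamine's expected frequency under two base compositions (both vectors are made up purely for illustration):

```python
# Case 2, illustrated: the same genetic code, two different base compositions.
def expected_freq(codons, base_freq, stops=("UAA", "UAG", "UGA")):
    """Expected frequency of the amino acid coded by `codons`."""
    prob = lambda c: base_freq[c[0]] * base_freq[c[1]] * base_freq[c[2]]
    W = 1.0 / (1.0 - sum(prob(s) for s in stops))  # stop-codon correction
    return W * sum(prob(c) for c in codons)

before = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}  # hypothetical
after  = {"A": 0.35, "C": 0.15, "G": 0.25, "U": 0.25}  # after a "mass mutation"

gln = ("CAA", "CAG")  # Glutamine codons
print(f"before: {expected_freq(gln, before):.4f}")  # ~0.0328
print(f"after:  {expected_freq(gln, after):.4f}")   # ~0.0340, distribution shifted
```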

If the second graph depicts a causal relationship, then in each of the two cases above, observed frequencies will soon align themselves with the new expected frequencies. What happens if this development endangers the stability and survival of the metabolism? Will any dynamic mechanisms kick in and undo part of the mass mutation or switch the code-words around so that the expected frequencies remain as before?

One final question: What is wrong with Arginine?