nomenclature - Where did amino acids get their one-letter codes?

Tuesday, October 9, 2018

Some amino acids have a one-letter code that's just the first letter of their name. This makes sense, and since more than one amino acid begins with the same letter, other letters obviously had to be used for the rest.

But why were these other letters chosen? For instance, is there a reason 'W' specifically was chosen for tryptophan (other than the fact that 'T' was taken)?

Answer

Tryptophan

For instance, is there a reason 'W' specifically was chosen for tryptophan (other than the fact that 'T' was taken)?

Once you have assigned the other 19 amino acids, there are only 7 letters of the alphabet left: B, J, O, U, W, X, and Z. (Certainly not a nice Scrabble hand to have!)

If one wants to use a letter found within the name of the amino acid, the only available letter would be O. However, the usage of U and O was historically discouraged because these letters could be easily confused with other letters (U with V; O with G, Q, C, D, and the number 0).

It turns out that W was chosen because it is a very fat letter, reminiscent of the indole ring system present in tryptophan (the only standard amino acid with a bicyclic side chain). (See below for source.)
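The letter bookkeeping above is easy to check by hand, but a short Python sketch makes it explicit. The dictionary simply transcribes the standard one-letter codes; the variable names (ONE_LETTER, leftover, in_name) are illustrative choices and not part of any recommendation.

```python
import string

# One-letter codes of the 20 standard amino acids.
ONE_LETTER = {
    "alanine": "A", "arginine": "R", "asparagine": "N", "aspartic acid": "D",
    "cysteine": "C", "glutamic acid": "E", "glutamine": "Q", "glycine": "G",
    "histidine": "H", "isoleucine": "I", "leucine": "L", "lysine": "K",
    "methionine": "M", "phenylalanine": "F", "proline": "P", "serine": "S",
    "threonine": "T", "tryptophan": "W", "tyrosine": "Y", "valine": "V",
}

# Letters already used by the 19 amino acids other than tryptophan.
taken = {code for name, code in ONE_LETTER.items() if name != "tryptophan"}

# Letters of the alphabet that are still free at that point.
leftover = sorted(set(string.ascii_uppercase) - taken)

# Free letters that actually occur in the word "tryptophan".
in_name = [letter for letter in leftover if letter in "TRYPTOPHAN"]

print(leftover)  # ['B', 'J', 'O', 'U', 'W', 'X', 'Z']
print(in_name)   # ['O']
```

Only O survives the second filter, so once O is ruled out as too easily confused with other characters, no remaining letter comes from the name itself and a mnemonic such as the shape of W is all that is left.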

What happened to the other six letters, then?

All 26 letters of the alphabet now find use as a one-letter code for amino acids or various combinations thereof.

After their discoveries, selenocysteine and pyrrolysine (the latter found only in some archaea and bacteria) were assigned U and O respectively. Furthermore, B is used to represent aspartic acid OR asparagine; J is used to represent leucine OR isoleucine; Z is used to represent glutamic acid OR glutamine; and X is used to represent an unknown amino acid.

I believe B and Z find use because, in protein sequencing, acid hydrolysis is often used to break peptide bonds. This has the undesirable side-effect of hydrolysing the amide groups in asparagine/glutamine, leading to the formation of aspartic/glutamic acids, which means that one cannot tell exactly which amino acid it was at the start. J is used in mass spectrometry, where isoleucine and leucine have the same mass and are therefore difficult to distinguish.
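To see why the ambiguity codes are needed, here is a minimal Python sketch of the information loss described above. The function names (after_hydrolysis, report_ambiguous, expand) and the peptide "GNAQ" are hypothetical and chosen purely for illustration; the chemistry is reduced to a lookup table.

```python
from itertools import product

# Ambiguity codes and the residues each may stand for.
# X is omitted because it can stand for any residue.
AMBIGUOUS = {"B": "DN", "Z": "EQ", "J": "LI"}

# Acid hydrolysis turns the side-chain amides into the corresponding acids.
HYDROLYSIS = {"N": "D", "Q": "E"}


def after_hydrolysis(seq):
    """Residues one would observe once every N/Q has become D/E."""
    return "".join(HYDROLYSIS.get(residue, residue) for residue in seq)


def report_ambiguous(observed):
    """Write each observed D/E as B/Z, since it may originally have been N/Q."""
    back = {"D": "B", "E": "Z"}
    return "".join(back.get(residue, residue) for residue in observed)


def expand(seq):
    """List every concrete sequence consistent with a sequence containing B, Z or J."""
    choices = [AMBIGUOUS.get(residue, residue) for residue in seq]
    return ["".join(combo) for combo in product(*choices)]


observed = after_hydrolysis("GNAQ")   # hypothetical peptide
print(observed)                       # GDAE
print(report_ambiguous(observed))     # GBAZ
print(expand("GBAZ"))                 # ['GDAE', 'GDAQ', 'GNAE', 'GNAQ']
```

Four different starting peptides collapse onto the same observed string, which is exactly the situation B and Z are meant to record honestly.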

Why did they choose the letters they did?

As far as I am aware, the one-letter symbols were adopted by both IUPAC and IUB (IUBMB since 1991) in their joint 1983 recommendations on "Nomenclature and Symbolism for Amino Acids and Peptides".¹ ankit7540's concise summary of the historical development already mentioned these recommendations.

In particular, Section 3AA-21.2 "The Code Symbols" has a description of why the letters were chosen. This document is probably the most authoritative stance on the matter. The rationale is mostly in line with Jan's answer:

Initial letters of the names of the amino acids were chosen where there was no ambiguity. There are six such cases: cysteine, histidine, isoleucine, methionine, serine, and valine. All the other amino acids share the initial letters A, G, L, P or T, so arbitrary assignments were made. These letters were assigned to the most frequently occurring and structurally most simple of the amino acids with these initials, alanine (A), glycine (G), leucine (L), proline (P) and threonine (T).

Other assignments were made on the basis of associations that might be helpful in remembering the code, e.g. the phonetic associations of F for phenylalanine and R for arginine. For tryptophan the double ring of the molecule is associated with the bulky letter W. The letters N and Q were assigned to asparagine and glutamine respectively; D and E to aspartic and glutamic acids respectively. K and Y were chosen for the two remaining amino acids, lysine and tyrosine, because, of the few remaining letters, they were close alphabetically to the initial letters of the names. U and O were avoided because U is easily confused with V in handwritten material, and O with G, Q, C and D in imperfect computer print-outs, and also with zero. J was avoided because it is absent from several languages.

Two other symbols are often necessary for partly determined sequences, so B was assigned to aspartic acid or asparagine when these have not been distinguished; Z was similarly assigned to glutamic acid or glutamine. X means that the identity of an amino acid is undetermined, or that the amino acid is atypical.
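The initial-letter rule in the quoted passage can be verified with a short count. This is only a sketch: the list of names is the usual set of 20 standard amino acids, and the variable names are arbitrary.

```python
from collections import Counter

# Names of the 20 standard amino acids.
NAMES = [
    "alanine", "arginine", "asparagine", "aspartic acid", "cysteine",
    "glutamic acid", "glutamine", "glycine", "histidine", "isoleucine",
    "leucine", "lysine", "methionine", "phenylalanine", "proline",
    "serine", "threonine", "tryptophan", "tyrosine", "valine",
]

# How many names start with each letter.
initials = Counter(name[0].upper() for name in NAMES)

# Initials used by exactly one amino acid, and initials shared by several.
unique = sorted(letter for letter, count in initials.items() if count == 1)
shared = sorted(letter for letter, count in initials.items() if count > 1)

print(unique)  # ['C', 'H', 'I', 'M', 'S', 'V']
print(shared)  # ['A', 'G', 'L', 'P', 'T']
```

The six unambiguous initials and the five contested ones match the passage, leaving nine amino acids whose codes had to come from some other association.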

One can only hypothesise what they meant by "associations that might be helpful in remembering the code" in the case of N/Q/D/E. My best guess is:

D and E were possibly chosen for aspartic and glutamic acids because they were the only consecutive pair of letters left, emphasising their chemical similarity. Aspartic acid is shorter than glutamic acid by one methylene group (CH₂), so it gets the earlier letter D.

Glutamine sounds like Q-tamine. If you don't think it sounds similar, repeat it 50 times until you do.

AsparagiNe was assigned N.

Reference

IUPAC-IUB Joint Commission on Biochemical Nomenclature. Nomenclature and Symbolism for Amino Acids and Peptides: Recommendations 1983. Eur. J. Biochem. 1984, 138 (1), 9–37. DOI: 10.1111/j.1432-1033.1984.tb07877.x. An HTML version (perhaps more user-friendly) is also available online.
