Saturday, July 8, 2017

organic chemistry - PubChem, InChI, SMILES, and uniqueness





  1. PubChem compound 6140 is L-phenylalanine in its neutral (not zwitterionic) form. According to PubChem, this molecule has the following SMILES and InChI identifiers:



    • SMILES: C1=CC=C(C=C1)CC(C(=O)O)N

    • InChI: InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/t8-/m0/s1




  2. PubChem compound 6925665 is the zwitterionic form of L-phenylalanine (protonated amine and deprotonated carboxylate). PubChem has decided that this species should be called "(2S)-2-azaniumyl-3-phenylpropanoate". The SMILES and InChI identifiers are:




    • SMILES: C1=CC=C(C=C1)CC(C(=O)[O-])[NH3+]

    • InChI: InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/t8-/m0/s1




  3. Confusion. I do not understand why these two compounds have different PubChem entries (CIDs) and different SMILES identifiers, but the same InChI string. Each SMILES identifier appears to reflect the structure displayed by PubChem, but the single InChI given for both compounds seems to reflect only the neutral form. I even put the InChI into RDKit and converted it back to InChI. The result was the same, and RDKit interpreted this InChI as the neutral (not zwitterionic) species. Why do two structures that PubChem treats as distinct share a single InChI?
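The comparison in item 3 can be reproduced with nothing but string handling. The following is a minimal sketch, not an official InChI parser: `inchi_layers` is a hypothetical helper that splits an InChI on `/` and keys each layer by its prefix letter. That is enough for the two strings quoted above, but it would not cope with InChIs in which a prefix letter repeats.

```python
# Identifiers copied from the two PubChem records quoted above.
NEUTRAL = {
    "cid": 6140,
    "smiles": "C1=CC=C(C=C1)CC(C(=O)O)N",
    "inchi": "InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7"
             "/h1-5,8H,6,10H2,(H,11,12)/t8-/m0/s1",
}
ZWITTERION = {
    "cid": 6925665,
    "smiles": "C1=CC=C(C=C1)CC(C(=O)[O-])[NH3+]",
    "inchi": "InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7"
             "/h1-5,8H,6,10H2,(H,11,12)/t8-/m0/s1",
}


def inchi_layers(inchi):
    """Split an InChI string into a {layer_prefix: value} dict.

    Hypothetical helper for illustration only. It assumes the string has
    a version and a formula segment and that each prefix occurs once.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version, "formula": formula}
    for segment in rest:
        # The first character names the layer (c, h, q, p, t, m, s, ...).
        layers[segment[0]] = segment[1:]
    return layers


neutral_layers = inchi_layers(NEUTRAL["inchi"])
zwitterion_layers = inchi_layers(ZWITTERION["inchi"])

# The SMILES strings differ, but every InChI layer is identical.
print(NEUTRAL["smiles"] == ZWITTERION["smiles"])   # False
print(neutral_layers == zwitterion_layers)         # True

# Neither string carries a charge (/q) or proton (/p) layer.
print("q" in zwitterion_layers, "p" in zwitterion_layers)  # False False
```

Neither string has a charge or proton layer, which is consistent with RDKit reading the InChI as the neutral species.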




Here are some possibilities:




  1. PubChem is in error. They should change the InChI for the zwitterionic compound. (If so, to what?)

  2. PubChem is right but is using definitions of compound and CID that are different than mine.

  3. I have some kind of fundamental misunderstanding of the purpose of InChI, which I had thought would uniquely specify a molecular structure. Perhaps InChI is instead designed to abstract away distinctions like zwitterion vs. neutral.

  4. RDKit interprets InChIs oddly.

  5. Something else.



Answer



Unfortunately, PubChem is right: the two structures have the same InChI string and key, since the total number of protons is the same in the zwitterion and the neutral form. So the discrepancy is by design.
I also always thought InChI was designed to distinguish between such forms, but this turns out to be one of the limitations of the system. The issue is addressed in section 13.2 of the technical FAQ of the InChI Trust:




The different protonation states of the same compound will have InChIKeys differing only by the protonation indicator (unless both states have a number of inserted/removed protons greater than 12; in this case the protonation flag will also be the same, ‘A’).
This is exemplified below by standard InChIKeys as well as standard InChI strings for neutral, zwitterionic, anionic and cationic states of glycine (note that neutral and zwitterionic states do not differ in the total number of protons so they have the same standard InChI/InChIKey):
[Image: InChI for glycine]
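The protonation indicator mentioned in the FAQ is the final character of the InChIKey, so it can be read off without any toolkit. The sketch below rests on two assumptions that should be checked against the InChI documentation: that `N` marks the neutral proton count, with later letters standing for added protons and earlier letters for removed ones, and that the anion and cation keys for glycine are the neutral key with only that final character changed, as the quoted passage implies. `protonation_offset` is a hypothetical helper name.

```python
def protonation_offset(inchikey):
    """Return the net number of protons added (+) or removed (-).

    Hypothetical helper. Assumes the final InChIKey character encodes
    the proton count relative to 'N' (0), and that 'A' means the count
    lies outside the range the flag can express, as in the quoted FAQ.
    """
    blocks = inchikey.split("-")
    if len(blocks) != 3 or len(blocks[2]) != 1:
        raise ValueError("expected an InChIKey of the form XXXX-XXXX-X")
    flag = blocks[2]
    if flag == "A":
        return None  # more than 12 protons inserted or removed
    if not "B" <= flag <= "Z":
        raise ValueError("unexpected protonation flag: " + flag)
    return ord(flag) - ord("N")


# Neutral glycine and its zwitterion share this key; the other two keys
# differ only in the last character (an assumption, see above).
GLYCINE_KEYS = {
    "neutral / zwitterion": "DHMQDGOQFOQNFH-UHFFFAOYSA-N",
    "anion": "DHMQDGOQFOQNFH-UHFFFAOYSA-M",
    "cation": "DHMQDGOQFOQNFH-UHFFFAOYSA-O",
}

for state, key in GLYCINE_KEYS.items():
    print(state, protonation_offset(key))
# neutral / zwitterion 0
# anion -1
# cation 1
```

Because the zwitterion and the neutral form contain the same total number of protons, both map to offset 0, and nothing else in the key tells them apart.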


