Sunday, January 22, 2017

pronunciation - Is it possible to algorithmically convert Japanese text to Romaji?

I am a programmer and I recently wrote two browser extensions: one replaces every English word found on a web page with its "pronunciation", and the other replaces all Chinese characters with their Pinyin-with-tone-marks counterparts. Now I am wondering whether something similar is possible for Japanese. If so, what would the approach be? Can any of it be done algorithmically, or can you only use a dictionary? Or is even a dictionary not enough, given that the same characters can sound different depending on context? In other words, can you do this conversion without actually attempting machine semantic translation?


Basically this is very difficult.

Real Japanese sentences on the net are a mixture of kanji, hiragana, katakana, and the English alphabet. See Japanese writing system on Wikipedia.

Among these, hiragana and katakana are almost "pronunciation symbols" in themselves. You can convert them to romaji using this table and you're 80% done. The remaining 20% is a bit tricky but can be handled algorithmically. Still, there are various romanization systems, so you have to choose one deliberately. There are also some "extended katakana" which may not transliterate straightforwardly.
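As a rough illustration of the kana step, here is a minimal Python sketch. It assumes a Hepburn-style romanization and uses a deliberately partial lookup table (the names BASE, SMALL_Y, to_hiragana and kana_to_romaji are made up for this example, and the table is not the one linked above). It covers a few of the "tricky" cases: katakana folded into hiragana by code point, small ゃ/ゅ/ょ digraphs, the doubling mark っ, and the long-vowel mark ー.

```python
# Partial Hepburn-style table (illustrative subset only; a real table covers all kana).
BASE = {
    "あ": "a", "い": "i", "う": "u", "え": "e", "お": "o",
    "か": "ka", "き": "ki", "く": "ku", "け": "ke", "こ": "ko",
    "が": "ga", "ぎ": "gi", "ぐ": "gu", "げ": "ge", "ご": "go",
    "さ": "sa", "し": "shi", "す": "su", "せ": "se", "そ": "so",
    "じ": "ji",
    "た": "ta", "ち": "chi", "つ": "tsu", "て": "te", "と": "to",
    "な": "na", "に": "ni", "ぬ": "nu", "ね": "ne", "の": "no",
    "は": "ha", "ひ": "hi", "ふ": "fu", "へ": "he", "ほ": "ho",
    "ま": "ma", "み": "mi", "む": "mu", "め": "me", "も": "mo",
    "や": "ya", "ゆ": "yu", "よ": "yo",
    "ら": "ra", "り": "ri", "る": "ru", "れ": "re", "ろ": "ro",
    "わ": "wa", "を": "o", "ん": "n",
}
SMALL_Y = {"ゃ": "a", "ゅ": "u", "ょ": "o"}


def to_hiragana(text):
    """Fold katakana (U+30A1..U+30F6) onto hiragana, which sits 0x60 lower."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text
    )


def kana_to_romaji(text):
    hira = to_hiragana(text)
    out = []
    double_next = False  # set after a small っ
    i = 0
    while i < len(hira):
        ch = hira[i]
        nxt = hira[i + 1] if i + 1 < len(hira) else ""
        if ch == "っ":
            double_next = True
            i += 1
            continue
        if ch == "ー" and out and out[-1][-1] in "aiueo":
            out.append(out[-1][-1])  # long-vowel mark: repeat the previous vowel
            i += 1
            continue
        syllable = BASE.get(ch)
        if syllable is None:
            out.append(ch)  # not in the table (kanji, punctuation, ...): pass through
            double_next = False
            i += 1
            continue
        if nxt in SMALL_Y and len(syllable) > 1 and syllable.endswith("i"):
            stem = syllable[:-1]
            glide = "" if stem in ("sh", "ch", "j") else "y"
            syllable = stem + glide + SMALL_Y[nxt]  # ki + small yo -> kyo, chi + small ya -> cha
            i += 1  # the small kana has been consumed
        if double_next:
            # Hepburn doubles the consonant, writing "tch" before "ch".
            syllable = ("t" if syllable.startswith("ch") else syllable[0]) + syllable
            double_next = False
        out.append(syllable)
        i += 1
    return "".join(out)


print(kana_to_romaji("ひらがな"), kana_to_romaji("カタカナ"))
```

This sketch does not handle everything in the remaining 20%: for instance, は, へ and を read as particles, ん before a vowel, and macron-style long vowels all depend on the romanization system chosen and, for particles, on knowing the word boundaries.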

Kanji are the difficult part. Character-by-character replacement makes no sense because one kanji can be read differently in different words, and there are many jukujikun (readings assigned to a whole word rather than to its individual characters). So you absolutely need a dictionary of some sort, but even with a dictionary, kanji are difficult for several reasons.

  1. Japanese sentences are written without any spaces, so you cannot determine word boundaries with simple regular expressions. You need a dedicated morphological analyzer for this purpose, for example this and this (I have not tested them). Note that analyzers are not perfect.

  2. Sometimes the exact same word or phrase can be read differently depending on the context. English has a similar problem, e.g., "minute", "read", "wind". See: Difference between こんにち and きょう

  3. Some uncommon words (especially proper nouns) are not in any dictionary, but you still have to make a "reasonable guess" at them. Of course English has the same problem, but the algorithm for doing this in Japanese might be more complicated.
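To make the three points above concrete, here is a toy Python sketch of dictionary lookup by greedy longest match. It is not how a real morphological analyzer works (those typically score competing segmentations with a statistical model), and the lexicon, the function names and the choice of the first listed reading as the default are all assumptions made for illustration.

```python
# Hypothetical toy lexicon: surface form -> candidate readings (first one = default).
LEXICON = {
    "今日": ["きょう", "こんにち"],  # same spelling, two readings (point 2)
    "学校": ["がっこう"],
    "行く": ["いく", "ゆく"],
    "今": ["いま"],
    "日": ["ひ", "にち"],
    "は": ["は"],
    "に": ["に"],
}
MAX_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Greedy longest-match segmentation; returns (surface, readings) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in LEXICON:
                tokens.append((piece, LEXICON[piece]))
                i += size
                break
        else:
            # Unknown character (point 3): keep it, with no reading available.
            tokens.append((text[i], []))
            i += 1
    return tokens


def default_reading(text):
    """Join the first listed reading of each token; unknown tokens stay as written."""
    return "".join(
        readings[0] if readings else surface for surface, readings in segment(text)
    )


for surface, readings in segment("今日は学校に行く"):
    print(surface, readings)
```

Here the longest match keeps 今日 together instead of splitting it into 今 and 日, but nothing in the sketch can decide between きょう and こんにち; that choice needs context, which is where a real analyzer and its dictionary come in.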


periodic trends - Comparing radii in lithium, beryllium, magnesium, aluminium and sodium ions

Apparently, of the last four, $\ce{Mg^2+}$ is closest in radius to $\ce{Li+}$. Is this true, and if so, why would a whole larger shell ($\ce{...