Programming tips with Japanese Language/Characters

Question

I have an idea for a few web apps to write to help me, and maybe others, learn Japanese better since I am studying the language.

My problem is that the site will be mostly in English, so it needs to mix in Japanese characters fluently: usually hiragana and katakana, and later kanji. I am getting closer to accomplishing this; I have figured out that the pages and source files need to be Unicode, served with a UTF-8 content type.

However, my problem comes in the actual coding. What I need is to manipulate strings of text that are kana. One example is:

けす: I need to take that verb and convert it to the te-form, けして. I would prefer to do this in JavaScript, as it will help down the road with more manipulation, but if I have to I will just make DB calls and hold everything in a DB.

My question is not only how to do it in JavaScript, but also what some tips and strategies are for doing these kinds of things in other languages. I am hoping to get more into writing language-learning apps, but I am lost when it comes to this.

Are you looking at some form of stemming? Pardon my ignorance, but this looks harder (for a logographic language) than what you would do for a regular alphabet based language. — dirkgently, May 02 '09 at 18:04
No, not stemming. In the example the root word is basically けす, but I am changing the す to し and adding て. Another example is のむ: changing the む to んで gives のんで. An easier example might be たべる, where you drop the る and add て to get たべて. Hopefully this makes more sense. — Buddy Lindsey, May 02 '09 at 18:09
Your examples (strangely!) give me a notion (which I am sure is wrong) that all you want is some string replacement. Even a (Unicode) regex would work. — dirkgently, May 02 '09 at 18:14
Are you trying to write Japanese automatically? I mean, is the software meant to decide what must be written by applying some Japanese syntax rules? — Daniel Daranas, Jul 03 '09 at 11:52
Be careful: some uncommon kanji outside the Unicode Basic Multilingual Plane require two JavaScript "characters". You only have to worry about these if some of your code deals with the individual characters that make up strings. A number of non-BMP characters are in regular use on the Japanese Wikipedia. — hippietrail, Apr 26 '11 at 14:41
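
To illustrate the last comment, here is a minimal sketch, assuming an ES2015-or-later JavaScript engine. The kanji 𠮷 (U+20BB7) is my own choice of a non-BMP character for illustration; it is not taken from the comment.

```javascript
// 𠮷 (U+20BB7) lies outside the BMP, so JavaScript stores it as a surrogate pair.
const word = "\u{20BB7}野家"; // 𠮷野家

const codeUnits = word.length;        // counts UTF-16 code units
const codePoints = [...word].length;  // string iteration goes by code point

// Index- and slice-based operations can cut a surrogate pair in half.
const brokenFirst = word.slice(0, 1);   // a lone high surrogate
const safeFirst = Array.from(word)[0];  // the whole character

console.log(codeUnits, codePoints);
console.log(brokenFirst === safeFirst);
```

For kana and common kanji the two counts agree, so this only matters for code that walks a string character by character.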

score 26 · Answer 1 · edited Dec 08 '16 at 21:19

Stick to Unicode and UTF-8 everywhere.
Stay away from the native Japanese encodings (EUC-JP, Shift_JIS, ISO-2022-JP), but be aware that you'll probably encounter them at some point if you continue.
Get familiar with a segmenter for doing complicated stuff like POS analysis, word segmentation, etc. The standard tools used by most people who do NLP (natural language processing) work on Japanese are listed below, in order of popularity/power.

MeCab (originally on SourceForge) is awesome: it allows you to take text like,

「日本語は、とても難しいです。」

and get all sorts of great info back

kettle:~$ echo 日本語は、難しいです | mecab 
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
は   助詞,係助詞,*,*,*,*,は,ハ,ワ
、   記号,読点,*,*,*,*,、,、,、
難しい 形容詞,自立,*,*,形容詞・イ段,基本形,難しい,ムズカシイ,ムズカシイ
です  助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

which is basically a detailed run-down of the parts-of-speech, readings, pronunciations, etc. It will also do you the favor of analyzing verb tenses,

kettle:~$ echo メキシコ料理が食べたい | mecab 
メキシコ    名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ
料理  名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ
が   助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ  動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい  助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS
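
If you call MeCab from your own code, its text output is easy to turn into objects. The following is only a sketch: it assumes MeCab's default output format (surface form, a tab, then comma-separated features) with the IPADIC dictionary used above, and the field names and the `parseMecabOutput` function are my own labels, not part of MeCab.

```javascript
// Hypothetical labels for the nine IPADIC feature columns shown above.
const FIELDS = ["pos", "posDetail1", "posDetail2", "posDetail3",
                "conjugationType", "conjugationForm",
                "baseForm", "reading", "pronunciation"];

function parseMecabOutput(output) {
  const tokens = [];
  for (const line of output.split("\n")) {
    if (line === "EOS" || line.trim() === "") continue; // sentence terminator
    const [surface, featureString = ""] = line.split("\t");
    const features = featureString.split(",");
    const token = { surface };
    // Unknown words may carry fewer columns, so default missing ones to "*".
    FIELDS.forEach((name, i) => { token[name] = features[i] ?? "*"; });
    tokens.push(token);
  }
  return tokens;
}

const sample = [
  "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ",
  "たい\t助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ",
  "EOS",
].join("\n");

const tokens = parseMecabOutput(sample);
console.log(tokens.map(t => `${t.surface} -> ${t.baseForm} (${t.pos})`));
```

The `baseForm` column is what lets you go from an inflected form such as 食べ back to the dictionary form 食べる.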

However, the documentation is all in Japanese, and it's a bit complicated to set up and to figure out how to format the output the way you want it. There are packages available for Ubuntu/Debian, and bindings in a bunch of languages including Perl, Python, Ruby...

Apt repositories for Ubuntu:

deb http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
deb-src http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all

Packages to install:

$ apt-get install mecab-ipadic-utf8 mecab python-mecab

should do the trick, I think.

The other alternatives to MeCab are ChaSen, which was written years ago by the author of MeCab (who incidentally works at Google now), and Kakasi, which is much less powerful.

I would definitely try to avoid rolling your own conjugation routines. The problem is that it will require tons and tons of work, which others have already done, and covering all the edge cases with rules is, at the end of the day, impossible.

MeCab is statistically driven, and trained on loads of data. It employs a sophisticated machine learning technique called conditional random fields (CRFs) and the results are really quite good.

Have fun with the Japanese. I'm not sure how good your Japanese is, but if you need help with the docs for MeCab or whatever, feel free to ask about that as well. Kanji can be quite intimidating at the beginning.

I wish I could mark this as an answer too. :( Thanks for the great information. I was only going to do my own conjugation routines as a programming exercise and to better learn the core of the Japanese language. If I get further into Japanese I will definitely take a look at a segmenter. Thanks. — Buddy Lindsey, May 05 '09 at 02:52
Stumbled on MeCab while playing around with C#. Just wanted to add that it's awesome. There's also a MeCab webservice @ http://mimitako.net/api/mecapi.cgi . Oh and "unofficial" C# bindings @ http://en.sourceforge.jp/projects/mecabdotnet/ . Cheers! — Maiku Mori, Dec 26 '09 at 23:46
I cross-compiled MeCab to Javascript so it runs in a browser http://fasiha.github.io/mecab-emscripten/ and this post was super-helpful in understanding just what a useful tool it is! — Ahmed Fasih, Dec 07 '14 at 05:30

Michael Borgwardt · Accepted Answer · 2009-05-05T08:06:42.217

My question is not only how to do it in javascript, but what are some tips and strategies to doing these kinds of things in other languages too.

What you want to do is pretty basic string manipulation, apart from the missing word separators, as Berry notes, though that's not a technical problem.

Basically, for a modern Unicode-aware programming language (which JavaScript has been since version 1.3, I believe) there is no real difference between a Japanese kana or kanji and a Latin letter: they're all just characters. And a string is just, well, a string of characters.
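
As a small illustration of "they're all just characters", the ordinary string methods work on kana exactly as they do on Latin letters. This is a sketch using the verb from the question; the variable names are mine.

```javascript
const verb = "けす";

console.log(verb.length);          // 2: two kana, two characters
console.log(verb.endsWith("す"));  // true
console.log(verb.slice(0, -1));    // け

// Replace the final す with して to get the te-form from the question.
const teForm = verb.replace(/す$/, "して");
console.log(teForm);               // けして

// Mixing scripts in one string needs nothing special either.
const sentence = `The te-form of ${verb} is ${teForm}.`;
console.log(sentence);
```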

Where it gets difficult is when you have to convert between strings and bytes, because then you need to pay attention to what encoding you are using. Unfortunately, many programmers, especially native English speakers, tend to gloss over this problem because ASCII is the de facto standard encoding for Latin letters and other encodings usually try to be compatible with it. If Latin letters are all you need, then you can get along being blissfully ignorant about character encodings, believe that bytes and characters are basically the same thing, and write programs that mutilate anything that's not ASCII.

So the "secret" of Unicode-aware programming is this: learn to recognize when and where strings/characters are converted to and from bytes, and make sure that in all those places the correct encoding is used, i.e. the same one that will be used for the reverse conversion, and one that can encode all the characters you're using. UTF-8 is slowly becoming the de facto standard and should normally be used wherever you have a choice.

Typical examples (non-exhaustive):

When writing source code with non-ASCII string literals (configure encoding in the editor/IDE)
When compiling or interpreting such source code (compiler/interpreter needs to know the encoding)
When reading/writing strings to a file (encoding must be specified somewhere in the API, or in the file's metadata)
When writing strings to a database (encoding must be specified in the configuration of the DB or the table)
When delivering HTML pages via a webserver (encoding must be specified in the HTTP headers or the page's meta tag; forms can be even more tricky)
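
To make the string/byte boundary concrete, here is a minimal sketch using the standard `TextEncoder` and `TextDecoder` classes, which are available as globals in modern browsers and in Node.js. The cut after four bytes is an arbitrary choice of mine to show what happens when bytes are treated as if they were characters.

```javascript
const text = "けす";

// String -> bytes: UTF-8 uses three bytes for each of these kana.
const bytes = new TextEncoder().encode(text);
console.log(text.length, bytes.length); // 2 characters, 6 bytes

// Bytes -> string with the same encoding restores the original text.
const roundTrip = new TextDecoder("utf-8").decode(bytes);
console.log(roundTrip === text); // true

// Cutting after 4 bytes splits す in the middle; the decoder substitutes
// U+FFFD (the replacement character) for the incomplete sequence.
const truncated = new TextDecoder("utf-8").decode(bytes.subarray(0, 4));
console.log(truncated);
```

The same principle applies at every boundary in the list above: the bytes are only meaningful together with the name of the encoding that produced them.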

Actually after reading this and talking to a friend I tried to do basic string manipulation again based on the "everything is a string" and it worked. I have no idea what I was doing that killed the first attempt at it, but I am glad it was that easy and feel dumb for it not working the first time. Thanks for the response. — Buddy Lindsey, May 05 '09 at 02:49

score 2 · Answer 3 · answered May 04 '09 at 04:55

What you need to do is look at the rules of grammar. Have an array of rules for each conjugation. Let's take the 〜て form for example. Pseudocode, written here as JavaScript:

function teForm(verb) {
  const stem = verb.slice(0, -1);      // everything except the final kana
  switch (verb.slice(-1)) {
    case "る": return stem + "て";     // Type II verbs only: たべる → たべて
    case "す": return stem + "して";   // けす → けして
    // ...one case per remaining ending (む → んで, and so on)
  }
}

Basically, break it down into Type I, II and III verbs.

score 1 · Berry Tsakala · Answer 4 · 2009-05-03T09:45:11.750

Your question is totally unclear to me.

However, I have had some experience working with the Japanese language, so I'll give my two cents.

Since Japanese texts do not feature word separation (e.g. a space character), the most important tool we had to acquire was a dictionary-based word recognizer.

Once you have the text split, it's easier to manipulate it with "normal" tools.
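
To give a feel for what a dictionary-based word recognizer does, here is a toy sketch of greedy longest-match segmentation. This is my own simplified illustration, not the method used by the tools mentioned in this thread, and the six-entry dictionary is made up for the example; real segmenters use large dictionaries plus statistics to resolve ambiguity.

```javascript
// Toy dictionary; a real one holds hundreds of thousands of entries.
const DICTIONARY = new Set(["日本語", "日本", "は", "難しい", "です", "、"]);
const MAX_WORD_LENGTH = Math.max(...[...DICTIONARY].map(w => [...w].length));

function segment(text) {
  const chars = [...text];   // split by code point, not UTF-16 code unit
  const words = [];
  let i = 0;
  while (i < chars.length) {
    let match = chars[i];    // fall back to a single character
    // Try the longest candidate first and shrink until the dictionary matches.
    for (let len = Math.min(MAX_WORD_LENGTH, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (DICTIONARY.has(candidate)) { match = candidate; break; }
    }
    words.push(match);
    i += [...match].length;
  }
  return words;
}

console.log(segment("日本語は、難しいです"));
```

Note that the greedy rule picks 日本語 over the shorter 日本 simply because it is longer; that heuristic fails on genuinely ambiguous text, which is why the statistical tools exist.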

There were only two tools which did the above, and as a by-product they also worked as taggers (i.e. noun, verb, etc.).

Edit: always use Unicode when working with languages.

edited May 03 '09 at 09:45

answered May 02 '09 at 20:00

Sorry, my question is kind of two things in one. I was nervous about starting two different topics, so I combined "What are some tips for working with the Japanese language?" and "How can I accomplish xyz?". Are there any more tips you can offer from your experience? Anything would be great. I had not thought about separating out words; I hadn't gotten that far. Mostly I am after how to manipulate individual words. However, any tips on programming with the Japanese language are helpful and appreciated. To be honest, I was trying to avoid mapping files and Unicode, but it looks like I need to use either or both. – Buddy Lindsey May 02 '09 at 20:56

score 0 · Answer 5 · answered May 02 '09 at 18:39

If I recall correctly (and I slacked off a lot the year I took Japanese, so I could be wrong), the replacements you want to do are determined by the last symbol or two in the word. Taking your first example, any verb ending in 'す' will always have 'して' when conjugated this way. Similarly for む -> んで. Could you maybe establish a mapping of last character(s) -> conjugated form? You might have to account for exceptions, such as anything which conjugates to xxって.
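
A minimal sketch of such a mapping for the te-form follows. The endings table reflects the standard rules for Type I verbs; the exception table and the list of Type II verbs are tiny hand-picked samples of my own, and the names `TE_ENDINGS`, `EXCEPTIONS`, `ICHIDAN_VERBS` and `teForm` are hypothetical. The final kana alone cannot tell たべる (たべて) apart from とる (とって), so the sketch needs that extra list.

```javascript
// Final kana -> te-form ending for Type I (godan) verbs.
const TE_ENDINGS = new Map([
  ["う", "って"], ["つ", "って"], ["る", "って"],
  ["む", "んで"], ["ぶ", "んで"], ["ぬ", "んで"],
  ["く", "いて"], ["ぐ", "いで"], ["す", "して"],
]);

// Whole-word exceptions (irregular verbs and いく); sample entries only.
const EXCEPTIONS = new Map([["いく", "いって"], ["する", "して"], ["くる", "きて"]]);

// Type II (ichidan) verbs just drop る and add て; sample entries only.
const ICHIDAN_VERBS = new Set(["たべる", "みる"]);

function teForm(verb) {
  if (EXCEPTIONS.has(verb)) return EXCEPTIONS.get(verb);
  const stem = verb.slice(0, -1);
  if (ICHIDAN_VERBS.has(verb)) return stem + "て";
  const ending = TE_ENDINGS.get(verb.slice(-1));
  if (ending === undefined) throw new Error(`No te-form rule for: ${verb}`);
  return stem + ending;
}

console.log(["けす", "のむ", "たべる", "とる", "いく"].map(teForm));
```

Checking exceptions first keeps the general table free of special cases, at the cost of maintaining the word lists by hand.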

As for portability between languages, you'll have to implement the logic differently based on how they work. This solution would be fairly straightforward to implement for Spanish as well, since the conjugation depends on whether the verb ends in -ar, -er, or -ir (with some verbs requiring exceptions in your logic). Unfortunately, that's the limit of my multilingual skills, so I don't know how well it would do beyond those two.

Actually, I have thought about doing the mapping and can see the benefit of it, but I also see the benefit of the more on-the-fly transformation. I have been unsure of which approach to take, and even of how to deal with Japanese altogether as I code. The big thing is that later on, when I get to short forms and tai forms, is where I see the on-the-fly approach helping out. — Buddy Lindsey, May 02 '09 at 19:31

score 0 · Answer 6 · answered May 07 '09 at 05:14

Since most verbs in Japanese follow one of a small set of predictable patterns, the easiest and most extensible way to generate all the forms of a given verb is to have the verb know what conjugation it should follow, then write functions to generate each form depending on the conjugation.

Pseudocode:

generateDictionaryForm(verb)
  case Ru-Verb: verb.stem + る
  case Su-Verb: verb.stem + す
  case Ku-Verb: verb.stem + く
  ...etc.

generatePoliteForm(verb)
  case Ru-Verb: verb.stem + ります
  case Su-Verb: verb.stem + します
  case Ku-Verb: verb.stem + きます
  ...etc.

Irregular verbs would of course be special-cased.

Some variant of this would work for any other fairly regular language (i.e. not English).
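
A runnable sketch of that idea in JavaScript is shown below. The representation (a `stem` plus a `class` tag on each verb), the class names, and the `GODAN` and `IRREGULAR` tables are my own assumptions for illustration; in particular I use "ichidan" for verbs that simply add る or ます to the stem, and name the other classes after their dictionary ending.

```javascript
// Each godan class: [dictionary ending, i-row kana used before ます].
const GODAN = {
  u: ["う", "い"], ku: ["く", "き"], gu: ["ぐ", "ぎ"],
  su: ["す", "し"], tsu: ["つ", "ち"], nu: ["ぬ", "に"],
  bu: ["ぶ", "び"], mu: ["む", "み"], ru: ["る", "り"],
};

// Irregular verbs are special-cased with their forms spelled out.
const IRREGULAR = {
  suru: { dictionary: "する", polite: "します" },
  kuru: { dictionary: "くる", polite: "きます" },
};

function dictionaryForm(verb) {
  if (verb.class === "irregular") return IRREGULAR[verb.id].dictionary;
  if (verb.class === "ichidan") return verb.stem + "る";
  return verb.stem + GODAN[verb.class][0];
}

function politeForm(verb) {
  if (verb.class === "irregular") return IRREGULAR[verb.id].polite;
  if (verb.class === "ichidan") return verb.stem + "ます";
  return verb.stem + GODAN[verb.class][1] + "ます";
}

const kesu = { stem: "け", class: "su" };
const taberu = { stem: "たべ", class: "ichidan" };
const suru = { class: "irregular", id: "suru" };

for (const verb of [kesu, taberu, suru]) {
  console.log(dictionaryForm(verb), politeForm(verb));
}
```

Because each verb carries its own class, adding another form means writing one more function over the same tables rather than guessing the class from the spelling.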

score -2 · Answer 7 · edited Jan 28 '16 at 00:36

Try installing my gem (rom2jap). It is written in Ruby.

gem install rom2jap

Then, in irb or a Ruby script, load it with:

require 'rom2jap'

answered Jan 28 '16 at 00:17 by user5849542 · edited Jan 28 '16 at 00:36 by Tristan
