
I've been thinking about using Markov techniques to restore missing information to natural language text.

  • Restore all-caps text to mixed-case.
  • Restore accents / diacritics to languages which should have them but have been converted to plain ASCII.
  • Convert rough phonetic transcriptions back into native alphabets.

Those seem to be in order from least to most difficult. Basically, the problem is resolving ambiguities based on context.

I can use Wiktionary as a dictionary and Wikipedia as a corpus, using n-grams and hidden Markov models to resolve the ambiguities (a rough sketch of what I mean follows the examples below).

Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?

Examples

  • GEORGE LOST HIS SIM CARD IN THE BUSH   ⇨   George lost his SIM card in the bush
  • tantot il rit a gorge deployee   ⇨   tantôt il rit à gorge déployée
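
To make the first example concrete, here's the sort of unigram baseline I have in mind as a starting point, before any context modelling. All the names and the toy corpus below are only illustrative; the real counts would come from a Wikipedia dump:

    # Count how each word is usually cased in trusted text, then re-case an
    # all-caps sentence from those counts. `corpus` is a toy stand-in for
    # real Wikipedia sentences.
    from collections import Counter, defaultdict

    def build_case_model(sentences):
        casings = defaultdict(Counter)           # lowercased word -> seen casings
        for sentence in sentences:
            for word in sentence.split():
                casings[word.lower()][word] += 1
        return casings

    def restore_case(all_caps_sentence, casings):
        out = []
        for word in all_caps_sentence.split():
            seen = casings.get(word.lower())
            out.append(seen.most_common(1)[0][0] if seen else word.lower())
        if out:                                  # sentence-initial capital
            out[0] = out[0][:1].upper() + out[0][1:]
        return " ".join(out)

    corpus = ["George lost his SIM card", "the bush was thick"]
    model = build_case_model(corpus)
    print(restore_case("GEORGE LOST HIS SIM CARD IN THE BUSH", model))
    # -> 'George lost his SIM card in the bush' ("IN" is unseen and falls back
    #    to lowercase); ambiguous words like "Bush"/"bush" are exactly where
    #    n-gram context or an HMM would have to take over.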
hippietrail
  • You want to use probabilistic techniques and draw on Wikipedia as the body of text for these probabilities? Pretty cool idea, but don't get lost in it. The dictionary should still be relied on first. (You seem to realize that when you refer to 'ambiguities', but it's still worth the reminder IMO.) – jon_darkstar Dec 21 '10 at 02:19
  • In point 1, do you mean restore text in all caps to mixed case instead? Otherwise toUpperCase() will be all you need. – Stompchicken Dec 21 '10 at 10:07
  • @StompChicken: No I want to restore lost/missing information, the much harder problem: "GEORGE LOST HIS SIM CARD IN THE BUSH" -> "George lost his SIM card in the bush". – hippietrail Dec 21 '10 at 11:57
  • That's what I thought; you might want to edit your first point, then. – Stompchicken Dec 21 '10 at 12:05
  • Wow, I still hadn't noticed my wording was backwards - thanks! – hippietrail Dec 21 '10 at 12:08

2 Answers


I think you can use hidden Markov models (HMMs) for all three tasks, but also take a look at more modern models such as conditional random fields (CRFs). Also, here's some boost for your google-fu:

  • Restore mixed case to text in all caps

This is called truecasing.

  • Restore accents / diacritics to languages which should have them but have been converted to plain ASCII

I suspect Markov models are going to have a hard time with this. OTOH, labelled training data is free, since you can just take a bunch of accented text in the target language and strip the accents. See also the next answer.
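
For example, a minimal sketch of generating such (ASCII input, accented target) training pairs, using only the Python standard library (the function name is just illustrative):

    import unicodedata

    def strip_accents(text):
        # Decompose characters (NFD) and drop the combining marks (category Mn).
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

    accented = "tantôt il rit à gorge déployée"
    print((strip_accents(accented), accented))
    # ('tantot il rit a gorge deployee', 'tantôt il rit à gorge déployée')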

  • Convert rough phonetic transcriptions back into native alphabets

This seems strongly related to machine transliteration, which has been tried using pair HMMs (from bioinformatics/genome work).

Fred Foo

I'll take a crack at fleshing out how you would accomplish these.

Capitalisation

This is fairly close to Named Entity Recognition (NER) and is an example of a 'sequence tagging problem'. Proper nouns should have an initial capital, organisation names that are acronyms should be in all caps, and then there are other cases that fall outside those categories. It seems to me that it would therefore be harder than NER, so a straightforward dictionary-based approach probably wouldn't be an ideal solution.

If you were to use a hidden Markov model (HMM), this would amount to letting the 'hidden' state be one of [lowerCase, initCaps, allCaps] and training on some data that you assume is correctly cased (e.g. Wikipedia, but there are many other sources too). You then infer the hidden states for words that you aren't sure are correctly capitalised. There are a bunch of HMM libraries out there; I'm sure you can find one to suit your needs. I'd say trying an HMM is a good initial choice.
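
To make that a bit more concrete, here's a minimal, self-contained sketch of the training and Viterbi decoding steps. Everything in it is illustrative: the tag names, the crude smoothing, and the toy corpus standing in for Wikipedia text; a real system would use a proper HMM library.

    import math
    from collections import Counter, defaultdict

    STATES = ("lower", "initCaps", "allCaps")

    def tag_of(word):
        # Case tag observed in the (assumed correct) training text.
        if word.isupper() and len(word) > 1:
            return "allCaps"
        if word[:1].isupper():
            return "initCaps"
        return "lower"

    def train(sentences, smooth=1e-6):
        trans = defaultdict(Counter)     # previous tag -> counts of next tag
        emit = defaultdict(Counter)      # tag -> counts of lowercased words
        for sent in sentences:
            prev = "<s>"
            for word in sent.split():
                tag = tag_of(word)
                trans[prev][tag] += 1
                emit[tag][word.lower()] += 1
                prev = tag

        def logp(counter, key):          # crude add-epsilon smoothing
            total = sum(counter.values())
            return math.log((counter[key] + smooth) / (total + smooth * 1000))

        return trans, emit, logp

    def viterbi(words, trans, emit, logp):
        words = [w.lower() for w in words]
        best = {s: (logp(trans["<s>"], s) + logp(emit[s], words[0]), [s])
                for s in STATES}
        for w in words[1:]:
            best = {s: max((best[p][0] + logp(trans[p], s) + logp(emit[s], w),
                            best[p][1] + [s]) for p in STATES)
                    for s in STATES}
        return max(best.values())[1]     # path of the best-scoring final state

    corpus = ["George lost his SIM card in the bush", "The SIM card was cheap"]
    trans, emit, logp = train(corpus)
    print(viterbi("GEORGE LOST HIS SIM CARD".split(), trans, emit, logp))
    # -> ['initCaps', 'lower', 'lower', 'allCaps', 'lower'] on this toy corpus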

Non-ASCII characters

As you guessed, a tougher problem. If you tried to do this with an HMM at the word level, you would have an enormous number of hidden states, one for each accented word, which would probably be impossible to train. The problem is more tractable at the character level but you lose a tremendous amount of context if you only consider the previous character. If you start using n-grams instead of characters, your scaling problems come back. In short, I don't think this problem is like the previous one because the number of labels is too large to consider it a sequence labelling problem (I mean you can, it's just not practical).

I haven't heard of research in this area, but then again I'm no expert. My best guess would be to use a general language model for the language you are interested in. You could use it to give you the probability of a sentence in that language. Then you could try substituting possibly accented characters, compute the probabilities of the resulting candidate sentences, and take the most likely, or use some threshold on the difference, or something like that. You could train an n-gram language model fairly easily on a large corpus of the language.
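
For instance, a rough sketch of that candidate-and-score idea, at the word level, with a character trigram model. The names, the substitution table and the toy corpus are all illustrative; a real system would train on a large accented corpus and prune the candidate sets.

    import math
    from collections import Counter
    from itertools import product

    # Which accented forms each ASCII letter might stand for (French-ish toy table).
    SUBS = {"a": "aàâ", "e": "eéèê", "i": "iî", "o": "oôö", "u": "uùû", "c": "cç"}

    def train_char_trigrams(words, smooth=1e-6):
        tri, bi = Counter(), Counter()
        for word in words:
            chars = "^^" + word + "$"
            for i in range(len(chars) - 2):
                tri[chars[i:i + 3]] += 1
                bi[chars[i:i + 2]] += 1

        def logprob(word):               # log P(word) under the trigram model
            chars = "^^" + word + "$"
            return sum(math.log((tri[chars[i:i + 3]] + smooth)
                                / (bi[chars[i:i + 2]] + smooth * 100))
                       for i in range(len(chars) - 2))

        return logprob

    def candidates(ascii_word):
        # All ways of re-accenting the word according to SUBS.
        return ("".join(chars)
                for chars in product(*[SUBS.get(ch, ch) for ch in ascii_word]))

    def restore(sentence, logprob):
        return " ".join(max(candidates(w), key=logprob) for w in sentence.split())

    corpus = "tantôt il rit à gorge déployée".split()   # toy stand-in for real text
    logprob = train_char_trigrams(corpus)
    print(restore("tantot il rit a gorge deployee", logprob))
    # -> 'tantôt il rit à gorge déployée' (a circular toy example: it was
    #    trained on the very sentence it restores)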

I have no idea if this would actually work, either in terms of accuracy or efficiency. I don't have direct experience of this particular problem.

Transliteration

No idea, to be honest. I don't know where you would find data to make a system of your own. After a brief search, I found the Google Transliteration service (with API). Perhaps it does what you're after. I don't even have enough experience in languages with other scripts to really know what it's doing.

Stompchicken
  • I assumed when you said 'Markov chain' you meant Hidden Markov Model, which may not have been the case. That makes the first para of the second section a bit irrelevant. Everything else stands though. – Stompchicken Dec 21 '10 at 14:52
  • Yes I meant Markov models. I'd read so much lately that I remembered more gists than terminology (-: – hippietrail Dec 21 '10 at 17:00
  • Okay, I think the difference I was getting at is that the first one is a simple example of a classification problem (where an HMM would be quite appropriate), whereas the second is much harder to model in that way. – Stompchicken Dec 22 '10 at 11:16