
I'm working on spell-checking mixed-language webpages, and haven't been able to find any existing research on the subject.

The aim is to automatically detect the language at sentence level within mixed-language webpages and spell check each sentence against its appropriate language. Assume that we can ignore sentences which mix multiple languages together (e.g. "He has a certain je ne sais quoi"), and assume webpages can't contain more than 2 or 3 languages.

Trivial example (Welsh + English): http://wales.gov.uk/

I'm currently using a mix of:

  • Character distribution (e.g. U+0600–U+06FF = Arabic, etc.) – a minimal sketch follows this list
  • n-grams to discern languages that share a character set
  • Dictionary lookup to discern locale, e.g. en-US vs. en-GB
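
For concreteness, here's a minimal Java sketch of the character-distribution step, using the standard Character.UnicodeBlock lookup (the example strings are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class ScriptHistogram {
        // Tally the Unicode block of each letter in a sentence. A dominant
        // block such as ARABIC (U+0600-U+06FF) identifies the script outright;
        // Latin-script languages all land in BASIC_LATIN and need the n-gram
        // stage to be told apart.
        public static Map<Character.UnicodeBlock, Integer> histogram(String sentence) {
            Map<Character.UnicodeBlock, Integer> counts = new HashMap<>();
            for (int i = 0; i < sentence.length(); ) {
                int cp = sentence.codePointAt(i);
                if (Character.isLetter(cp)) {
                    counts.merge(Character.UnicodeBlock.of(cp), 1, Integer::sum);
                }
                i += Character.charCount(cp);
            }
            return counts;
        }

        public static void main(String[] args) {
            System.out.println(histogram("مرحبا بالعالم"));   // {ARABIC=12}
            System.out.println(histogram("Croeso i Gymru")); // {BASIC_LATIN=12} -> n-grams
        }
    }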

I have working code but am concerned it may be naive or needlessly re-inventing the wheel. Has anyone else done this before?

Oliver Emberton

2 Answers


You can use web APIs (Google & Yandex) for spell checking and language detection, but I don't think that option is very scalable.

Another option is to use the free Lucene tools for spell checking (http://wiki.apache.org/lucene-java/SpellChecker), but you have to index some corpora first; Wikipedia is a good choice. Language detection can be achieved with http://textcat.sourceforge.net/
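
A minimal sketch of that Lucene route, against the 3.x API that was current at the time (english-words.txt is a placeholder for whatever per-language word list you build, e.g. from a Wikipedia dump):

    import java.io.File;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class LuceneSpellCheckSketch {
        public static void main(String[] args) throws Exception {
            // Build one spell-check index per language from a plain word list,
            // one word per line.
            Directory spellIndex = new RAMDirectory();
            SpellChecker checker = new SpellChecker(spellIndex);
            checker.indexDictionary(new PlainTextDictionary(new File("english-words.txt")));

            // Look up a misspelling and ask for the closest dictionary words.
            String word = "langauge";
            if (!checker.exist(word)) {
                for (String suggestion : checker.suggestSimilar(word, 5)) {
                    System.out.println(word + " -> " + suggestion);
                }
            }
            checker.close();
        }
    }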

yura
    Lucene's spellchecker is (or at least was, a few versions ago) horribly slow because it computes normalized Levenshtein distance between the unknown word and *every* word in its dictionary. – Fred Foo May 04 '11 at 10:40
  • Yes I'm afraid Google is out for scalability and licensing reasons. Should have stated my minimum list of languages, but it's greater than what I believe TextCat can detect at present. Mostly I'm just checking I'm not needlessly re-inventing the wheel; it looks like I'm not. – Oliver Emberton May 04 '11 at 12:00
  • @larsmans: No, not for every word. It first searches by word n-grams, and then evaluates a fast-fail Levenshtein. Anyway, it is not very good – yura May 04 '11 at 12:04

With the LanguageTool library (http://www.languagetool.org) you can select the languages you need and have the content checked against your set of languages. E.g. for a French/English website you'd check the text against both English and French. Obviously there will be more errors when you check against the wrong language.

Example:

If you check, for example, the French text from http://fr.wikipedia.org/wiki/Charte_de_la_langue_fran%C3%A7aise:

La Charte de la langue française (communément appelée la loi 101) est 
une loi définissant les droits linguistiques de tous les citoyens du 
Québec et faisant du français la langue officielle du Québec.

on http://www.languagetool.org it will show no errors for French and more than 20 errors for English/GB.

The corresponding English text:

The Charter of the French Language (French: La charte de la langue française), also 
known as Bill 101 (Law 101 or French: Loi 101), is a law in the province of Quebec 
in Canada defining French, the language of the majority of the population, as the 
official language of Quebec and framing fundamental language rights. It is the central
legislative piece in Quebec's language policy.

will show 4 errors for English/GB (due to the French citation) and more than 20 errors when you check it against French.
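
To turn that observation into per-sentence language detection, a sketch using the LanguageTool Java API (class names are from recent releases and may differ in older ones) is simply to check the sentence against each candidate language and keep the one with the fewest rule matches:

    import java.util.Arrays;
    import java.util.List;
    import org.languagetool.JLanguageTool;
    import org.languagetool.Language;
    import org.languagetool.language.BritishEnglish;
    import org.languagetool.language.French;
    import org.languagetool.rules.RuleMatch;

    public class PickLanguageByErrorCount {
        public static void main(String[] args) throws Exception {
            String sentence = "La Charte de la langue française est une loi du Québec.";
            List<Language> candidates = Arrays.asList(new French(), new BritishEnglish());

            Language best = null;
            int fewest = Integer.MAX_VALUE;
            for (Language lang : candidates) {
                // Each JLanguageTool instance checks against one language's rules.
                JLanguageTool tool = new JLanguageTool(lang);
                List<RuleMatch> matches = tool.check(sentence);
                System.out.println(lang.getName() + ": " + matches.size() + " matches");
                if (matches.size() < fewest) {
                    fewest = matches.size();
                    best = lang;
                }
            }
            System.out.println("Detected: " + best.getName());
        }
    }

As the Wikipedia example above shows, the gap between the right and wrong language is usually large enough (0 vs. 20+ matches) that taking the simple minimum works.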

Wolfgang Fahl