I'm working on spell-checking mixed-language webpages, and I haven't been able to find any existing research on the subject.
The aim is to automatically detect the language at sentence level within mixed-language webpages and spell check each sentence against its appropriate language. Assume that we can ignore sentences which mix multiple languages together (e.g. "He has a certain je ne sais quoi"), and assume webpages can't contain more than 2 or 3 languages.
Trivial example (Welsh + English): http://wales.gov.uk/
I'm currently using a mix of:
- Character distribution (e.g. U+0600–U+06FF = Arabic, etc.)
- Character n-grams to discriminate between languages that share a script
- Dictionary lookup to discern locale, e.g. en-US vs. en-GB
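For the first point, my character-distribution check is roughly the following (a minimal sketch; the `SCRIPT_RANGES` table is illustrative and far from exhaustive, and real Unicode blocks are more fragmented than this):

```python
from collections import Counter

# Toy table of Unicode code-point ranges per script (illustrative only).
SCRIPT_RANGES = {
    "Arabic": (0x0600, 0x06FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Latin": (0x0041, 0x024F),
}

def dominant_script(text):
    """Return the script whose range covers the most characters, or None."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else None
```

This alone settles scripts like Arabic vs. Latin, but obviously can't separate, say, Welsh from English, which is where the n-grams come in.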
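For the second point, the n-gram step is essentially a character-trigram profile per language compared by cosine similarity (a sketch of the general technique; the training strings here are toy stand-ins for the large corpora you'd build real profiles from):

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Character n-gram counts, with padding so word boundaries count too."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    num = sum(a[g] * b[g] for g in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Toy profiles; real ones come from substantial per-language corpora.
PROFILES = {
    "en": ngrams("the quick brown fox jumps over the lazy dog and then some"),
    "cy": ngrams("mae'r iaith gymraeg yn cael ei siarad yng nghymru"),
}

def detect(sentence):
    """Pick the language whose profile is most similar to the sentence."""
    grams = ngrams(sentence)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))
```

Rank-order statistics over the top few hundred n-grams (à la Cavnar & Trenkle's "N-Gram-Based Text Categorization") would be a common alternative to cosine here.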
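For the third point, the locale disambiguation amounts to counting which locale-specific dictionary covers more of the tokens (a sketch under the assumption that full en-US / en-GB wordlists are available; the sets below are tiny placeholders):

```python
# Toy stand-ins for full en-US / en-GB dictionaries.
EN_US = {"color", "organize", "center", "the", "is", "my", "favorite"}
EN_GB = {"colour", "organise", "centre", "the", "is", "my", "favourite"}

def locale_of(sentence):
    """Guess en-US vs. en-GB by dictionary coverage of the tokens."""
    tokens = sentence.lower().split()
    us = sum(t in EN_US for t in tokens)
    gb = sum(t in EN_GB for t in tokens)
    return "en-US" if us >= gb else "en-GB"
```

In practice most tokens are spelled identically in both locales, so the decision should probably be made over the whole page rather than per sentence, weighting only the words that actually differ.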
I have working code but am concerned it may be naive or needlessly reinventing the wheel. Has anyone else done this before?