
I need to detect language changes in a file and tag each word accordingly. I've come up with a hacky approach that works for two languages (English and Greek).

The script is this:

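#!/usr/bin/env python3
import sys

# Read the whole input file as UTF-8 text.
filename = sys.argv[1]
with open(filename, encoding='utf-8') as f:
    content = f.read()

# Build the tagged output line by line.
new_content = ''
# For every token: if its first character is ASCII, tag it as English (/en);
# otherwise assume it is Greek (/gr).
for line in content.split('\n'):
    new_content += '\n'
    for token in line.split():
        try:
            # Strict encoding raises UnicodeEncodeError for non-ASCII characters.
            # (With errors='ignore' nothing is ever raised, so every token
            # would be tagged /en.)
            token[0].encode('ascii')
            new_content += ' /en' + token
        except UnicodeEncodeError:
            new_content += ' /gr' + token

print(new_content)

# Write the tagged text next to the input file.
with open(filename + '_new.txt', 'w', encoding='utf-8') as f:
    f.write(new_content)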

The main idea is that if the first letter of a word isn't ASCII, the word can't be English, so it must be the only other option, Greek.

Now I realize this is awfully hacky, and I'd like to know how to do it in a more Pythonic way, ideally one that also works for multiple languages in a document.

PS. I know how to detect the language of a document in general; however, I was wondering whether there is a faster way to detect just the changes without invoking tools such as nltk.

themistoklik
    Your basic approach is probably as good as it gets for a limited two-language problem with distinct character sets, except that you cannot trust English to stay strictly within ASCII. There are legitimate uses for diacritics like trema (naïve, zoölogy) and acute accent (résumé, exposé), as well as assorted other accents in loan words like doppelgänger, smörgåsbord, and of course the euro € and pound £ symbols, etc. See also https://en.wikipedia.org/wiki/English_terms_with_diacritical_marks – tripleee Dec 17 '15 at 05:51
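
One way to tolerate the accented Latin letters mentioned in this comment might be to look at each character's Unicode name instead of testing for ASCII. The following is only a sketch under stated assumptions: `script_tag` and `tag_line` are hypothetical helper names, not part of the original script, and treating every Latin-script word as `en` simply carries over the two-language assumption from the question.

```python
import unicodedata


def script_tag(token):
    """Return a rough tag based on the script of the token's first letter.

    Assumption: Latin script means English and Greek script means Greek;
    anything else (digits, symbols, other scripts) is tagged 'other'.
    """
    for ch in token:
        if ch.isalpha():
            # Unicode names start with the script, e.g.
            # 'LATIN SMALL LETTER I WITH DIAERESIS' or 'GREEK SMALL LETTER KAPPA'.
            name = unicodedata.name(ch, '')
            if name.startswith('GREEK'):
                return 'gr'
            if name.startswith('LATIN'):
                return 'en'
            return 'other'
    return 'other'


def tag_line(line):
    """Prefix every whitespace-separated token with its tag, e.g. '/enword'."""
    return ' '.join('/{}{}'.format(script_tag(t), t) for t in line.split())
```

Supporting another script would then be a matter of adding one more prefix check (for example `CYRILLIC`), although telling apart languages that share a script would still need a real language detector.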

1 Answer


Since no other answer has been posted for a long time, I'm accepting the slightly edited initial script as the best fix for my problem.

While looking into it, I found another approach that is better than ignoring errors: normalizing the text first.
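
A minimal sketch of what normalizing first could look like, assuming Unicode NFKD decomposition is the kind of normalization meant here. The helper names `strip_accents` and `is_english` are hypothetical and only illustrate the idea: decompose accented letters, drop the combining marks, and only then test the first letter for ASCII.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_english(token):
    """Guess whether a token is English from its first letter.

    Assumption: after accents are stripped, an ASCII first letter means
    English and anything else means the other language.
    """
    try:
        strip_accents(token[0]).encode('ascii')
        return True
    except UnicodeEncodeError:
        return False
```

With this, a word such as "étude" would count as English because "é" decomposes to a plain "e" plus a combining accent, while Greek letters stay non-ASCII even after their tonos is removed. Symbols such as € have no ASCII decomposition, so tokens starting with them would still be treated as non-English.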

themistoklik