I need to detect language changes in a file and tag each word accordingly. I've come up with a hacky way that works for two languages (English and Greek).
The script is this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

# read the whole file
filename = sys.argv[1]
with open(filename, 'r') as f:
    content = f.read()

# rebuild the text, tagging every token
newSentence = ''
# for every token on every line: if its first byte doesn't decode as
# ASCII, assume the word is Greek, otherwise assume it is English
for line in content.split('\n'):
    newSentence += '\n'
    for token in line.split():
        try:
            token[0].decode('ascii')  # strict decode, so non-ASCII raises
            newSentence += ' /en' + token
        except UnicodeDecodeError:
            newSentence += ' /gr' + token

print newSentence
with open(filename + '_new.txt', 'w') as f:
    f.write(newSentence)
The main idea is that if the first letter of a word isn't ASCII-decodable, it can't be English, and Greek is the only other option in my files.
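(For what it's worth, in Python 3 the same first-letter check can be written without the try/except, since str.isascii() exists from 3.7 on, though it's exactly the same heuristic:

    def tag_token(token):
        # same idea: ASCII first letter -> English, anything else -> Greek
        return ('/en' if token[0].isascii() else '/gr') + token

)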
Now, I realize this is awfully hacky, and I'd like to know how I would go about doing it in a more Pythonic way, ideally in a way that works for more than two languages in a document.
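The closest I've gotten to that is a sketch (Python 3, untested in anger) that classifies each word by the Unicode script of its first letter via the standard unicodedata module, so any script the Unicode database can name would get a tag for free. The TAGS mapping is just something I made up, and of course LATIN only means "some Latin-alphabet language", not English specifically:

    import unicodedata

    def script_of(word):
        # unicodedata.name() starts with the script,
        # e.g. 'GREEK SMALL LETTER ALPHA' or 'LATIN SMALL LETTER A'
        try:
            return unicodedata.name(word[0]).split(' ')[0]
        except ValueError:  # character has no name in the Unicode database
            return 'UNKNOWN'

    # my own made-up tags per script
    TAGS = {'LATIN': '/en', 'GREEK': '/gr', 'CYRILLIC': '/ru'}

    def tag(word):
        return TAGS.get(script_of(word), '/??') + word

Is something along these lines considered reasonable, or is there a better tool for it?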
PS: I know how to detect the language of a document in general; I was wondering whether there is a faster way to detect just the changes, without invoking tools such as NLTK.
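For the changes specifically, the best I can imagine is running itertools.groupby over the per-word script, so a tag is emitted only when the script switches rather than on every word (this reuses the hypothetical script_of and TAGS from the sketch above):

    from itertools import groupby

    def tag_changes(words):
        # group consecutive words sharing a script; tag once per run
        out = []
        for script, run in groupby(words, key=script_of):
            out.append(TAGS.get(script, '/??'))
            out.extend(run)
        return ' '.join(out)

    # tag_changes('hello world γεια σου world'.split())
    # -> '/en hello world /gr γεια σου /en world'

Is there a more standard approach than this?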