I need help extracting words from text that mixes English and Telugu. Here is my code so far:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
sentence="hello world యూనియన్ యూనియన్"
sentence=sentence.decode('utf-8')
for m in re.finditer(ur'(\w|\’\w|\'\w)+', sentence, re.UNICODE):
    start, end = m.span()
    word = m.group().encode('utf-8')
    print start, end, word
The result I'm expecting is:
0 5 hello
6 11 world
11 17 యూనియన్
17 23 యూనియన్
But the result I get is:
0 5 hello
6 11 world
12 13 య
14 15 న
16 18 యన
20 21 య
22 23 న
24 26 యన
The code splits the Telugu words into individual characters and reports separate start and end offsets for each piece. Is there any way to get the result in the expected format above, as whole words instead of characters?
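
The output I get suggests that `\w` skips the Telugu vowel signs and the virama, which are combining marks rather than letters, so each word breaks wherever one of them occurs. One possible direction, as a minimal sketch only: it is written for Python 3 (unlike my Python 2 code), it replaces the regex with a scan over Unicode general categories, and the helper names `is_word_char` and `iter_words` are made up for illustration. It also leaves out the apostrophe handling from my pattern.

```python
import unicodedata


def is_word_char(ch):
    # Hypothetical helper: treat letters (L*), combining marks (M*),
    # numbers (N*) and the underscore as parts of a word.
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "M", "N")


def iter_words(text):
    """Yield (start, end, word) for each run of word characters in text."""
    start = None
    for i, ch in enumerate(text):
        if is_word_char(ch):
            if start is None:
                start = i  # a new word begins here
        elif start is not None:
            yield start, i, text[start:i]  # the word ended just before i
            start = None
    if start is not None:
        # The text ended while still inside a word.
        yield start, len(text), text[start:]


sentence = "hello world యూనియన్ యూనియన్"
for start, end, word in iter_words(sentence):
    print(start, end, word)
```

The offsets here count Unicode code points, not bytes or visible characters, so the Telugu words come out as whole words but with wider spans than the `11 17` and `17 23` I wrote in the expected output.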