0

I have been working with Unicode values in Python for some time now. First of all, the questions and answers here help a lot. Thanks :)

Now I am stuck in my project, where I want to isolate the Unicode values for each language.

For example, a certain function should accept only Hindi characters, which span the Unicode code points 0900 to 097F. I want it to reject all other Unicode values.

So far I have done this:

for i in range(len(l1)):
    for j in range(len(l1[i])):
        unn = '%04x' % ord(l1[i][j])
        unn1 = int(unn, 16)
        if unn1 not in range(2304, 2431):
            l1[i] = l1[i].replace(l1[i][j], '')

This code takes the values from a list l1 and does what I want. But the problem is that it handles one character and then terminates at line 3.

When I run it again manually, it handles one or two more characters and then terminates again.

I can't even put it inside a loop.

Please help


Updated:

I didn't want to create another post, so I am using this one. I got some help and modified the code. There is an index problem.

for i in range(len(dictt)):
    j=0
    while(1):
        if j >= len(dictt[i]):
            break
        unn = '%04x' % ord(dictt[i][j])
        unn1 = int(unn, 16)
        j = j+1
        if unn1 not in range(2304, 2431):
            dictt[i] = dictt[i].replace(dictt[i][j-1], '')
            j=0

This code works perfectly fine for my previous query, that is, for one specific range, but if I change the range or the functionality, the same problem arises again at the same line. Why is that line giving an error?

Shruti Joshi
  • 173
  • 12

3 Answers

1

The best solution is most likely to use a regex to filter out the unwanted characters. You basically need a regex that matches your Hindi characters, but as far as I know Hindi characters are buggy in the "re" module, so I recommend installing the "regex" module with the command:

$ pip install regex

After that you can simply check, word by word, whether every word is written in Hindi:

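# Sketch; needs the third-party "regex" module (pip install regex)
import regex

# Assumed pattern: one or more characters from U+0900..U+097F
HINDI_WORD_REGEX = regex.compile(r'[\u0900-\u097F]+')

your_string = 'नमस्ते hello दुनिया'  # replace with your own text
words = your_string.split(' ')
for word in words:
    # fullmatch() requires the whole word to be in the range
    if not HINDI_WORD_REGEX.fullmatch(word):
        print('not Hindi:', word)  # whatever you want to do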
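
If the goal is to strip the unwanted characters rather than test whole words, substituting on a negated character class is one option. This is only a minimal sketch, assuming Python 3, where as far as I can tell the standard re module handles this range (the same pattern should also work with regex.sub). The names NON_DEVANAGARI and keep_devanagari are made up for illustration:

```python
import re

# Matches any single character outside the block U+0900..U+097F
NON_DEVANAGARI = re.compile(r'[^\u0900-\u097F]')

def keep_devanagari(text):
    """Return text with every character outside U+0900..U+097F removed."""
    return NON_DEVANAGARI.sub('', text)

# Hypothetical sample data: mixed Devanagari and ASCII
words = ['\u0928\u092eabc', 'hello', '\u0926\u0941!']
cleaned = [keep_devanagari(w) for w in words]
```

Each string is rebuilt in one pass, so no index bookkeeping is needed.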

You can also find some useful information related to your problems here:

Python - pyparsing unicode characters

Python unicode regular expression matching failing with some unicode characters -bug or mistake?

Hope this at least helps you to start. Good luck!

Dropout
  • 13,653
  • 10
  • 56
  • 109
0
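def filter_chars(text, allowed):
    # `allowed` is any container of code points, e.g. range(0x0900, 0x0980);
    # the names avoid shadowing the built-ins filter() and range()
    return ''.join(char for char in text if ord(char) in allowed)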
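
For context, the loop in the question most likely stops with an IndexError: range(len(...)) is computed once, but replace() shortens the string, so a later index runs past the end. Building a new string sidesteps that. Below is a usage sketch, assuming Python 3. The names filter_by_range and DEVANAGARI and the sample list are illustrative. The range ends at 0x0980 because range() excludes its end value, so U+097F is still kept:

```python
def filter_by_range(text, allowed):
    """Keep only the characters whose code point is in `allowed`."""
    return ''.join(ch for ch in text if ord(ch) in allowed)

# U+0900..U+097F inclusive; range() stops just before 0x0980
DEVANAGARI = range(0x0900, 0x0980)

# Hypothetical sample list, standing in for l1 from the question
l1 = ['\u0928\u092eab', 'xyz', '\u097f?']

# Rebuild every entry instead of editing it while indexing into it
l1 = [filter_by_range(s, DEVANAGARI) for s in l1]
```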
Pouya
  • 159
  • 1
  • 6
0

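Try this:

def converter(string_, range_=(0x0900, 0x097F)):
    """Keep only characters whose code point lies in range_ (both ends inclusive)."""
    low, high = range_  # avoids shadowing the built-ins min() and max()
    # 0x0900..0x097F is 2304..2431, the block given in the question
    return ''.join(c for c in string_ if low <= ord(c) <= high)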
Oleksandr Fedorov
  • 1,213
  • 10
  • 17