How to filter out unicode characters in Python?

Question

I have been working with Unicode values in Python for some time now. First of all, the questions and answers here have helped a lot. Thanks :)

Now I am stuck on a part of my project where I want to isolate the Unicode code points for each language.

For example, a certain function should only accept Hindi characters, whose Unicode code points run from 0900 to 097F. I want it to reject all other Unicode values.

So far I have written:

for i in range(len(l1)):
    for j in range(len(l1[i])):
        unn = '%04x' % ord(l1[i][j])
        unn1 = int(unn, 16)
        if unn1 not in range(2304, 2431):
            l1[i] = l1[i].replace(l1[i][j], '')

This code takes the values from a list l1 and does what I want. The problem is that it handles one character and then terminates at line 3.

When I run it again manually, it handles another one or two characters and then terminates again.

I can't even put it inside a loop.

Please help.

Updated:

I didn't want to open another post, so I am using this one. I got some help and modified the code, but there is still an index problem.

for i in range(len(dictt)):
    j=0
    while(1):
        if j >= len(dictt[i]):
            break
        unn = '%04x' % ord(dictt[i][j])
        unn1 = int(unn, 16)
        j = j+1
        if unn1 not in range(2304, 2431):
            dictt[i] = dictt[i].replace(dictt[i][j-1], '')
            j=0

This code works perfectly for my previous query, that is, for one specific range. But if I change the range or the functionality, the same problem arises again at the same line. Why is that line giving an error?

I'm not sure I understand. So you want to allow unicode characters from the specified range and ignore the rest ? Or do you just want to get rid of all unicode characters? — Marek Kowalski, Jun 11 '13 at 08:48
Also, what do you mean by 'terminates'? Is an exception thrown or something? — Blubber, Jun 11 '13 at 08:53
I want to allow a certain range and remove the rest of them. — Shruti Joshi, Jun 12 '13 at 06:06
And yes, it raises an exception for the line unn = '%04x' % ord(l1[i][j]), saying that the string index is out of range, and I don't know why. — Shruti Joshi, Jun 12 '13 at 06:07

score 1 · Accepted Answer · edited May 23 '17 at 12:18

The best solution is most likely to use a regex to filter out the unwanted characters. You basically need a regex that matches your Hindi characters, but as far as I know the built-in "re" module has bugs with Hindi characters, so I recommend installing the "regex" module with the command:

$ pip install regex

After that you can simply check word by word whether every word is written in Hindi:

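# Sketch only: HINDI_WORD_REGEX and the sample string are assumptions
import regex

# Assumed pattern: one or more code points from the block U+0900..U+097F
HINDI_WORD_REGEX = r'[\u0900-\u097F]+'
your_string = 'नमस्ते hello दुनिया'
words = your_string.split(" ")
for word in words:
    # fullmatch requires the whole word to consist of Hindi characters
    if not regex.fullmatch(HINDI_WORD_REGEX, word):
        # whatever you want to do with a non-Hindi word
        print(word)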
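
The check above works word by word. To strip individual characters instead, a plain code-point range is enough, and on Python 3 the standard re module appears to handle such a range without trouble. A minimal sketch under that assumption (keep_devanagari and NON_DEVANAGARI are illustrative names, not part of the original answer):

```python
import re

# Assumption: Python 3, where str patterns accept \uXXXX escapes.
# Matches every character outside the block U+0900..U+097F.
NON_DEVANAGARI = re.compile(r'[^\u0900-\u097F]')

def keep_devanagari(text):
    """Return text with every character outside U+0900..U+097F removed."""
    return NON_DEVANAGARI.sub('', text)

print(keep_devanagari('नमस्ते, world! दुनिया'))
```

Note that spaces and punctuation fall outside the range too, so they are removed along with the Latin letters.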

You can also find some useful information related to your problems here:

Python - pyparsing unicode characters

Python unicode regular expression matching failing with some unicode characters -bug or mistake?

Hope this at least helps you to start. Good luck!

score 0 · Answer 2 · answered Jun 11 '13 at 09:06

def filter_by_range(text, allowed):
    # Renamed so the built-ins filter() and range() are not shadowed
    return ''.join(char for char in text if ord(char) in allowed)
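
A possible way to apply this idea to the list from the question, as a sketch. The helper name keep_in_range, the constant DEVANAGARI and the sample strings are made up for illustration; the range stops at 0x0980 on the assumption that U+097F should be kept, since range() excludes its stop value.

```python
# Hypothetical helper; same idea as the function above.
def keep_in_range(text, allowed):
    """Keep only the characters whose code point is in `allowed`."""
    return ''.join(ch for ch in text if ord(ch) in allowed)

# range() excludes its stop value, so 0x0980 keeps U+097F as well.
DEVANAGARI = range(0x0900, 0x0980)

# Sample data standing in for the question's list l1.
l1 = ['नमस्ते abc', 'xyz', 'दुनिया 123']
l1 = [keep_in_range(s, DEVANAGARI) for s in l1]
print(l1)
```

Building a new list of new strings avoids modifying anything while it is being indexed. On Python 3 the membership test against a range object is cheap; on Python 2 range() builds a list, so a comparison of the two bounds would be the lighter choice there.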

Pouya

score 0 · Answer 3 · answered Jun 11 '13 at 09:09

Try this:

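def converter(string_, range_=(2304, 2431)):
    """Keep only the characters whose code point lies in [low, high)."""
    # The upper bound is exclusive: pass (2304, 2432) to also keep U+097F.
    low, high = range_  # renamed so the built-ins min() and max() are not shadowed
    return ''.join(c for c in string_ if low <= ord(c) < high)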

Oleksandr Fedorov
