
I might be asking this the wrong way, so please correct me if I am. I need to establish whether a string contains non-ASCII characters in order to separate such strings from the ones that are purely ASCII.

I gather strings from multiple separate files and need to drop the ones containing non-ASCII characters so that I can place the remaining strings in a list for further use. Without any filtering I get the following error while extracting the strings:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 40: ordinal not in range(128)

I would like to achieve the following:

Read string
if string contains non-ascii
->do not add to list
else
->add to list.

All I need is to determine how to filter; I have the rest of the code intact.

  • 4
    Possible duplicate of [How to check if a string in Python is in ASCII?](http://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii) – Dleep Sep 30 '15 at 23:34

1 Answer


You can attempt to encode each string as ASCII and use a try/except to detect the ones that contain non-ASCII characters. Something like this might work for you:

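ascii_strings = []
non_ascii_strings = []
# sequence_of_strings is whatever iterable holds the strings you gathered
for s in sequence_of_strings:
    try:
        if isinstance(s, bytes):    # byte strings (Python 3 bytes, Python 2 str)
            s.decode('ascii')
        else:                       # text strings (Python 3 str, Python 2 unicode)
            s.encode('ascii')
        ascii_strings.append(s)     # only reached if no error was raised
    except UnicodeError:            # covers UnicodeEncodeError and UnicodeDecodeError
        non_ascii_strings.append(s)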

That's the general idea and it should work in Python 2 and 3.
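If you only need the pure-ASCII strings, the same check can be wrapped in a small helper and used as a filter. This is just a sketch: the name is_ascii and the sample values are made up for illustration and are not part of your code.

```python
def is_ascii(s):
    """Return True if s (text or bytes) contains only ASCII characters."""
    try:
        if isinstance(s, bytes):
            s.decode('ascii')   # raises UnicodeDecodeError on non-ASCII bytes
        else:
            s.encode('ascii')   # raises UnicodeEncodeError on non-ASCII text
    except UnicodeError:
        return False
    return True


# Hypothetical sample data standing in for the strings read from your files.
sample = [u'plain text', u'Zo\xeb', b'raw bytes', b'caf\xc3\xa9']

# Keep only the strings that are purely ASCII.
ascii_only = [s for s in sample if is_ascii(s)]
```

On Python 3.7 or later, str.isascii() and bytes.isascii() do the same check directly, so the helper is only needed if you also have to support older versions.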

mhawke