
I'm trying to write an algorithm to remove non-ASCII characters from a list of strings. I built the list by scraping paragraphs from a web page and appending them to a list. To do this, I wrote a nested for loop that loops through each element of the list, then loops through the characters of that string. My example list of strings is here:

text = ['The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China',
        'It is characterised by large, black patches around its eyes, over the ears, and across its round body']

My last action is then to replace characters if their ord() value is greater than 128. Like so:

def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """

    for i in range(len(text_list)):
        # for each string in the text list
        for char in text_list[i]:
            # for each character in the individual string
            if ord(char) > 128:
                text_list[i] = text_list[i].replace(char, '')

    return text_list

And this works fine as a nested for loop. But since I want to scale this, I thought I'd write it as a list comprehension. Like so:

def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """

    scrubbed_text = [text_list[i].replace(char, '') for i in range(len(text_list))
                     for char in text_list[i] if ord(char) > 128]

    return scrubbed_text

But this doesn't work for some reason. At first I thought it might have to do with the method I was using in my expression to remove the Unicode characters, since text_list is a list but text_list[i] is a string, so I changed my method from .strip() to .replace(). That didn't work. Then I thought it might have to do with where I was placing the .replace(), so I moved it around the list comprehension with no change. So I'm at a loss. Maybe the issue is specific to converting this particular nested for loop, the one that filters Unicode, into a comprehension: not every for loop can be written as a list comprehension, even though every list comprehension can be written as a for loop.
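For reference, here is a small reproduction (with made-up sample data) showing what the comprehension actually returns, rather than raising an error:

```python
# Small repro with made-up sample data: 'café ✓' has two non-ASCII
# characters, 'plain' has none.
text_list = ['caf\u00e9 \u2713', 'plain']

scrubbed = [text_list[i].replace(char, '')
            for i in range(len(text_list))
            for char in text_list[i] if ord(char) > 128]

print(scrubbed)
# I get one element per non-ASCII character, each with only that one
# character removed, and 'plain' (all ASCII) vanishes from the result:
# ['caf \u2713', 'caf\u00e9 ']
```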

  • List comprehension, in general, is not faster than a nested loop. Your second solution, even if correct, is not more scalable than the first. Consider using regular expression substitution instead. – DYZ Mar 07 '21 at 05:18
  • @DYZ: a list comprehension is also not necessarily slower than a nested loop so that is not a reason to avoid use of one (if that is what you are implying?). A list comprehension is more concise and "Pythonic" so would be preferred in many cases (performance issues aside). Regarding performance, a generator expression could be more efficient/scalable depending on how the result is to be consumed. – mhawke Mar 07 '21 at 05:32
  • @mhawke your encode/decode solution is actually much faster though, by an order of magnitude. – Mark Tolonen Mar 07 '21 at 05:34
  • @mhawke Always pays to measure if you care about the speed. – Mark Tolonen Mar 07 '21 at 05:40
  • @mhawke I did not say list comprehension is slower. I said it is not faster. And a generator expression is typically slower than list comprehension. – DYZ Mar 07 '21 at 05:59
  • @MarkTolonen How would you measure the performance of your algorithm? Is there an API or library that counts ticks or ns to complete the algorithm? – pancham2016 Mar 08 '21 at 18:32
  • @mhawke Doesn't it depend on the actual implementation of the dictionary as to how efficiently it performs? I just assumed that dictionaries were faster than nested for loops, since nested for loops are usually O(n^2) – pancham2016 Mar 08 '21 at 18:34
  • @pancham2016 I used the `%timeit` magic in IPython, or you can use the `timeit` module. – Mark Tolonen Mar 08 '21 at 19:00
  • @DYZ: Sorry, I thought that I was careful to not say that you said it was slower :) I meant to say (and I think that I did) that list comprehensions are preferable to standard loops due to their conciseness and idiomatic use in Python. – mhawke Mar 08 '21 at 23:18
  • @DYZ: How is a generator expression typically slower? It's more efficient when it is used in the right way. Such blanket statements are very misleading to those that might not fully grasp the nuances. On a large amount of data that can be consumed in chunks, it could avoid the runtime and memory expenses of building a fully resolved list. In this example, yielding each converted list item could be more efficient *if* the client code does not require the results in one go. Similarly, the `text_list` could itself be a generator. We would need to know more about the client code to be certain. – mhawke Mar 08 '21 at 23:28
  • @mhawke A generator expression further converted to a list is slower than the equivalent list comprehension. – DYZ Mar 09 '21 at 00:53
  • @DYZ: I'm sure that you can see that that is not always true when you look at the full context. If you do not want to consider the full context then your statement is true. But I think that you know that :) Otherwise it is just another blanket statement that misleads others. If not for improved efficiency in the right context, for what reason do generator expressions exist? – mhawke Mar 09 '21 at 01:01
  • @mhawke Generator expressions exist to avoid creating large _intermediate_ lists (which is not the case in the OP). They do it at the expense of performance. – DYZ Mar 09 '21 at 04:16
  • @DYZ: yes, which reduces time and memory requirements due to performance issues caused by low memory environments. In the toy example that is this question there would not be a performance advantage. In the real world there _might_ be - which is why I took pains to explain that the context _might_ be important. Please don't persist with this straw man argument - you need to look at all factors and avoid the dogmatic opinion that you have about generators. – mhawke Mar 09 '21 at 09:26
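As a side note on the timing discussion above, a comparison with the `timeit` module mentioned in the comments can be sketched roughly like this (the workload and function names are illustrative, not from the question):

```python
import timeit

# Made-up workload: many copies of a string containing CJK characters.
sample = ['giant panda \u5927\u718a\u732b'] * 1000

def by_char_filter(lines):
    # Per-character comprehension: keep only code points below 128.
    return [''.join(c for c in s if ord(c) < 128) for s in lines]

def by_encode(lines):
    # Encode/decode round-trip: non-ASCII characters are silently dropped.
    return [s.encode('ascii', errors='ignore').decode('ascii') for s in lines]

# Time 100 runs of each approach over the same data.
print(timeit.timeit(lambda: by_char_filter(sample), number=100))
print(timeit.timeit(lambda: by_encode(sample), number=100))
```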

2 Answers


There is an easier way to remove non-ASCII characters: encode the string to ASCII and specify errors='ignore' to have them dropped. For example:

text = ['The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China',
        'It is characterised by large, black patches around its eyes, over the ears, and across its round body']

>>> text[0].encode('ascii', errors='ignore')
b'The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China'

That will give you a byte string, i.e. the result is of type bytes. You can convert that back to a Python string using decode():

>>> text[0].encode('ascii', errors='ignore').decode()
'The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China'

You could be pedantic and specify .decode('ascii') but your default codec will likely cover that already.

To perform that as a list comprehension:

def remove_non_ascii_chars(text_list):
    return [s.encode('ascii', errors='ignore').decode('ascii') for s in text_list]

>>> remove_non_ascii_chars(text)
['The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China', 'It is characterised by large, black patches around its eyes, over the ears, and across its round body']

You could also code the function to return a generator which will be more scalable in many cases, depending on how the strings are consumed in subsequent code:

def remove_non_ascii_chars(text_list):
    return (s.encode('ascii', errors='ignore').decode('ascii') for s in text_list)
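One way to consume that generator version, for example (reusing the function above with a shortened sample; nothing is processed until you iterate):

```python
def remove_non_ascii_chars(text_list):
    # Generator variant: items are cleaned lazily, one at a time.
    return (s.encode('ascii', errors='ignore').decode('ascii') for s in text_list)

cleaned = remove_non_ascii_chars(['d\u00e0xi\u00f3ngm\u0101o', 'plain ASCII'])
for line in cleaned:
    print(line)
# prints:
# dxingmo
# plain ASCII
```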
mhawke
  • Thanks, mhawke. Ya, I tried using .encode() and .decode() before, but made the mistake of using those methods on my data soup object before processing it – pancham2016 Mar 08 '21 at 18:29
  • @pancham2016: so this solution works for you? – mhawke Mar 08 '21 at 23:30
  • Yes, it does. I can't quite pin down what my original mistake was but it had something to do with at what step in processing I decided to use decode and encode – pancham2016 Mar 13 '21 at 01:29

You need an outer loop (or a second comprehension) to iterate over the list, and an inner one to iterate over each string:

def remove_utf_chars(text_list):
    scrubbed_text = ["".join([y for y in x if ord(y) < 128]) for x in text_list]
    return scrubbed_text
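For example, applied to a short sample abbreviated from the question's data:

```python
def remove_utf_chars(text_list):
    # Outer comprehension walks the list; the inner one rebuilds each
    # string from its ASCII characters only.
    return ["".join([y for y in x if ord(y) < 128]) for x in text_list]

print(remove_utf_chars(['d\u00e0xi\u00f3ngm\u0101o,[5]', 'round body']))
# ['dxingmo,[5]', 'round body']
```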
Anon Coward