I'm trying to write an algorithm that removes non-ASCII characters from a list of strings. I built the list by scraping paragraphs from a web page and appending each one. To do this, I wrote a nested for loop that iterates over each string in the list and then over the characters of that string. My example list of strings is:
text = [
    'The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China',
    'It is characterised by large, black patches around its eyes, over the ears, and across its round body',
]
The last step is to remove every character whose ord() value is greater than 128, like so:
def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """
    for i in range(len(text_list)):
        # for each string in the text list
        for char in text_list[i]:
            # for each character in the individual string
            if ord(char) > 128:
                text_list[i] = text_list[i].replace(char, '')
    return text_list
This works fine as a nested for loop. Since I want to scale this, I thought I'd rewrite it as a list comprehension, like so:
def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """
    scrubbed_text = [text_list[i].replace(char, '') for i in range(len(text_list))
                     for char in text_list[i] if ord(char) > 128]
    return scrubbed_text
But this doesn't work, and I can't tell why. At first I thought the problem was the method I was using to remove the Unicode characters, since text_list is a list but text_list[i] is a string, so I changed it from .strip() to .replace(). That didn't help. Then I thought it might be where I was calling .replace(), so I moved it around inside the list comprehension, with no change. So I'm at a loss. My best guess is that the problem lies in converting this specific nested loop, which filters Unicode characters, into a list comprehension, since not every for loop can be written as a list comprehension even though every list comprehension can be written as a for loop.
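To make the difference between the two forms concrete, here is a minimal sketch rather than a definitive fix. A comprehension with two `for` clauses emits one output element per inner iteration that passes the filter, whereas the loop version keeps reassigning the same `text_list[i]`, so the removals accumulate. The function names `flat_comprehension` and `remove_non_ascii` are hypothetical, and the sketch assumes "non-ASCII" means any code point above 127 (the snippets above use `> 128`, which would let the single code point 128 through).

```python
def flat_comprehension(text_list):
    """Mirror of the list comprehension above, for comparison only."""
    # One output element is produced for EVERY non-ASCII character found,
    # and each element has only that single character removed.
    return [line.replace(char, '')
            for line in text_list
            for char in line if ord(char) > 128]


def remove_non_ascii(text_list):
    """Return a new list with non-ASCII characters dropped from each string."""
    # Outer comprehension: exactly one output string per input string.
    # Inner generator: keep only characters in the ASCII range (0-127).
    return [''.join(char for char in line if ord(char) < 128)
            for line in text_list]


# 'dàxióngmāo' written with escapes so the code points are unambiguous.
sample = ['d\u00e0xi\u00f3ngm\u0101o', 'plain ascii']

print(len(flat_comprehension(sample)))  # 3: one entry per non-ASCII character
print(remove_non_ascii(sample))         # ['dxingmo', 'plain ascii']
```

Unlike the loop version, `remove_non_ascii` builds a new list and leaves its argument unchanged. If only ASCII output is needed, `line.encode('ascii', 'ignore').decode('ascii')` is another option that should give the same result per string.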