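import re

# No encoding is given, so open() falls back to the platform default codec
file = open('twitter.txt', 'r')
for line in file:
    # Strip every character outside the ASCII range
    thing = re.sub(r'[^\x00-\x7f]', r'', line)
    print(thing)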

Hi, I'm very new to Python. After scraping a bunch of data from Twitter using Python, I put the data into a text file. The text file ends up with a lot of emojis and other non-ASCII characters that I can't seem to turn into a string. The above code is my attempt to remove the non-ASCII characters and turn the file into a string, but it gives me this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1607: character maps to <undefined>

How can I remove the non-ASCII characters and then turn the remaining text into a string?

  • this is probably what you're looking for https://stackoverflow.com/questions/1207457/convert-a-unicode-string-to-a-string-in-python-containing-extra-symbols – Jay Sep 23 '18 at 05:05
  • "emojis and other non-ASCII characters that can't be turned into a String" -- that's a misinterpretation, Python's strings are fully Unicode-capable. – Ulrich Eckhardt Sep 23 '18 at 06:11

1 Answer


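Up to Python 3.6

def return_only_ascii(text):
    # Keep only characters whose code point is in the ASCII range
    return ''.join(x for x in text if ord(x) < 128)

Python 3.7 and later

def return_only_ascii(text):
    # str.isascii() was added in Python 3.7
    return ''.join(x for x in text if x.isascii())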

Result

>>> return_only_ascii('José')
'Jos'
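
Note that the UnicodeDecodeError in the question is raised while the file is being read, before any filtering runs. The 'charmap' message suggests the file was opened with a Windows default codec such as cp1252, in which byte 0x9d is undefined. Here is a minimal sketch of reading the file with an explicit encoding and then filtering. It assumes the scraped file is UTF-8, which may not hold for your data, and `read_ascii_only` is a hypothetical helper name, not something from the question.

```python
def return_only_ascii(text):
    # Keep only characters whose code point is in the ASCII range
    return ''.join(ch for ch in text if ord(ch) < 128)


def read_ascii_only(path, encoding='utf-8'):
    # Assumption: the file is UTF-8 encoded. errors='ignore' silently
    # drops any bytes that cannot be decoded instead of raising
    # UnicodeDecodeError.
    with open(path, 'r', encoding=encoding, errors='ignore') as f:
        return return_only_ascii(f.read())
```

Calling `read_ascii_only('twitter.txt')` would then return the whole file as one string with the non-ASCII characters removed. If you need to keep the emojis and accented letters, open the file with the right encoding and skip the filtering step, since Python strings handle Unicode fine.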