Encoding tweets to UTF-8 creates weird characters in Python

Question

I am downloading all of a user's tweets using the Twitter API.

When I download the tweets, I encode them in utf-8, before placing them in a CSV file.

tweet.text.encode("utf-8")

I'm using Python 3.

The issue is that this creates really weird characters in my files. For example, the tweet which reads

"But I’ve been talkin' to God for so long that if you look at my life, I guess he talkin' back."

Gets turned into

"b""But I\xe2\x80\x99ve been talkin' to God for so long that if you look at my life, I guess he talkin' back. """

(I see this when I open the CSV file that I wrote this encoded text to).

So my question is: how can I stop these weird characters from being created?

Also, if someone can explain what the b' that starts every line means, that would be super helpful.

Here is the full code:

outtweets = [[tweet.text.encode('utf-8')] for tweet in alltweets]

# write the csv
with open('%s_tweets.csv' % screen_name, 'wt') as f:
    writer = csv.writer(f)
    writer.writerow(["text"])
    writer.writerows(outtweets)

You don't have to encode the text explicitly. This is done by file output encoding. — Daniel, Jul 15 '17 at 18:05
But if I don't encode it explicitly, I get the following error: 'UnicodeEncodeError: 'ascii' codec can't encode character '\u201c' in position 1: ordinal not in range(128)' — James Dorfman, Jul 15 '17 at 18:06
It's okay for them to remain encoded. When reading into a program, you can open it and specify an encoding scheme: `with open('test.csv', encoding='utf-8') as f:` — cs95, Jul 15 '17 at 18:09

Anthon · Accepted Answer · 2018-11-10T15:51:29.447

4

That is not a strange character; it is a RIGHT SINGLE QUOTATION MARK (U+2019). You often see it in text submitted from OS X-based browsers.

If you need ASCII for everything, you can try the following (it drops any character that cannot be reduced to ASCII):

import unicodedata
unicodedata.normalize('NFKD', tweet.text).encode('ascii','ignore')
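To make the effect concrete, here is a small sketch. The helper name `to_ascii` and the sample strings are made up for illustration. U+2019 has no compatibility decomposition, so `'ignore'` removes it outright, whereas an accented letter such as é is split into a base letter plus a combining mark and keeps its base letter.

```python
import unicodedata

def to_ascii(text):
    """Hypothetical helper: decompose with NFKD, then drop what is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("But I\u2019ve been talkin'"))  # the curly apostrophe is removed entirely
print(to_ascii("caf\u00e9"))                   # the accent is removed, the letter stays
```

If the apostrophe should survive as a plain `'`, one option is to replace it explicitly (for example `text.replace("\u2019", "'")`) before encoding.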

If you encode a string into a bytes sequence and then output that bytes sequence, you should expect the b"..." prefix, which indicates a bytes sequence rather than a normal string.
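A short illustration of that point, using a shortened stand-in for the tweet. The `csv` module converts non-string values with `str()` before writing them, which is how the `b` prefix and the `\x` escapes end up in the file as literal text.

```python
text = "I\u2019ve"                  # a str containing U+2019
encoded = text.encode("utf-8")      # a bytes object; U+2019 becomes three bytes

print(type(encoded).__name__)       # the type is bytes, not str
print(repr(encoded))                # bytes are displayed with a b prefix and \x escapes
print(str(encoded))                 # the same text is what csv.writer puts in the file
print(encoded.decode("utf-8") == text)  # decoding restores the original string
```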

edited Nov 10 '18 at 15:51

answered Jul 15 '17 at 18:07

Anthon


Is there a way for me to convert all those into their proper symbols? – James Dorfman Jul 15 '17 at 18:10
Why do you think it is not a proper symbol? I hope you're not restricting properness to the ASCII character set. – Anthon Jul 15 '17 at 18:11
In case anyone else reads this answer and wonders what the `'NFKD'` option does, see the [wiki page](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) - via https://stackoverflow.com/a/14682498 – Bonlenfum Nov 12 '18 at 08:53

Hendrik Makait · Answer 2 · 2017-07-15T18:24:27.440

You are using str.encode(), which turns the string into a bytes object, hence the b at the beginning of the string.

https://docs.python.org/3/library/stdtypes.html#str.encode
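If a file has already been written this way, the cells can be turned back into text, because each one is just the printed form of a bytes literal. The following is a sketch under that assumption; `recover` is a hypothetical helper name, and the cell text should first be read with `csv.reader` so that the CSV quoting is undone.

```python
import ast

def recover(cell):
    """Hypothetical helper: parse the text of a bytes literal and decode it as UTF-8."""
    value = ast.literal_eval(cell)      # safely evaluates the literal, yielding bytes
    return value.decode("utf-8")

cell = str("I\u2019ve".encode("utf-8"))  # the kind of text the question's code stored
print(recover(cell))
```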

EDIT: I could not reproduce the UnicodeError from the code you provided. The following works fine for me:

import csv

class Tweet:
    def __init__(self, text):
        self.text = text

alltweets = [Tweet("But I’ve been talkin' to God for so long that if you look at my life, I guess he talkin' back.")]

outtweets = [ [tweet.text] for tweet in alltweets]

#write the csv
with open('test.csv', 'wt') as f:
    writer = csv.writer(f)
    writer.writerow(["text"])
    writer.writerows(outtweets)

resulting in

text
"But I’ve been talkin' to God for so long that if you look at my life, I guess he talkin' back."

Where exactly does the error get raised and for which string?
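One plausible reason the error does not reproduce everywhere (an assumption based on the `'ascii' codec` message quoted in the comments, not something confirmed in the question): `open()` without an `encoding` argument falls back to a platform- and locale-dependent default. Forcing `encoding="ascii"` mimics such an environment; the temporary path is only there to keep the sketch self-contained.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.csv")  # throwaway location for the demo

error = None
try:
    # encoding="ascii" stands in for a system whose default encoding is ASCII
    with open(path, "wt", encoding="ascii", newline="") as f:
        csv.writer(f).writerow(["I\u2019ve"])
except UnicodeEncodeError as exc:
    error = exc

print(error)
```

`locale.getpreferredencoding(False)` reports which default a given machine would use.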

Well, not encoding it in the first place would be the way to go. Could you share an example of your code so that we can see what exactly causes your error? It should suffice to just save the (by default UTF-8 encoded) string to the CSV. — Hendrik Makait, Jul 15 '17 at 18:12

score 1 · Answer 3 · answered Jul 15 '17 at 18:16

You have to specify the output encoding when writing your CSV file:

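import csv

# Text mode with an explicit encoding: the file object encodes the str values as UTF-8
with open("tweets.csv", 'wt', encoding="utf8", newline="") as output:
    writer = csv.writer(output)
    writer.writerows([tweet.text] for tweet in alltweets)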
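A self-contained round trip along the same lines, as a sketch: the `tweets` list stands in for the `tweet.text` values, and the temporary path is arbitrary. Writing plain `str` values and reading them back with the same encoding preserves U+2019 without any `b` prefix.

```python
import csv
import os
import tempfile

tweets = ["But I\u2019ve been talkin' to God"]          # stand-in for tweet.text values
path = os.path.join(tempfile.mkdtemp(), "tweets.csv")   # throwaway location for the demo

# Write str values; the file object does the UTF-8 encoding
with open(path, "wt", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])
    writer.writerows([t] for t in tweets)

# Read back with the same encoding
with open(path, "rt", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```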

Daniel

