read tweets extracted with python

Question

I am trying to read tweets in excel. Tweets have been retrieved with python (and tweepy) then saved in a csv file:

# -*- coding: utf-8 -*-
writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w"), lineterminator='\n', delimiter =';')
writer.writerow(["username", "nb_followers", "tweet_text"])

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

for tweet in tweepy.Cursor(api.search, q="dengue+OR+%23dengue", lang="en", since=date, until=end_date).items():
    username=tweet.user.screen_name
    nb_followers=tweet.user.followers_count
    tweet_text=tweet.text.encode('utf-8')  

    writer.writerow([username, nb_followers, tweet_text])

Due to the utf-8 encoding, I have problems reading them in a text editor or excel. For example this tweet:

gives this in excel:

b"\xe2\x80\x9c@ThislsWow: I want to do this \xf0\x9f\x98\x8d http://t.co/rGfv9e70Tj\xe2\x80\x9d pu\xc3\xb1eta you're going to get bitten by the mosquito and get dengue"

How to get the original characters? How to remove the b at the beginning, useful only in a python program?

EDIT :

As per Alastair McCormack's comment: I removed the encoding of my field and added it in the writer:

writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="UTF-8"), lineterminator='\n', delimiter =';')
tweet_text=tweet.text.replace("\n", "").replace("\r", "")

Now I have the following error:

tweet: Traceback (most recent call last):
  File "twitter_influence.py", line 88, in <module>
    print("tweet:", tweet_text)
  File "C:\Users\rlalande\Envs\tweepy\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 137: character maps to <undefined>

EDIT2 :

I am now using the following:

import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

(seen in this post: https://stackoverflow.com/a/4374457/1875861)

There is no more error but it doesn't output the correct characters.

For example this tweet: gives this output in excel:

Malay Mail Online  Alarming rise in dengue casesMalay Mail Onlineâ€œThe ministry started a campaign for constructionâ€¦ http://t.co/MuLFlMwkY0

Before, with direct encoding of the field, I had:

b'Malay Mail Online\n\nAlarming rise in dengue casesMalay Mail Online\xe2\x80\x9cThe ministry started a campaign for construction\xe2\x80\xa6 http://t.co/MuLFlMwkY0'

The result is different but not really better... Why is the quote character not outputted correctly? In one case it outputs â€¦ and in the other case \xe2\x80\xa6.

Alastair McCormack · Accepted Answer · 2015-09-07T16:48:00.850

1

It's because the CSV writer expects all input to be Unicode strings. You're getting the __repr__() of a byte string.

Set the encoding of your output file by replacing the first line with:

writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="UTF-8"), lineterminator='\n', delimiter =';')

This means that any Unicode strings written to the file will be translated automagically. Then remove the explicit encode():

tweet_text=tweet.text

Edit:

Excel needs to be coerced into reading UTF-8 files if you don't use the import function. The easiest way to do this is to add UTF-8 BOM signature to the start of the file.

Python provides a shortcut if you use the utf_8_sig encoding. E.g.

writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="utf_8_sig"), lineterminator='\n', delimiter =';')

You can also check your file in a decent UTF-8 editor like Notepad++ or Atom.

edited Sep 07 '15 at 16:48

answered Sep 04 '15 at 19:27

Alastair McCormack

26,573
8
77
100

Where and what's the error? Can you point to the line that is throwing the error? – Alastair McCormack Sep 07 '15 at 07:50
Also, which version of Python 3.x are you using? I see that early 3.x versions didn't support Unicode in CSV in the same way as 2.x does. – Alastair McCormack Sep 07 '15 at 08:01
I have python 3.4.3. I have edited my initial post to provide more information about the error. – rom Sep 07 '15 at 11:46
1

Remove the print statement (as per original line listing) and it'll work. To ensure your stdout works correctly, set `PYTHONIOENCODING=utf-8` in your environment or see: http://stackoverflow.com/questions/4374455/how-to-set-sys-stdout-encoding-in-python-3 – Alastair McCormack Sep 07 '15 at 12:02
@rom I've just noticed you're on Windows. Windows Console encoding is brokem. Make sure to run your scripts within an IDE (IDLE, Eclipse or PyCharm), or iPython, or follow: https://pypi.python.org/pypi/win_unicode_console. You'll probably find it easier just to always write UTF-8 output to a file instead. – Alastair McCormack Sep 07 '15 at 12:26
I just tried win unicode console, it doesn't work for me. However, using the post you linked is interesting as it gives a slightly different result. But the problem is still not solved. See my original post... – rom Sep 07 '15 at 15:36
1

See my update. Validate you can see the chars in something other than Excel. – Alastair McCormack Sep 07 '15 at 16:49
You're right, it works quite well in notepad++. The weird characters correspond to "...". There are still other undefined characters, represented with squares. They correspond to smileys, so it doesn't matter much for my needs ;). Thanks! – rom Sep 09 '15 at 07:12

read tweets extracted with python

1 Answers1