I am trying to read tweets in Excel. The tweets were retrieved with Python (and tweepy), then saved in a CSV file:
# -*- coding: utf-8 -*-
import csv
import tweepy
writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w"), lineterminator='\n', delimiter =';')
writer.writerow(["username", "nb_followers", "tweet_text"])
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
for tweet in tweepy.Cursor(api.search, q="dengue+OR+%23dengue", lang="en", since=date, until=end_date).items():
    username = tweet.user.screen_name
    nb_followers = tweet.user.followers_count
    tweet_text = tweet.text.encode('utf-8')
    writer.writerow([username, nb_followers, tweet_text])
Because of the UTF-8 encoding, I have problems reading them in a text editor or in Excel. For example, this tweet:
gives this in Excel:
b"\xe2\x80\x9c@ThislsWow: I want to do this \xf0\x9f\x98\x8d http://t.co/rGfv9e70Tj\xe2\x80\x9d pu\xc3\xb1eta you're going to get bitten by the mosquito and get dengue"
How can I get the original characters back? And how can I remove the b at the beginning, which is only useful inside a Python program?
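The b prefix can be reproduced without tweepy. Below is a minimal sketch, assuming Python 3; the sample text and the in-memory buffer are made up for illustration and stand in for tweet.text and the real file. It suggests that csv.writer converts a bytes field with str(), so the CSV receives the repr of the bytes, escape sequences included:

```python
import csv
import io

# Hypothetical sample text standing in for tweet.text
text = "pu\u00f1eta \u201cdengue\u201d"

buf = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buf, lineterminator="\n", delimiter=";")

# Encoding the field first hands csv.writer a bytes object;
# in Python 3 it is written as str(bytes), i.e. its repr.
writer.writerow(["as_bytes", text.encode("utf-8")])
# Passing the str itself keeps the original characters.
writer.writerow(["as_str", text])

rows = buf.getvalue().splitlines()
print(rows[0])         # the bytes row: b'...' with \x escapes
print(ascii(rows[1]))  # ascii() keeps the demo printable on any console
```

The first row shows the same b'...' form as in the Excel output above, while the second row still holds the real characters.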
EDIT :
Following Alastair McCormack's comment, I removed the encoding from my field and set it on the file passed to the writer instead:
writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="UTF-8"), lineterminator='\n', delimiter =';')
tweet_text=tweet.text.replace("\n", "").replace("\r", "")
Now I have the following error:
tweet: Traceback (most recent call last):
  File "twitter_influence.py", line 88, in <module>
    print("tweet:", tweet_text)
  File "C:\Users\rlalande\Envs\tweepy\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 137: character maps to <undefined>
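The traceback points at print, not at the CSV writer: the console encoding here is cp437, which has no mapping for U+2026 (the horizontal ellipsis). Below is a minimal sketch that reproduces the same failure on any platform by encoding explicitly; the sample string is made up for illustration:

```python
# Hypothetical sample text containing U+2026, as in the traceback
tweet_text = "campaign for construction\u2026"

failed_char = None
try:
    # print() on a cp437 console performs this encode step internally
    tweet_text.encode("cp437")
except UnicodeEncodeError as exc:
    failed_char = tweet_text[exc.start]
    print("cannot encode", ascii(failed_char), "at position", exc.start)

# UTF-8 can represent the character, so writing the file is not the issue
utf8_bytes = tweet_text.encode("utf-8")
print(utf8_bytes[-3:])
```

If this reading is right, the CSV file may already be written correctly, and only the debugging print to the console fails.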
EDIT2 :
I am now using the following:
import codecs
import sys
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
(seen in this post: https://stackoverflow.com/a/4374457/1875861)
The error is gone, but the output still does not show the correct characters.
For example, this tweet:
gives this output in Excel:
Malay Mail Online Alarming rise in dengue casesMalay Mail Online“The ministry started a campaign for construction… http://t.co/MuLFlMwkY0
Before, with direct encoding of the field, I had:
b'Malay Mail Online\n\nAlarming rise in dengue casesMalay Mail Online\xe2\x80\x9cThe ministry started a campaign for construction\xe2\x80\xa6 http://t.co/MuLFlMwkY0'
The result is different but not really better. Why is the quote character not output correctly? In one case the output is … and in the other it is \xe2\x80\xa6.
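Both forms refer to the same three bytes, which a short sketch can show. The cp1252 step is only an assumption about how a program that does not detect UTF-8 might interpret the file; it is not a confirmed description of what Excel does:

```python
ellipsis_char = "\u2026"  # the horizontal ellipsis from the tweet

# \xe2\x80\xa6 is simply the UTF-8 encoding of that one character
utf8_bytes = ellipsis_char.encode("utf-8")
print(utf8_bytes)

# Decoding with the matching codec restores the original character
restored = utf8_bytes.decode("utf-8")
print(restored == ellipsis_char)

# Assumption: a reader that guesses a legacy code page (cp1252 here)
# turns each of the three bytes into a separate character instead
misread = utf8_bytes.decode("cp1252")
print(ascii(misread), len(misread))
```

So the escaped form is the repr of the encoded bytes, and any garbling on display would come from reading those bytes with a different codec than the one used to write them.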