I am trying to analyse some tweets I collected from Twitter, but I seem to have an encoding problem. Any ideas?

import json

# Next, read the data into a list called tweets_data.
tweets_data_path = 'C:/Python34/TESTS/twitter_data.txt'

tweets_data = []
tweets_file = open(tweets_data_path, "r")


for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue

print(len(tweets_data))  # 412 tweets
print(tweet)

I get this error:

  File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 1345: character maps to <undefined>

At work I didn't get the error, but there I have Python 3.3. Do you think that makes a difference?

-----EDIT

The comment from @MarkRansom answered my question.

Stéphanie C
  • Can you show the line where the UnicodeEncodeError happens? How did you write these tweets? Did you encode them in UTF-8? – Raito Feb 08 '15 at 21:12
  • I will look tonight, but I got the tweets from the Twitter API and checked that the file's encoding was UTF-8. – Stéphanie C Feb 09 '15 at 11:07
  • The problem is that the console you're running on is not capable of handling the character you're trying to print: `…` See http://stackoverflow.com/questions/3597480/how-to-make-python-3-print-utf8 for some hints. – Mark Ransom Feb 09 '15 at 21:29
  • That is exactly it! Thank you so much. – Stéphanie C Feb 23 '15 at 03:03
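
To illustrate the comment about the console: the traceback points at cp850, a Windows console code page that has no ellipsis character, so the failure happens when printing rather than when reading the file. Below is a minimal sketch of one workaround, assuming it is acceptable to display unencodable characters as backslash escapes. `console_safe` and the sample `tweet` are hypothetical names used only for illustration.

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters the target encoding cannot represent with backslash escapes."""
    # Fall back to the console's own encoding, or ASCII if it is unknown.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Hypothetical sample; a real tweet dict would come from json.loads().
tweet = {"text": "Read more\u2026", "lang": "en"}
print(console_safe(str(tweet)))
```

Because `tweet` is a dict, it has to go through `str()` before it can be encoded. The built-in `ascii(tweet)` gives a similar escaped representation without a helper, and setting the `PYTHONIOENCODING` environment variable before starting Python is another commonly used option.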

1 Answer

You should specify the encoding when opening the file:

tweets_file = open(tweets_data_path, "r", encoding="utf-8-sig")
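
As a slightly fuller sketch of the same idea, assuming the file really is UTF-8 with one JSON object per line (the thread does not confirm this), the reading loop could be wrapped as follows. `load_tweets` is a hypothetical helper name; it also closes the file and catches only JSON parsing errors instead of every exception.

```python
import json


def load_tweets(path, encoding="utf-8-sig"):
    """Read one JSON object per line, skipping blank or malformed lines."""
    tweets = []
    # "utf-8-sig" strips a leading BOM if present and otherwise behaves like UTF-8.
    with open(path, "r", encoding=encoding) as tweets_file:
        for line in tweets_file:
            line = line.strip()
            if not line:
                continue
            try:
                tweets.append(json.loads(line))
            except ValueError:  # raised by json.loads for invalid JSON
                continue
    return tweets
```

A call such as `tweets_data = load_tweets(tweets_data_path)` would then replace the original loop.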
Tim Pietzcker
  • Is utf-8-sig an educated guess? ;) Then you should argue for it. – Dr. Jan-Philip Gehrcke Feb 08 '15 at 21:14
  • @Jan-PhilipGehrcke: U+2026 is the ellipsis character which makes sense in a tweet. Also, UTF-8 is the most likely encoding that any Twitter API would use. So yes, I'd say it's an educated guess... (and the -sig part is just a precaution - if there is a BOM, it handles it, if there isn't, no harm done). – Tim Pietzcker Feb 08 '15 at 21:15
  • Okay, half convinced. And to state the obvious, for other readers: the data is coming from a random file twitter_data.txt, so we actually cannot reliably know how the data in there is encoded. – Dr. Jan-Philip Gehrcke Feb 08 '15 at 21:25
  • I guess the problem is rather with the `print` line than `open`. – georg Feb 08 '15 at 21:28
  • @georg: Quite possibly, yes. `print(tweet.encode("ascii", errors="ignore").decode("ascii"))` might help... (instead of "ascii", you should probably use your terminal's encoding). – Tim Pietzcker Feb 08 '15 at 21:34
  • Actually, I got an error when trying to extract the content with the following lines: `tweets = pd.DataFrame()` and `tweets['lang'] = map(lambda tweet: tweet['lang'], tweets_data)`. I added a print to see why I couldn't get the content, and I assumed the cause was the encoding problem that print ran into. I tried the same code today at work (on another computer), and I didn't get the encoding error... I wish I could understand why, but I will try your suggestion tonight on my home computer and keep you updated. Thank you! – Stéphanie C Feb 09 '15 at 11:30
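
The thread never shows the error raised by the `map` line, so the following is only a guess at two possible causes, independent of the console encoding. In Python 3, `map` returns a lazy iterator rather than a list. In addition, some records in the file may not be tweets and so may lack a `'lang'` key (an assumption about the data). A minimal sketch that sidesteps both, using made-up sample records and a hypothetical helper `extract_field`:

```python
# Made-up sample records; the second one stands in for a non-tweet message.
tweets_data = [
    {"text": "hello\u2026", "lang": "en"},
    {"delete": {"status": {"id": 1}}},
    {"text": "bonjour", "lang": "fr"},
]


def extract_field(records, key, default=None):
    """Return a concrete list of record[key], using default where the key is missing."""
    return [record.get(key, default) for record in records]


langs = extract_field(tweets_data, "lang")

# With pandas installed, the resulting list could then be assigned as a column,
# e.g. tweets['lang'] = langs
```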