Python: UnicodeEncodeError: 'charmap' codec can't encode character '\u2026'

Question

I am trying to analyse some tweets I got from Twitter, but I seem to have an encoding problem. Any ideas?

import json

# Read the tweets (one JSON object per line) into a list called tweets_data.
tweets_data_path = 'C:/Python34/TESTS/twitter_data.txt'

tweets_data = []
tweets_file = open(tweets_data_path, "r")

for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except ValueError:
        # Skip blank lines and lines that are not valid JSON.
        continue

print(len(tweets_data))  # 412 tweets
print(tweet)

I got this error:

  File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 1345: character maps to <undefined>

At work I didn't get the error, but there I have Python 3.3. Do you think that makes a difference?

-----EDIT

The comment from @MarkRansom answered my question.

Can you provide the line where the UnicodeEncodeError happens? How did you write these tweets? Did you encode them in UTF-8? – Raito, Feb 08 '15 at 21:12
I will look tonight, but I got the tweets from the Twitter API and checked that the encoding of the file was UTF-8. – Stéphanie C, Feb 09 '15 at 11:07
The problem is that the console you're running on is not capable of handling the character you're trying to print: `…` See http://stackoverflow.com/questions/3597480/how-to-make-python-3-print-utf8 for some hints. – Mark Ransom, Feb 09 '15 at 21:29
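
The console explanation above can be checked without a console at all. The sketch below is only an illustration, not part of the original thread: it encodes the ellipsis with the cp850 codec named in the traceback and with UTF-8. It also prints `sys.stdout.encoding`, since a different console encoding is one possible reason (an assumption, not something confirmed in the question) why another machine shows no error.

```python
import sys

ellipsis = "\u2026"  # HORIZONTAL ELLIPSIS, the character from the traceback

# cp850 is the codec named in the traceback; try to encode the character with it.
try:
    ellipsis.encode("cp850")
    encodable = True
except UnicodeEncodeError as exc:
    encodable = False
    print("cp850 cannot encode it:", exc.reason)

# UTF-8 can represent the character, so the file on disk is not the problem.
utf8_bytes = ellipsis.encode("utf-8")
print(utf8_bytes)

# The encoding print() will use on this machine; compare it between computers.
print(sys.stdout.encoding)
```

If the last line prints something like `cp850` on one machine and `utf-8` on another, that difference alone would be enough to explain why `print(tweet)` fails on only one of them.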

score 2 · Answer 1 · answered Feb 08 '15 at 21:13

You should specify the encoding when opening the file:

tweets_file = open(tweets_data_path, "r", encoding="utf-8-sig")
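
A minimal sketch of that suggestion in context, assuming the file holds one JSON object per line as in the question. The helper name `load_tweets` and the temporary sample file are hypothetical stand-ins for `twitter_data.txt`. The sample is written with a BOM to show that `utf-8-sig` drops it on reading.

```python
import json
import os
import tempfile

def load_tweets(path):
    """Read one JSON object per line, skipping lines that are not valid JSON."""
    tweets = []
    # utf-8-sig decodes UTF-8 and drops a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig") as tweets_file:
        for line in tweets_file:
            try:
                tweets.append(json.loads(line))
            except ValueError:  # blank or malformed line
                continue
    return tweets

# Hypothetical sample data: two tweets separated by a blank line.
sample = (
    '{"text": "to be continued\\u2026", "lang": "en"}\n'
    "\n"
    '{"text": "caf\u00e9", "lang": "fr"}\n'
)

# Writing with utf-8-sig puts a BOM at the start of the file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8-sig", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)

tweets_data = load_tweets(tmp.name)
os.remove(tmp.name)
print(len(tweets_data))
```

Passing `encoding` explicitly also makes the script behave the same on every machine, because `open()` otherwise falls back to a locale-dependent default.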

Tim Pietzcker

Is utf-8-sig an educated guess? ;) Then you should argue for it. – Dr. Jan-Philip Gehrcke Feb 08 '15 at 21:14
@Jan-PhilipGehrcke: U+2026 is the ellipsis character which makes sense in a tweet. Also, UTF-8 is the most likely encoding that any Twitter API would use. So yes, I'd say it's an educated guess... (and the -sig part is just a precaution - if there is a BOM, it handles it, if there isn't, no harm done). – Tim Pietzcker Feb 08 '15 at 21:15
Okay, half convinced. And to state the obvious, for other readers: the data is coming from a random file twitter_data.txt, so we actually can not reliably know how the data in there is encoded. – Dr. Jan-Philip Gehrcke Feb 08 '15 at 21:25
I guess the problem is rather with the `print` line than `open`. – georg Feb 08 '15 at 21:28
@georg: Quite possibly, yes. `print(tweet.encode("ascii", errors="ignore").decode("ascii"))` might help... (instead of "ascii", you should probably use your terminal's encoding). – Tim Pietzcker Feb 08 '15 at 21:34
Actually, I got an error when trying to get the content with the following lines: `tweets = pd.DataFrame()` and `tweets['lang'] = map(lambda tweet: tweet['lang'], tweets_data)`. I added a print to see why I couldn't get the content, and I assumed the cause was the encoding problem that print ran into. I tried the same code today at work (on another computer), and I didn't get the encoding error... I wish I could understand why, but I will try your suggestion tonight on my home computer and keep you updated. Thank you! – Stéphanie C Feb 09 '15 at 11:30
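
One caveat on the `print(tweet.encode(...))` idea from the comments: after `json.loads`, `tweet` is a dict, which has no `.encode` method, so it has to be converted to a string first. The sketch below shows one way to do that. The helper name `console_safe` is hypothetical, and replacing unencodable characters with `?` is just one possible policy.

```python
import sys

def console_safe(obj, encoding=None):
    """Return str(obj) with characters the target encoding lacks replaced by '?'."""
    # Fall back to ASCII if stdout reports no encoding (some redirected streams).
    encoding = encoding or sys.stdout.encoding or "ascii"
    return str(obj).encode(encoding, errors="replace").decode(encoding)

tweet = {"text": "to be continued\u2026", "lang": "en"}  # stand-in for a parsed tweet

# Uses the console's own encoding, so this print cannot raise UnicodeEncodeError.
print(console_safe(tweet))

# Forcing cp850 shows what the asker's console would display.
shown = console_safe(tweet, encoding="cp850")
```

Using `errors="backslashreplace"` instead of `"replace"` would show the character as `\u2026` rather than `?`, which keeps the information at the cost of readability.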