strangely added character to my strings when using .encode("utf-8")

Question

I'm getting a really weird result: an extra character is added to my strings when I encode them as UTF-8.

Code 1

Here's the link content I'm analyzing:

https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0

new_tweets = 'content in the link'

The following code now pulls out the tweets just as I want them:

outtweets = [[tweet.text] for tweet in new_tweets]
print(outtweets)

Output:

[['@sicaleigh That is false.'], ['RT @ArgonautNews: @mikebonin wants more cops on patrol. #LAPD'], ["RT @LAHomelessCount: We've exceeded 7,000 registered volunteers and the # is climbing! Thx all. Let's do this and help end homelessness. #t…"]]

(links deleted because SE requires it)

Problem

The problem is that this code fails when I parse many accounts. Apparently the text has to be encoded as UTF-8 first.

Code 2

Here's my modified code to do that:

outtweets = [[tweet.text.encode("utf-8")] for tweet in new_tweets]
print(outtweets)

Problem

But this results in a weird set of b's being put in front of my tweets that I don't want.

[[b'@sicaleigh That is false.'], [b'RT @ArgonautNews: @mikebonin wants more cops on patrol. #LAPD'], [b"RT @LAHomelessCount: We've exceeded 7,000 registered volunteers and the # is climbing! Thx all. Let's do this and help end homelessness. #t\xe2\x80\xa6"]]

My Question:

Why is this character being added? How do I get rid of it?

In some cases, it is not just a b: the text is also wrapped in double quotation marks "" instead of single ones, so I'm not sure that just removing the first character will work.
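
A minimal sketch of what the output above shows, assuming Python 3: the `b` is not a character inside the string. `.encode("utf-8")` returns a `bytes` object, and printing a list shows the `repr()` of each element, which for `bytes` is written as `b'...'` (or `b"..."` when the content contains an apostrophe). The sample strings below are stand-ins for `tweet.text`, not the real data.

```python
# Stand-in values for tweet.text (assumed for illustration, not the real tweets).
plain = "That is false."
with_apostrophe = "We've exceeded 7,000"
with_ellipsis = "end homelessness. #t\u2026"

# str.encode() returns a bytes object; nothing is added to the text itself.
encoded = plain.encode("utf-8")
print(type(encoded).__name__)     # bytes
print(len(plain), len(encoded))   # 14 14 (same length: no extra character)

# Printing a list shows the repr() of each element, hence the b prefix.
print([encoded])                              # [b'That is false.']
print(repr(with_apostrophe.encode("utf-8")))  # b"We've exceeded 7,000"
print(repr(with_ellipsis.encode("utf-8")))    # b'end homelessness. #t\xe2\x80\xa6'

# bytes.decode() reverses the step and gives back the original str.
round_trip = with_ellipsis.encode("utf-8").decode("utf-8")
print(round_trip == with_ellipsis)            # True
```

So stripping the first character is not the right fix: the prefix and the quotes are only how Python displays a `bytes` value, and decoding (or never encoding) yields plain text again.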

`encode()` converts a `UNICODE` string to `UTF-8` bytes; `decode()` converts `UTF-8` bytes back to a `UNICODE` string. — furas, Jan 28 '17 at 22:10
If you use `print()` to print a list/dictionary, it uses `repr()` to create strings that are useful for debugging. Use `join()` to create one string before you `print()` it. — furas, Jan 28 '17 at 22:11
@furas So the duplicate doesn't tell me how to remove it. Like I just don't want it in my text output. If you explained how to do that, I didn't follow. Could you elaborate? — Stan Shunpike, Jan 28 '17 at 22:18
`outtweets = [ tweet.text for tweet in new_tweets]` without the internal `[ ]`, and later `print( "\n".join(outtweets) )`, or use a `for` loop: `for txt in outtweets: print(txt)` — furas, Jan 28 '17 at 22:25
But, as I mentioned, without doing the UTF-8, I run into errors. When it tries to download characters from the web, it gives back `UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>` — Stan Shunpike, Jan 28 '17 at 22:32
Do you use `Windows`? Then it is probably a problem only when you print it, because your console/cmd.exe doesn't use `'utf-8'`. See: http://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8 — furas, Jan 28 '17 at 22:36
BTW: always put the **FULL** error message **IN THE QUESTION**. It has a lot of useful information. — furas, Jan 28 '17 at 22:38
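
The suggestions in the comments above can be sketched roughly as follows, assuming Python 3: keep the tweets as `str`, print them one per line instead of printing the list, and replace only the characters the console cannot represent. `console_safe` is a hypothetical helper written for this illustration, and the sample strings stand in for `tweet.text`.

```python
import sys

# Stand-in values for tweet.text (assumed for illustration, not the real tweets).
texts = ["@sicaleigh That is false.", "Let's do this. #t\u2026 \U0001F600"]


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to UTF-8 if stdout reports no encoding (e.g. when redirected).
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


# Join the strings (or loop over them) so print() shows the text, not repr().
print("\n".join(console_safe(text) for text in texts))

# With an ASCII-only target, unsupported characters become '?'.
print(console_safe("caf\u00e9 \U0001F600", "ascii"))  # caf? ?
```

The data itself stays intact as `str`; only the copy sent to the console is altered. On Python 3.7 or later, calling `sys.stdout.reconfigure(encoding="utf-8")` may be an alternative when the terminal can actually display UTF-8.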
