I'm getting a really weird result. An extra b character shows up in front of each tweet when I encode the text as utf-8.
Code 1
Here's the link content I'm analyzing:
https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0
new_tweets = 'content in the link'
The following code now pulls out the tweets just as I want them:
outtweets = [[tweet.text] for tweet in new_tweets]
print(outtweets)
Output:
[['@sicaleigh That is false.'], ['RT @ArgonautNews: @mikebonin wants more cops on patrol. #LAPD'], ["RT @LAHomelessCount: We've exceeded 7,000 registered volunteers and the # is climbing! Thx all. Let's do this and help end homelessness. #t…"]]
(links deleted because SE requires it)
Problem
The problem is that this code doesn't work when parsing many accounts. As far as I can tell, the tweet text has to be encoded as utf-8 for those, though I don't know why.
Code 2
Here's my modified code to do that
outtweets = [[tweet.text.encode("utf-8")] for tweet in new_tweets]
print(outtweets)
Problem
But this results in a b being put in front of each of my tweets, which I don't want.
[[b'@sicaleigh That is false.'], [b'RT @ArgonautNews: @mikebonin wants more cops on patrol. #LAPD'], [b"RT @LAHomelessCount: We've exceeded 7,000 registered volunteers and the # is climbing! Thx all. Let's do this and help end homelessness. #t\xe2\x80\xa6"]]
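For reference, a minimal sketch that seems to reproduce this without the Twitter API. FakeTweet is a hypothetical stand-in I made up (only the .text attribute is modelled), and the tweet texts are shortened. It suggests that encode returns a bytes object, and that the b is only how Python 3 displays bytes inside a list rather than a character in the data itself:

```python
from collections import namedtuple

# Hypothetical stand-in for the real tweet objects; only .text is modelled.
FakeTweet = namedtuple("FakeTweet", ["text"])

new_tweets = [
    FakeTweet("@sicaleigh That is false."),
    FakeTweet("Let's do this and help end homelessness. #t\u2026"),
]

# Same pattern as Code 2: every item becomes a bytes object.
encoded = [[tweet.text.encode("utf-8")] for tweet in new_tweets]

# Decoding with the same codec turns the bytes back into ordinary strings.
decoded = [[item.decode("utf-8") for item in row] for row in encoded]

print(type(encoded[0][0]))  # <class 'bytes'>
print(encoded[0][0][:1])    # b'@'  (the first byte is '@', not 'b')
print(decoded)
```

If that is right, the b cannot be sliced off, because it is not in the data; it only appears when the list is printed.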
My Question:
Why is this character being added? How do I get rid of it?
In some cases, it is not just a b: the tweet is also wrapped in double quotation marks "" instead of single ones. So I'm not sure that just removing the first character will work.
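On the quotation marks, a small sketch, assuming they are just Python's repr switching to double quotes whenever the text contains an apostrophe. The two byte strings below are shortened from the output above:

```python
plain = b"That is false."
with_apostrophe = b"We've exceeded 7,000 registered volunteers"

# repr() is what print() uses for items inside a list.
print(repr(plain))            # b'That is false.'
print(repr(with_apostrophe))  # b"We've exceeded 7,000 registered volunteers"

# Neither the b nor the surrounding quotes are part of the data,
# so decoding returns the text unchanged.
text = with_apostrophe.decode("utf-8")
print(text)                   # We've exceeded 7,000 registered volunteers
```

So stripping the first character would remove the first real character of the tweet, not the b.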