How to tokenize a list of tweets without these errors?

Question

I'm currently working on a program to retrieve a list of tweets on a given topic. So far I have managed to retrieve them and save them in a JSON file, which works fine.

The problem comes when I try to "tokenize" this list of tweets.

I get the following error:

Traceback (most recent call last):
  File "C:\Users\TheoLC\Desktop\python\twitter_search\collect+200tw.py", line 77, in <module>
    tweet_token = tweet['text'].tokenize()
TypeError: string indices must be integers

And this is the code:

with open("%s_tweets.json" % search_word, 'a') as f:
    for tweet in new_tweets:
        json.dump(tweet._json, f, indent=4)


with open("%s_tweets.json" % search_word, 'r+') as f:
    for tweet in f:
        tweet_token = tweet['text'].tokenize()
        print('Tweet tokenize : ' + tweet_token)

I also have a second problem:

In my program I translate the search word into several languages in order to collect as many tweets as possible in my JSON file.

The problem is that, instead of getting a JSON file with tweets in several languages, I would like all tweets to be translated into English.

So I tried to apply the reverse process as follows:

for tweet in new_tweets_fi:
    tweet['text'] = translator.translate(tweet['text'], src='fi', dest='en')
    print("The Finnish tweets have been translated")

for tweet in new_tweets_fr:
    tweet['text'] = translator.translate(tweet['text'], src='fr', dest='en')
    print("The French tweets have been translated")

And here is the error that comes back:


Traceback (most recent call last):
  File "C:\Users\TheoLC\Desktop\python\twitter_search\collect+200tw.py", line 52, in <module>
    tweet['text'] = translator.translate(tweet['text'], src='fi', dest='en')
TypeError: 'Status' object is not subscriptable

Many thanks to anyone who can help.

score 0 · Accepted Answer · answered May 06 '19 at 21:23

Both errors are related, and have to do with the fact that you're trying to access tweet['text'].

When you iterate over a file object, each item is a string. (More specifically, a line of text from the file.) So in the first code sample, tweet is a string, and a string can only be indexed with integers. That is why tweet['text'] raises "TypeError: string indices must be integers".

with open("%s_tweets.json" % search_word, 'r+') as f:
    for tweet in f:
        # do stuff with tweet (a string: one line of the file, not a parsed tweet)
        pass
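One possible way around this, sketched under a few assumptions: write each tweet as a single JSON object per line (no indent), so that every line read back can be parsed with json.loads into a dictionary. The sample_tweets list, the file name, and the save_tweets / load_tweets helpers are hypothetical stand-ins for illustration; in the question the dictionaries would come from tweet._json. Note also that Python strings have no .tokenize() method, so the sketch uses str.split() as a simple whitespace tokenizer; a dedicated tokenizer could be swapped in.

```python
import json
import os
import tempfile

# Hypothetical stand-ins for the dicts that tweet._json would provide.
sample_tweets = [
    {"id": 1, "text": "Hello world from Twitter"},
    {"id": 2, "text": "Tokenizing tweets is fun"},
]


def save_tweets(path, tweets):
    """Write one JSON object per line so each line parses on its own."""
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")


def load_tweets(path):
    """Parse each non-empty line of the file back into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical file location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "example_tweets.json")
save_tweets(path, sample_tweets)

loaded = load_tweets(path)
for tweet in loaded:
    # tweet is now a dict, so tweet["text"] works.
    # split() breaks the text on whitespace; it is a stand-in tokenizer.
    tokens = tweet["text"].split()
    print("Tweet tokens:", tokens)
```

The one-object-per-line layout matters here: with indent=4, a single tweet spans many lines, so no individual line is valid JSON on its own.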

In the second sample, I'm not sure what kind of data structure new_tweets_fi and new_tweets_fr are, but it looks like iterating over them yields Status objects. I'm also not sure what that object looks like, but whatever it is, it does not support indexing with square brackets the way a string or a dictionary does. (See "In Python, what does it mean if an object is subscriptable or not?")
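As a rough illustration of the difference, here is a minimal sketch that assumes the Status objects expose the tweet body as a .text attribute (the question's use of tweet._json suggests a tweepy-style object, but that is an assumption). FakeStatus and fake_translate are hypothetical stand-ins so the example runs on its own; they are not the real Status class or the real translator.translate call.

```python
from dataclasses import dataclass


@dataclass
class FakeStatus:
    """Hypothetical stand-in for the Status objects in new_tweets_fi."""
    text: str


def fake_translate(text, src, dest):
    """Hypothetical stand-in for translator.translate(); uses a tiny lookup table."""
    table = {("fi", "en"): {"hei maailma": "hello world"}}
    return table.get((src, dest), {}).get(text, text)


new_tweets_fi = [FakeStatus("hei maailma")]

translated = []
for tweet in new_tweets_fi:
    # Attribute access (tweet.text) works where subscripting (tweet['text']) does not.
    translated.append(fake_translate(tweet.text, src="fi", dest="en"))

print(translated)
```

Collecting the results in a separate list avoids assigning back into the object, which may or may not be appropriate for the real Status class.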
