filtering of tweets received from statuses/filter (streaming API)

Question

I have N different keywords that I am tracking (for the sake of simplicity, let N=3). So in GET statuses/filter, I will give 3 keywords in the "track" argument.

Now the tweets that I receive can match ANY of the 3 keywords. The problem is that I want to work out which tweet corresponds to which keyword, i.e. a mapping between tweets and the keyword(s) given in the "track" argument.

Apparently, there is no way to do this without doing any processing on the tweets received.

So I was wondering what the best way to do this processing is. Search for the keywords in the text of the tweet? What about case-insensitivity? What about a keyword that contains multiple words, e.g. "Katrina Kaif"?

I am currently trying to formulate some regular expression...

I was thinking the BEST way would be to use the same logic (regular expressions etc.) that the statuses/filter API itself uses. How can I find out what logic the Twitter API statuses/filter uses to match tweets to the keywords?

Advice? Help?

P.S.: I am using Python, Tweepy, regex, MongoDB/Apache S4 (for distributed computing)

For larger N, regular expressions might be quite a pain. The simplest way would be to convert the text to lower case and, for each keyword, check whether it occurs in the tweet. If you want exact matching, you could tokenize your tweets and take the intersection of your keyword set and the token set. The intersection will be the keywords matching the tweet. — cubbuk, May 17 '13 at 07:43
@cubbuk : Currently, I have N = 100. It is preferable to search for keyword only in the "text" part of tweet, right? — user1599964, May 17 '13 at 08:20
Yeah as far as I know twitter matches the text part of the tweet only, so checking the text part will be more suitable for you. — cubbuk, May 17 '13 at 08:47
@user1599964 I have the same use case. Did you settle on a solution? If so, do you mind sharing your approach? — WGriffing, May 01 '20 at 21:09
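
The tokenize-and-intersect idea from the first comment could be sketched roughly as below. This is only a minimal sketch: `match_keywords` is a hypothetical helper name, it assumes single-word keywords, and it treats any run of letters, digits or underscores as a token, which is not necessarily how Twitter tokenizes.

```python
import re


def match_keywords(text, keywords):
    """Return the set of single-word keywords that appear as whole tokens in text."""
    # Lower-case the tweet and split it into word tokens (assumed tokenization).
    tokens = set(re.findall(r"\w+", text.lower()))
    # The intersection is the set of keywords that matched this tweet.
    return tokens & {k.lower() for k in keywords}


keywords = ["obama", "cats", "python"]
print(sorted(match_keywords("Python beats Cats, says Obama!", keywords)))
# → ['cats', 'obama', 'python']
```

Because the check is a set intersection, the cost per tweet depends on the number of tokens rather than on N, which may matter once N grows to 100 or more.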

score 2 · Answer 1 · answered May 17 '13 at 11:41

The first thing that comes to mind is to create a separate stream for every keyword and start each one in its own thread, like this:

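from threading import Thread
import tweepy  # written against the Tweepy 3.x StreamListener API


class StreamListener(tweepy.StreamListener):
    def __init__(self, keyword, api=None):
        super(StreamListener, self).__init__(api)
        # Remember which keyword this stream was opened for.
        self.keyword = keyword

    def on_status(self, tweet):
        # Not reached while on_data is overridden below without calling the base class.
        print('Ran on_status')

    def on_error(self, status_code):
        print('Error: ' + repr(status_code))
        return False  # returning False stops the stream

    def on_data(self, data):
        # Every raw message on this stream belongs to self.keyword.
        print(self.keyword, data)
        print('Ok, this is actually running')


def start_stream(auth, track):
    # One blocking stream per keyword.
    tweepy.Stream(auth=auth, listener=StreamListener(track)).filter(track=[track])


# Replace the placeholder strings with your own credentials.
auth = tweepy.OAuthHandler('<consumer_key>', '<consumer_secret>')
auth.set_access_token('<key>', '<secret>')

track = ['obama', 'cats', 'python']
for item in track:
    thread = Thread(target=start_stream, args=(auth, item))
    thread.start()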

If you still want to map tweets to keywords yourself within a single stream, here's some info on how Twitter uses the track request parameter. There are some edge cases that could cause problems.

Hope that helps.
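
For the single-stream route, here is a rough sketch of matching multi-word keywords such as "Katrina Kaif". It assumes, as I understand the track documentation, that a phrase matches when all of its space-separated terms are present, in any order and ignoring case. The helper names (`tokenize`, `build_index`, `matched_phrases`) are made up for illustration, and only the tweet text is inspected, not URLs or other entity fields, so it will not reproduce Twitter's matching exactly.

```python
import re


def tokenize(text):
    """Lower-case text and return its set of word tokens (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def build_index(phrases):
    """Precompute the term set of every track phrase once, up front."""
    return {phrase: tokenize(phrase) for phrase in phrases}


def matched_phrases(text, index):
    """Return the track phrases whose terms all occur in text, in any order."""
    tokens = tokenize(text)
    # A phrase matches when its term set is a non-empty subset of the tweet's tokens.
    return [p for p, terms in index.items() if terms and terms <= tokens]


index = build_index(["Katrina Kaif", "obama", "python"])
print(matched_phrases("Kaif and KATRINA spotted with Obama", index))
# → ['Katrina Kaif', 'obama']
```

A tweet can map to several phrases at once, so the result is a list rather than a single keyword.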

The thing is that the Twitter API asks us to keep the number of INDIVIDUAL streams as low as possible, because if there are too many stream connections from the same IP/account, it will get blacklisted. See this: https://dev.twitter.com/discussions/921 — user1599964, May 17 '13 at 19:34
Yeah, right, this is not an option generally, thanks for sharing. — alecxe, May 17 '13 at 20:01
Hmm... well I guess, for now I will just have to stick to matching EACH keyword (after making it case-insensitive) against the text of EACH tweet, so as to form a mapping between tweet and keyword(s). — user1599964, May 17 '13 at 20:23

Vlox · Answer 2 · 2017-04-27T16:10:12.043

Return list of any/all 'triggered' track terms

I had a closely related issue and solved it with a list comprehension. That is, I had a list of raw tweets ('rawtweetlist') and my track filter terms ('listoftermstofind'). You can then run the following to get, for each tweet, a list of all the track terms that were found in it.

# Your track filters, upper-cased so the comparison is case-insensitive
terms = [x.upper() for x in listoftermstofind]
# All tweets, upper-cased the same way
tweets = [x.upper() for x in rawtweetlist]
# For each tweet, the list of track terms found in it as substrings
triggers = [[term for term in terms if term in tweet] for tweet in tweets]

This works well because the track filters in the API are specific (down to the character level) rather than doing any natural-language search processing. I recommend reading the API docs on filtering in detail; they go through the usage quite well: https://dev.twitter.com/streaming/overview/request-parameters
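
One caveat with the plain substring check above: it also fires inside longer words (for example "cat" inside "concatenate"). If whole-term matches are wanted instead, a variant with one precompiled, case-insensitive regular expression per term might look like the sketch below. The names `compile_terms` and `triggered` are hypothetical, and the "not next to a word character" rule is my own assumption, not Twitter's documented behaviour.

```python
import re


def compile_terms(terms):
    """Compile one case-insensitive whole-term pattern per track term."""
    # The lookarounds require that the term is not glued to another word character.
    return {
        term: re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for term in terms
    }


def triggered(tweet, patterns):
    """Return the track terms that occur as whole terms in the tweet text."""
    return [term for term, pattern in patterns.items() if pattern.search(tweet)]


patterns = compile_terms(["cat", "Katrina Kaif"])
print(triggered("Concatenate this", patterns))         # → []
print(triggered("My CAT met katrina kaif", patterns))  # → ['cat', 'Katrina Kaif']
```

Unlike a token-set approach, this keeps the words of a multi-word term adjacent and in order, so pick whichever behaviour fits your data.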
