
As the title suggests, I'm processing several million tweets, and one of the data points is whether or not any of the words exist in two different lists (each list contains about 500 words). It's understandably pretty slow, but I'll be doing this regularly so I'd like to speed it up. Any thoughts on how I could do so?

lista = ['word1', 'word2', ..., 'word500']
listb = ['word1', 'word2', ..., 'word500']

def token_list_count(df):
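    # df is assumed to be a pandas DataFrame with a 'tokens' column
    # holding the list of word tokens for each tweet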

    for i, t in df.iterrows():

        list_a = 0
        list_b = 0

        for tok in t['tokens']:
            if tok in lista: list_a += 1
            elif tok in listb: list_b += 1

        df.loc[i, 'token_count'] = int(len(t['tokens']))
        df.loc[i, 'lista_count'] = int(list_a)
        df.loc[i, 'listb_count'] = int(list_b)

        if i % 25000 == 0: print('25k more processed...')

    return df

Edit:

Input / Before:

[screenshot: input dataframe before processing]

Output / After:

[screenshot: output dataframe after processing]

Zach
  • I'm not sure how well Python optimizes something like `tok in lista`, but you might make `lista` a set instead of a list and measure whether that improves things. – President James K. Polk Apr 17 '17 at 17:47
  • Please provide sample data and expected output. See: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – root Apr 17 '17 at 17:53
  • Use the *list*.**count** method. – Prune Apr 17 '17 at 18:02
  • `x in s` has O(n) complexity for lists and O(1) for sets. As @JamesKPolk mentioned, change `lista` and `listb` to sets (https://wiki.python.org/moin/TimeComplexity); a timing sketch follows this comment thread. – VeGABAU Apr 17 '17 at 18:10
  • Also, maybe you can take a look at the `multiprocessing` package to use concurrency for your task. – VeGABAU Apr 17 '17 at 18:17
  • Thanks! Using sets helped give a 4.5% increase in speed for a 50 tweet sample and a 2.7% increase for a 100 tweet sample... So varying success, but something is better than nothing :) – Zach Apr 17 '17 at 18:31
  • I would note, however, that the total time for 50 tweets is: ~30 sec and for 100 tweets: ~59 seconds --- so I'm currently at roughly 100 tweets / min haha – Zach Apr 17 '17 at 18:32
  • I suggest you do some testing, doing the whole thing outside of Python as suggested in my answer. I bet you will be surprised how much time you can save. – Claudio Apr 17 '17 at 18:50
  • How can it be that the time for 50 tweets is ~30 sec ...??? Something is going wrong. From what I can see of what should be done, it can't take that long ... is the timing problem "hidden" in `df`? Are you using pandas? – Claudio Apr 17 '17 at 19:06
  • Please include copy-pastable example data. Images are not helpful. – root Apr 17 '17 at 19:07
  • Python-level optimizations (e.g. set vs. list) will probably not yield significant performance improvements. The biggest issue here is that you're iterating over a DataFrame, which should almost always be avoided. Using a vectorized approach is what will yield the best performance improvements. – root Apr 17 '17 at 19:09
  • Please try the change I proposed in my answer (marked with "!!!") and come back to tell whether it had an effect on speed. – Claudio Apr 17 '17 at 19:25
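To illustrate the set-versus-list point from the comments above, here is a minimal timing sketch (the 500-word list is a stand-in for the real ones; absolute numbers will vary by machine):

import timeit

lista = ['word%d' % i for i in range(500)]
seta = set(lista)

# 'in' on a list scans elements one by one (O(n) worst case);
# 'in' on a set is a hash lookup (O(1) on average).
print(timeit.timeit("'word499' in lista", globals=globals(), number=100_000))
print(timeit.timeit("'word499' in seta", globals=globals(), number=100_000))

The list lookup shown is the worst case because 'word499' sits at the end, but even an average hit costs hundreds of comparisons per token against a single hash probe for the set.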

2 Answers


To avoid iterating over the dataframe you might try the function below. The first line expands the tokens column into a dataframe where every first token goes in the first column, every second token in the second column, and so on; the total number of columns equals the largest token_count in df. From there it's easy to calculate the lista and listb counts using the .isin method and summing over each row.

The tok dataframe made from the full df might take a lot of memory, in which case it might be necessary to split the df into several chunks and process them separately (a chunking sketch follows the function below).

def token_list_count(df):
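    # Expand the lists in 'tokens' into a wide dataframe: cell (i, j)
    # holds the j-th token of row i, NaN where the tweet is shorter.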
    tok = df['tokens'].apply(pd.Series)
    df['token_count'] = tok.notnull().sum(axis=1)
    df['lista_count'] = tok.isin(lista).sum(axis=1)
    df['listb_count'] = tok.isin(listb).sum(axis=1)
    return df
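If memory does become a problem, here is a minimal chunking sketch; the chunk_size default is an arbitrary assumption, tune it to your memory budget:

import pandas as pd

def token_list_count_chunked(df, chunk_size=100_000):
    # Process the dataframe in row chunks so that the wide `tok` frame
    # built inside token_list_count stays small.
    parts = [token_list_count(df.iloc[i:i + chunk_size].copy())
             for i in range(0, len(df), chunk_size)]
    return pd.concat(parts)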

Your original function also has some room for improvement:

  • it's not necessary to iterate over the full df using iterrows() because you only need one column for the calculations, so you can use df.tokens.iteritems() instead. This saves creating a new Series object at every iteration.
  • since you're only setting new values one at a time, you can use the at indexer instead of loc. See Fast scalar value getting and setting in the pandas docs.

I don't know how much this will help but probably some =).

def token_list_count(df):
    for i, tok in df.tokens.iteritems():
        list_a = sum((t in lista) for t in tok)
        list_b = sum((t in listb) for t in tok)

        df.at[i, 'token_count'] = int(len(tok))
        df.at[i, 'lista_count'] = int(list_a)
        df.at[i, 'listb_count'] = int(list_b)

        if i % 25000 == 0: print('25k more processed...')
    return df

You could also write this without a for loop by using apply:

df['token_count'] = df.tokens.apply(len)
df['lista_count'] = df.tokens.apply(lambda tok: sum((t in lista) for t in tok))
df['listb_count'] = df.tokens.apply(lambda tok: sum((t in listb) for t in tok))
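Combining this with the set suggestion from the comments, converting the word lists to sets once keeps every membership test O(1) on average; a sketch:

seta = set(lista)
setb = set(listb)

df['token_count'] = df.tokens.apply(len)
df['lista_count'] = df.tokens.apply(lambda tok: sum(t in seta for t in tok))
df['listb_count'] = df.tokens.apply(lambda tok: sum(t in setb for t in tok))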
gereleth
  • Wow! Significant improvement. Down to ~1 sec / 100 tweets. Thank you!! Would this be thaaat much better than @root & Claudio's suggestion of vectorizing? I feel like in a way it's accomplishing something similar. Either way, thanks for your help! – Zach Apr 17 '17 at 20:45
  • @zacheism, this is something like vectorizing, yes. I expanded the answer with a few suggestions for improving your original function. If you time this version let me know how it compares to the others =). – gereleth Apr 17 '17 at 21:25
  • I appreciate the expansion! It didn't really occur to me that I'd be creating a new Series object at each iteration so thank you for that as well. As for performance (w/ 100k tweet sample) -- your version: 27.8 sec, your/mine hybrid version: 30.1 sec, list comprehension: 24.1 sec. Of course the version with my code in it is the slowest ;) thanks again for your help! – Zach Apr 17 '17 at 22:01

Using the code below should give you a tiny time advantage. NOTICE that your count of occurrences in listb is not correct if you use this code as it is: because of the elif, a token that appears in both sets is only counted toward lista.

You don't get much speedup from the code below because creating the sets takes away almost all of the time advantage coming from the faster lookups, BUT you will save time if you reuse seta and setb for other lookup loops in further code.

Maybe you can arrange to get your lists directly as sets, so that you save the time spent creating them?

lista = ['word1a', 'word2a', 'word500a']
listb = ['word1b', 'word2a', 'word500b']

seta = set(lista)
setb = set(listb)

def token_list_count(df):

    dfIterrows = df.iterrows() # TRY THIS !!! 
    for i, t in dfIterrows:    # TRY THIS !!! 
        list_a = 0
        list_b = 0
        tTokens = t['tokens']  # TRY THIS !!!
        for tok in tTokens:    # TRY THIS !!!
            if   tok in seta: list_a += 1
            elif tok in setb: list_b += 1
            # why not "if tok in setb: list_b +=1" ???
        df.loc[i, 'token_count'] = int(len(t['tokens']))
        df.loc[i, 'lista_count'] = int(list_a)
        df.loc[i, 'listb_count'] = int(list_b)

        if i % 25000 == 0: print('25k more processed...')

    return df

If speed really matters to you, I suggest you use Python only as a frame and do the lookups in appropriate external executables or via a database query, in addition to using multiprocessing for the lookups in seta and setb.
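A minimal sketch of the multiprocessing part (the helper names and pool size are assumptions, not a tested recipe; on Windows the call must be guarded by if __name__ == '__main__':):

from multiprocessing import Pool

import pandas as pd

def count_tokens(tokens):
    # seta and setb are the module-level sets defined above; worker
    # processes get them by fork or by re-importing the module.
    a = sum(tok in seta for tok in tokens)
    b = sum(tok in setb for tok in tokens)
    return len(tokens), a, b

def token_list_count_parallel(df, processes=4):
    with Pool(processes) as pool:
        rows = pool.map(count_tokens, df['tokens'])
    counts = pd.DataFrame(rows, index=df.index,
                          columns=['token_count', 'lista_count', 'listb_count'])
    for col in counts:
        df[col] = counts[col]
    return df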

Another option is to consider what root suggested in a comment: "The biggest issue here is that you're iterating over a DataFrame, which should almost always be avoided. Using a vectorized approach is what will yield the best performance improvements."

Claudio
  • why do you cast a list to a tuple before casting to a set? It's not needed – VeGABAU Apr 17 '17 at 18:15
  • @VeGABAU: thanks ... it was there because I initially used another construct ... – Claudio Apr 17 '17 at 18:18
  • You can convert your for loops to comprehensions, and can use a counter instead of adding one to the index for each call of the for loop. –  Apr 17 '17 at 20:30
  • Thanks for your help! Your most recent edit gave a pretty good boost (a 36% reduction in time) but gereleth's solution was ~98% faster. – Zach Apr 17 '17 at 20:49
  • @zacheism Thanks for the feedback. I am not a pandas expert (yet), but my feeling was that there was something to be gained, since your timings were far too slow for the small amount of data processed. Isn't it surprising that my suggestion gave a reduction in time? When I first discovered that this can be an issue in some configurations, I was quite perplexed ... With pandas' internal `.isin()` mechanism you move the processing of your data out of Python into C-compiled modules - probably you are now at the limit of what can be achieved, or at least near it. – Claudio Apr 17 '17 at 21:25