
As the title suggests, I'm processing several million tweets, and one of the data points is whether or not any of the words exist in two different lists (each list contains about 500 words). It's understandably pretty slow, but I'll be doing this regularly so I'd like to speed it up. Any thoughts on how I could do so?

lista = ['word1', 'word2', ..., 'word500']
listb = ['word1', 'word2', ..., 'word500']

def token_list_count(df):
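    # df is assumed to be a pandas DataFrame with a 'tokens' column
    # holding the list of word tokens for each tweet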

    for i, t in df.iterrows():

        list_a = 0
        list_b = 0

        for tok in t['tokens']:
            if tok in lista: list_a += 1
            elif tok in listb: list_b += 1

        df.loc[i, 'token_count'] = int(len(t['tokens']))
        df.loc[i, 'lista_count'] = int(list_a)
        df.loc[i, 'listb_count'] = int(list_b)

        if i % 25000 == 0: print('25k more processed...')

    return df

Edit:

Input / Before:

[screenshot: input dataframe before processing]

Output / After:

[screenshot: output dataframe after processing]

Zach
  • I'm not sure how well Python optimizes something like `tok in lista`, but you might make `lista` a set instead of a list and measure whether that improves things. – President James K. Polk Apr 17 '17 at 17:47
  • Please provide sample data and expected output. See: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – root Apr 17 '17 at 17:53
  • Use the *list*.**count** method. – Prune Apr 17 '17 at 18:02
  • `x in s` has O(n) complexity for lists and O(1) for sets. As @JamesKPolk mentioned, change `lista` and `listb` to sets (https://wiki.python.org/moin/TimeComplexity); a timing sketch follows this comment thread. – VeGABAU Apr 17 '17 at 18:10
  • Also, maybe you can take a look at the `multiprocessing` package to use concurrency for your task. – VeGABAU Apr 17 '17 at 18:17
  • Thanks! Using sets helped give a 4.5% increase in speed for a 50 tweet sample and a 2.7% increase for a 100 tweet sample... So varying success, but something is better than nothing :) – Zach Apr 17 '17 at 18:31
  • I would note, however, that the total time for 50 tweets is: ~30 sec and for 100 tweets: ~59 seconds --- so I'm currently at roughly 100 tweets / min haha – Zach Apr 17 '17 at 18:32
  • I suggest you do some testing, doing the whole thing outside of Python as suggested in my answer. I bet you will be surprised how much time you can save. – Claudio Apr 17 '17 at 18:50
  • How can it be that the time for 50 tweets is ~30 sec ...??? Something is going wrong. From what I can see of what should be done, it can't take that long ... is the timing problem "hidden" in `df`? Are you using pandas? – Claudio Apr 17 '17 at 19:06
  • Please include copy-pastable example data. Images are not helpful. – root Apr 17 '17 at 19:07
  • Python-level optimizations (e.g. set vs. list) will probably not yield significant performance improvements. The biggest issue here is that you're iterating over a DataFrame, which should almost always be avoided. Using a vectorized approach is what will yield the best performance improvements. – root Apr 17 '17 at 19:09
  • Please try the change I proposed in my answer (marked with "!!!") and come back to tell whether it had an effect on speed. – Claudio Apr 17 '17 at 19:25
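To illustrate the set-versus-list point from the comments above, here is a minimal timing sketch (the 500-word list is a stand-in for the real ones; absolute numbers will vary by machine):

import timeit

lista = ['word%d' % i for i in range(500)]
seta = set(lista)

# 'in' on a list scans elements one by one (O(n) worst case);
# 'in' on a set is a hash lookup (O(1) on average).
print(timeit.timeit("'word499' in lista", globals=globals(), number=100_000))
print(timeit.timeit("'word499' in seta", globals=globals(), number=100_000))

The list lookup shown is the worst case because 'word499' sits at the end, but even an average hit costs hundreds of comparisons per token against a single hash probe for the set.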

2 Answers


To avoid iterating over the dataframe you might try the function below. The first line expands the tokens column into a dataframe where every first token goes in the first column, every second token in the second column, and so on; the total number of columns equals the largest token_count in df. From there it's easy to calculate the lista and listb counts using the .isin method and summing over each row.

The tok dataframe made from the full df might take a lot of memory, in which case it might be necessary to split the df into several chunks and process them separately (a chunking sketch follows the function below).

def token_list_count(df):
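    # Expand the lists in 'tokens' into a wide dataframe: cell (i, j)
    # holds the j-th token of row i, NaN where the tweet is shorter.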
    tok = df['tokens'].apply(pd.Series)
    df['token_count'] = tok.notnull().sum(axis=1)
    df['lista_count'] = tok.isin(lista).sum(axis=1)
    df['listb_count'] = tok.isin(listb).sum(axis=1)
    return df
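If memory does become a problem, here is a minimal chunking sketch; the chunk_size default is an arbitrary assumption, tune it to your memory budget:

import pandas as pd

def token_list_count_chunked(df, chunk_size=100_000):
    # Process the dataframe in row chunks so that the wide `tok` frame
    # built inside token_list_count stays small.
    parts = [token_list_count(df.iloc[i:i + chunk_size].copy())
             for i in range(0, len(df), chunk_size)]
    return pd.concat(parts)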

Your original function also has some room for improvement:

  • it's not necessary to iterate over the full df using iterrows() because you only need one column for the calculations, so you can use df.tokens.iteritems() instead. This saves creating a new Series object at every iteration.
  • since you're only setting new values one at a time, you can use the at indexer instead of loc. See Fast scalar value getting and setting in the pandas docs.

I don't know how much this will help but probably some =).

def token_list_count(df):
    for i, tok in df.tokens.iteritems():
        list_a = sum((t in lista) for t in tok)
        list_b = sum((t in listb) for t in tok)

        df.at[i, 'token_count'] = int(len(tok))
        df.at[i, 'lista_count'] = int(list_a)
        df.at[i, 'listb_count'] = int(list_b)

        if i % 25000 == 0: print('25k more processed...')
    return df

You could also write this without a for loop by using apply:

df['token_count'] = df.tokens.apply(len)
df['lista_count'] = df.tokens.apply(lambda tok: sum((t in lista) for t in tok))
df['listb_count'] = df.tokens.apply(lambda tok: sum((t in listb) for t in tok))
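Combining this with the set suggestion from the comments, converting the word lists to sets once keeps every membership test O(1) on average; a sketch:

seta = set(lista)
setb = set(listb)

df['token_count'] = df.tokens.apply(len)
df['lista_count'] = df.tokens.apply(lambda tok: sum(t in seta for t in tok))
df['listb_count'] = df.tokens.apply(lambda tok: sum(t in setb for t in tok))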
gereleth
  • Wow! Significant improvement. Down to ~1 sec / 100 tweets. Thank you!! Would this be thaaat much better than @root & Claudio's suggestion of vectorizing? I feel like in a way it's accomplishing something similar. Either way, thanks for your help! – Zach Apr 17 '17 at 20:45
  • @zacheism, this is something like vectorizing, yes. I expanded the answer with a few suggestions for improving your original function. If you time this version let me know how it compares to the others =). – gereleth Apr 17 '17 at 21:25
  • I appreciate the expansion! It didn't really occur to me that I'd be creating a new Series object at each iteration so thank you for that as well. As for performance (w/ 100k tweet sample) -- your version: 27.8 sec, your/mine hybrid version: 30.1 sec, list comprehension: 24.1 sec. Of course the version with my code in it is the slowest ;) thanks again for your help! – Zach Apr 17 '17 at 22:01

Using the code below should give you a tiny time advantage. NOTICE that your count of occurrences in listb is not correct if you use this code as it is: because of the elif, a token that appears in both sets is only counted toward lista.

You don't get much speedup from the code below because creating the sets takes away almost all of the time advantage coming from the faster lookups, BUT you will save time if you reuse seta and setb for other lookup loops in further code.

Maybe you can arrange to get your lists directly as sets, so that you save the time spent creating them?

lista = ['word1a', 'word2a', 'word500a']
listb = ['word1b', 'word2a', 'word500b']

seta = set(lista)
setb = set(listb)

def token_list_count(df):

    dfIterrows = df.iterrows() # TRY THIS !!! 
    for i, t in dfIterrows:    # TRY THIS !!! 
        list_a = 0
        list_b = 0
        tTokens = t['tokens']  # TRY THIS !!!
        for tok in tTokens:    # TRY THIS !!!
            if   tok in seta: list_a += 1
            elif tok in setb: list_b += 1
            # why not "if tok in setb: list_b +=1" ???
        df.loc[i, 'token_count'] = int(len(t['tokens']))
        df.loc[i, 'lista_count'] = int(list_a)
        df.loc[i, 'listb_count'] = int(list_b)

        if i % 25000 == 0: print('25k more processed...')

    return df

If speed really matters to you, I suggest you use Python only as a frame and do the lookups in appropriate external executables or via a database query, in addition to using multiprocessing for the lookups in seta and setb.
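A minimal sketch of the multiprocessing part (the helper names and pool size are assumptions, not a tested recipe; on Windows the call must be guarded by if __name__ == '__main__':):

from multiprocessing import Pool

import pandas as pd

def count_tokens(tokens):
    # seta and setb are the module-level sets defined above; worker
    # processes get them by fork or by re-importing the module.
    a = sum(tok in seta for tok in tokens)
    b = sum(tok in setb for tok in tokens)
    return len(tokens), a, b

def token_list_count_parallel(df, processes=4):
    with Pool(processes) as pool:
        rows = pool.map(count_tokens, df['tokens'])
    counts = pd.DataFrame(rows, index=df.index,
                          columns=['token_count', 'lista_count', 'listb_count'])
    for col in counts:
        df[col] = counts[col]
    return df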

Another option is to consider what root suggested in a comment: "The biggest issue here is that you're iterating over a DataFrame, which should almost always be avoided. Using a vectorized approach is what will yield the best performance improvements."

Claudio
  • why do you cast a list to a tuple before casting to a set? It's not needed – VeGABAU Apr 17 '17 at 18:15
  • @VeGABAU: thanks ... it was there because I initially used another construct ... – Claudio Apr 17 '17 at 18:18
  • You can convert your for loops to comprehensions, and can use a counter instead of adding one to the index for each call of the for loop. –  Apr 17 '17 at 20:30
  • Thanks for your help! Your most recent edit gave a pretty good boost (a 36% reduction in time) but gereleth's solution was ~98% faster. – Zach Apr 17 '17 at 20:49
  • @zacheism Thanks for the feedback. I am not a pandas expert (yet), but my feeling was that there was something to be gained, since your timings were far too slow for the small amount of data processed. Isn't it surprising that my suggestion gave a reduction in time? When I first discovered that this can be an issue in some configurations, I was quite perplexed ... With pandas' internal `.isin()` mechanism you move the processing of your data out of Python into C-compiled modules - probably you are now at the limit of what can be achieved, or at least near it. – Claudio Apr 17 '17 at 21:25