I have a large DataFrame with 100 million records, and I am trying to optimize the run time by using numpy.

Sample data:

import numpy as np
import pandas as pd

dat = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'item': ['beauty', 'beauty', 'shoe', 'shoe', 'handbag'],
                    'mylist': [['beauty', 'something'], ['shoe', 'something', 'else'],
                               ['shoe', 'else', 'some'], ['else'], ['some', 'thing', 'else']]})


dat

    ID  item    mylist
0   1   beauty  [beauty, something]
1   2   beauty  [shoe, something, else]
2   3   shoe    [shoe, else, some]
3   4   shoe    [else]
4   5   handbag [some, thing, else]

I am trying to filter the rows where the item column's string exists in the mylist column, using:

dat[np.where(dat['item'].isin(dat['mylist']), True, False)]

But I am not getting any output: all of the above values come back as False. I could get the required results using:

dat[dat.apply(lambda row : row['item'] in row['mylist'], axis = 1)]

    ID  item    mylist
0   1   beauty  [beauty, something]
2   3   shoe    [shoe, else, some]

But as numpy operations are faster, I am trying to use np.where. Could someone please let me know how to fix the code?

  • There's no way to vectorise this operation with numpy. Nested lists in a column are not _really_ the best use case for pandas. A comprehension with zip ([like here](/a/69771593/15497888)) is probably the best option, but it's still going to be rather slow at that scale. If possible, restructuring the data would be a better long-term solution. – Henry Ecker Oct 28 '22 at 04:03

1 Answer

You can't vectorize easily with a Series of lists; isin tests each string against the whole collection of list objects in mylist, never against the list in the same row, which is why everything comes back False. You can use a list comprehension instead, which is a bit faster than apply (it iterates over the plain Python lists directly rather than building a Series for every row):

out = dat.loc[[item in l for item,l in zip(dat['item'], dat['mylist'])]]

A vectorial solution would be:

out = dat.loc[dat.explode('mylist').eval('item == mylist').groupby(level=0).any()]

# or
out = dat.explode('mylist').query('item == mylist').groupby(level=0).first()

# or, if you are sure that there is at most 1 match
out = dat.explode('mylist').query('item == mylist')

These work because explode keeps the original index, so groupby(level=0) can map the exploded rows back to their source rows. But the explode step might be a bottleneck; you should test on your real data.

Output:

   ID    item               mylist
0   1  beauty  [beauty, something]
2   3    shoe   [shoe, else, some]
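
If the explode step does turn out to be the bottleneck at your scale, here is a possible numpy-level sketch (untested at 100M rows, and it assumes the lists hold plain strings; note that concatenating that many small lists is itself not free). It flattens all the lists once and reduces the elementwise comparison back to one boolean per row:

import numpy as np

lengths = dat['mylist'].str.len().to_numpy()        # number of elements per row
flat = np.concatenate(dat['mylist'].to_numpy())     # all list elements, flattened
items = np.repeat(dat['item'].to_numpy(), lengths)  # each row's item, aligned with its elements
row_ids = np.repeat(np.arange(len(dat)), lengths)   # source row of each flattened element

# a row qualifies if it has at least one elementwise match
mask = np.bincount(row_ids, weights=(flat == items), minlength=len(dat)) > 0
out = dat[mask]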

Timing

I ran a quick test on 100k rows, built with df = pd.concat([dat]*20000, ignore_index=True) (a sketch to reproduce the comparison follows this list):

  • the list comprehension is the fastest (~20ms)
  • explode approaches are between 60-90ms (explode itself requiring 40ms)
  • apply is by far the slowest (almost 600ms)
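
A rough sketch of how this comparison can be reproduced (%timeit is an IPython magic, so this assumes IPython or Jupyter; exact numbers will vary by machine):

import pandas as pd

# 100k rows built by tiling the 5-row sample
df = pd.concat([dat] * 20000, ignore_index=True)

# list comprehension
%timeit df.loc[[item in l for item, l in zip(df['item'], df['mylist'])]]
# explode-based approach
%timeit df.loc[df.explode('mylist').eval('item == mylist').groupby(level=0).any()]
# apply
%timeit df.apply(lambda row: row['item'] in row['mylist'], axis=1)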
  • Thanks, will test for speed, but for my understanding: is `numpy` vectorization not possible with list columns? – Karthik S Oct 28 '22 at 04:03
  • I added a vectorial alternative, but it might not be faster (it requires exploding, comparing, grouping and aggregating); you can try and let me know – mozway Oct 28 '22 at 04:07