
I have this Pandas DataFrame that has a column with lists:

>>> import pandas as pd
>>> df = pd.DataFrame({'m': [[1,2,3], [5,3,2], [2,5], [3,8,1], [9], [2,6,3]]})
>>> df
           m
0  [1, 2, 3]
1  [5, 3, 2]
2     [2, 5]
3  [3, 8, 1]
4        [9]
5  [2, 6, 3]

I want to count the number of times a list v = [2, 3] is contained in the lists of the DataFrame, so in this example the correct answer would be 3. Now this is just an example: in my actual data, df['m'] can contain more than 9 million rows, and the lists are actually lists of strings with up to about 20 elements. Some more details, in case it matters: the elements of v contain no duplicates and neither do the lists of m, so they could be sets instead of lists.

The first iteration of my program iterated over each row, checked all(e in df['m'][i] for e in v), and incremented a counter whenever that was True. But as addressed in many SO questions and blog posts, iterating over the rows of a DataFrame is slow and there are usually much faster alternatives.
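For reference, that first version looked roughly like this (a sketch of the loop described above, with v = [2, 3]):

v = [2, 3]
count = 0
# Naive row-by-row loop: correct, but slow once df['m'] has millions of rows.
for i in range(len(df)):
    if all(e in df['m'][i] for e in v):
        count += 1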

So for my next iteration I added a column to the DataFrame that contains a copy of the list v:

>>> df['V'] = [[2, 3]] * len(df)
>>> df
        V          m
0  [2, 3]  [1, 2, 3]
1  [2, 3]  [5, 3, 2]
2  [2, 3]     [2, 5]
3  [2, 3]  [3, 8, 1]
4  [2, 3]        [9]
5  [2, 3]  [2, 6, 3]

and wrote a helper function that returns the same containment boolean as before:

def all_helper(l1, l2):
    # True if every element of l2 is also in l1
    return all(e in l1 for e in l2)

which I can then use with np.vectorize to add a column with the boolean value:

import numpy as np

df['bool'] = np.vectorize(all_helper)(df['m'], df['V'])

Lastly, I calculate the sum of these booleans with a simple df['bool'].sum()

I also tried to use .apply():

df['bool'] = df.apply(lambda row: all(w in row['m'] for w in v), axis=1)
count = df['bool'].sum()

but this was slower than the vectorisation.

Now these methods work, and the vectorisation is much faster than the initial approach, but it feels a bit clunky (creating a column of identical values, using a helper function in such a way). So my question, since performance is key: is there a better/faster way to count the number of times a list is contained in a column of lists? Since the lists contain no duplicates, perhaps a check like len(union(m, v)) == len(m) per row could work, but I don't know how to do that efficiently, or whether it's the best solution.
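To make that last idea concrete, here is a minimal sketch of the union-length check (only an illustration of the idea, not a claim that it's faster):

# v is a subset of m exactly when adding v to m adds nothing new, i.e. the
# union is no bigger than m itself (valid because m contains no duplicates).
v_set = set(v)
count = df['m'].apply(lambda m: len(v_set | set(m)) == len(m)).sum()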

Edit: Since somebody asked, here's an example with strings instead of integers:

>>> df = pd.DataFrame({'m': [["aa","ab","ac"], ["aa","ac","ad"], ["ba","bb"], ["ac","ca","cc"], ["aa"], ["ac","da","aa"]]})
>>> v = ["aa", "ac"]
>>> df
              m
0  [aa, ab, ac]
1  [aa, ac, ad]
2      [ba, bb]
3  [ac, ca, cc]
4          [aa]
5  [ac, da, aa]

>>> count_occurrence(df, v)
3
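(count_occurrence isn't spelled out here; it's just a thin wrapper around whichever implementation is under test. One possible definition, based on my np.vectorize approach above, would be:)

import numpy as np

def count_occurrence(df, v):
    # Count the rows of df['m'] that contain every element of v,
    # using the np.vectorize approach from above.
    contains_all = np.vectorize(lambda m: all(e in m for e in v))
    return int(contains_all(df['m']).sum())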

But if you want a more extensive DataFrame, you can generate it with this:

import string

import numpy as np
import pandas as pd

n = 10000
letters = list(string.ascii_lowercase)[:5]  # 'a' through 'e'
df = pd.DataFrame({'m': [list({''.join(np.random.choice(letters, np.random.randint(3, 4)))
                               for _ in range(np.random.randint(1, 10))})
                         for _ in range(n)]})
v = ["abc", "cde"]
print(count_occurrence(df, v))

Edit: Neither Divakar's nor Vaishali's solution was faster than the one that uses np.vectorize. I wonder if anyone can beat it.

Jon Clements came up with a solution that is roughly 30% faster and much cleaner: df.m.apply(set(v).issubset).sum(). I'll keep looking for faster implementations, but this is a step in the right direction.

Jurgy

2 Answers


You can utilise DataFrame.apply along with the built-in set.issubset method and then .sum(), all of which operate at a lower level (normally C level) than their Python equivalents.

subset_wanted = {2, 3}
count = df.m.apply(subset_wanted.issubset).sum()

I can't see shaving more time off that, short of writing a custom C-level function: effectively a custom sum with a subset check that determines 0/1 on a row-by-row basis. And at that point, you could have run this thousands upon thousands of times anyway.
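If you want to compare approaches yourself, a quick harness along these lines works (a sketch; it assumes df and v come from the 10,000-row generator in the question):

import timeit

import numpy as np

subset_wanted = set(v)
contains_all = np.vectorize(lambda m: all(e in m for e in v))

print(timeit.timeit(lambda: df.m.apply(subset_wanted.issubset).sum(), number=100))
print(timeit.timeit(lambda: contains_all(df['m']).sum(), number=100))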

Jon Clements
  • Got me thinking though... might be nice to have something like `count = df.m.count_if(...)` in pandas.... or possibly `count_where` sticking with naming conventions. – Jon Clements Nov 22 '17 at 09:56
  • Unless the `count_where` function offers some optimization, be it speed- or memory-wise, I don't see a reason to add this method to the package, since as you've shown it can already be done very concisely. – Jurgy Nov 22 '17 at 10:10
  • @Jurgy I imagine it'd be along the same lines as how the builtin `map` / `filter` works... providing a one-off reference to an equivalently low-level bound function *might* work (as opposed to passing Python-level or user-level ufuncs). I admit to not thinking it through thoroughly, but if passed something like `str.lower`, `set(something).some_method` etc... it might just work – Jon Clements Nov 22 '17 at 10:17
  • In principle, since the end result will always be a scalar, you don't need an entire Series/ndarray of the mask/boolean results to sum, so you move the computation of that into the sum/count itself - while I *have* to iterate over the series/ndarray at whatever level to get the mask returned, if I *know* I only want an aggregate value, I can compute both in that loop itself (see the sketch after these comments) – Jon Clements Nov 22 '17 at 10:21
  • The difference between `select count(*) from table where something...` and `select count(*) from (select 1 from something where condition=...)` – Jon Clements Nov 22 '17 at 10:22
  • You've convinced me. If you're really confident in your idea you can implement it and create a pull-request for the pandas-dev github repo ;) – Jurgy Nov 22 '17 at 10:33
  • @Jurgy heck no! While I've got over 10 years of Python under my belt - and am fairly au fait with its inner workings - I've only really used pandas for a couple of years and still consider myself a beginner (and never delved deep into how a lot of things work at an implementation level). Just one of those "off the top of my head" things - fairly sure the development crew would have already thought about it/dismissed it. There's a friendly crowd in [the Python chatroom](https://chat.stackoverflow.com/rooms/6/python) including a few very experienced numpy/pandas people... Might be worth saying hi. – Jon Clements Nov 22 '17 at 10:40
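To illustrate that last point, a single-pass Python-level version that skips the intermediate boolean Series might look like this (a sketch of the trade-off being discussed, not a recommendation):

# The mask and the count are computed in one pass, so no boolean Series is
# materialised - but the loop now runs at Python level rather than C level.
subset_wanted = {2, 3}
count = sum(subset_wanted.issubset(m) for m in df.m)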

Since you are looking for more set-like behavior:

(df.m.apply(lambda x: set(x).intersection(set([2,3]))) == set([2,3])).sum()

Returns

3
Vaishali
  • Thanks, but according to blogs like [this one](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6), looping with `apply` is slower than vectorization, which is why I went for the approach I mentioned. I am not at work right now, so I can't check if your solution is faster. One last thing: shouldn't that last `set([2,3])` after the equals operator be `set(x)`? – Jurgy Nov 21 '17 at 18:17
  • Back at work; this approach is roughly 60% slower than my vectorize approach. On a dataset of 10,000 rows, my approach took 0.7 seconds and yours took 1.7 seconds. – Jurgy Nov 22 '17 at 08:43
  • 1
    @Jurgy what about something like `df.m.apply({2, 3}.issubset).sum()` – Jon Clements Nov 22 '17 at 08:55
  • @Jon Clements that solution is a bit faster and much cleaner too, thank you very much! 0.40 seconds versus 0.64 seconds on the same dataset. I'll keep looking for faster solutions, but this is a step in the right direction. – Jurgy Nov 22 '17 at 09:07
  • @Jurgy I'll add it as a possible answer then? I can't think of any way it can go faster than that though... you're using the vectorized apply (C-level) with a builtin `set` method (C-level) and then summing (C-level)... The only other way to potentially speed it up is to avoid generating the complete result of the `apply`, which, while avoiding the memory overhead, moves the iteration back into a Python-level for-loop, so you lose all the advantages there. – Jon Clements Nov 22 '17 at 09:13
  • I guess you are right. You can add it as an answer. – Jurgy Nov 22 '17 at 09:24