In my code, I run many subset operations on more or less large dataframes. Unfortunately, the columns contain lists; I do not insist on storing the data with pandas, but I haven't found a better option yet. The dataframes follow this pattern, though they can get very large:
list_column_one        list_column_two           other_column_1   other_column_2
["apple", "orange"]    ["cucumber", "tomato"]    1                "bread"
I tried subsetting them like this when I want only the rows with a certain value in a non-list column:
df[[d == some_value for d in df["other_column_1"]]]
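(For this non-list case I assume the plain vectorized mask is equivalent; I mention it only because the list columns below are the actual problem:)

df[df["other_column_1"] == some_value]  # assumed-equivalent boolean mask for the scalar column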
Like this when I want only the rows whose list column contains a certain value:
df.loc[df["list_column_one"].map(lambda d: some_value in d)]
Or like this, when the list in a column should be a subset of another list:
from collections import Counter

# source: https://stackoverflow.com/a/15147825/7253302
def counterSubset(list1, list2):
    # True if every element of list1 appears in list2 at least as often
    c1, c2 = Counter(list1), Counter(list2)
    for k, n in c1.items():
        if n > c2[k]:
            return False
    return True
important_list = ["apple", "orange", "bear"]
df[[counterSubset(d, important_list) for d in df["list_column_one"]]]
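(When duplicates in the lists don't matter, I assume a plain set comparison does the same job as counterSubset, though I don't know whether it would be noticeably faster:)

df[[set(d) <= set(important_list) for d in df["list_column_one"]]]  # assumed equivalent when duplicates are irrelevant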
But all of these approaches still slow the code down massively because they are executed so often. Is there any way to use Cython/NumPy/another package for data storage in order to speed up these lookups?