In my code, I run many subset operations on more or less large dataframes. Unfortunately, the columns contain lists; I do not insist on storing the data with pandas, but I haven't found a better option yet. The dataframes follow this pattern, though they can get very large:
list_column_one        list_column_two           other_column_1   other_column_2
["apple", "orange"]    ["cucumber", "tomato"]    1                "bread"
I tried subsetting them like this when I want only the rows with a certain value in a non-list column:
df[[d == some_value for d in df["other_column_1"]]]
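(For this non-list case I assume the plain vectorized mask is equivalent; I mention it only because the list columns below are the actual problem:)

df[df["other_column_1"] == some_value]  # assumed-equivalent boolean mask for the scalar column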
Like this when I want only the rows whose list column contains a certain value:
df.loc[df["list_column_one"].map(lambda d: some_value in d)]
Or like this, when the list in a column should be a subset of another list:
from collections import Counter

# source: https://stackoverflow.com/a/15147825/7253302
def counterSubset(list1, list2):
    # True if every element of list1 appears in list2 at least as often
    c1, c2 = Counter(list1), Counter(list2)
    for k, n in c1.items():
        if n > c2[k]:
            return False
    return True
important_list = ["apple", "orange", "bear"]
df[[counterSubset(d, important_list) for d in df["list_column_one"]]]
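(When duplicates in the lists don't matter, I assume a plain set comparison does the same job as counterSubset, though I don't know whether it would be noticeably faster:)

df[[set(d) <= set(important_list) for d in df["list_column_one"]]]  # assumed equivalent when duplicates are irrelevant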
But all of these approaches still slow the code down massively because they are executed so often. Is there any way to use Cython/NumPy/another package for data storage in order to speed up these lookups?