I have a large DataFrame with 100 million records and am trying to optimize the run time by using numpy.
Sample data:
import numpy as np
import pandas as pd

dat = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'item': ['beauty', 'beauty', 'shoe', 'shoe', 'handbag'],
                    'mylist': [['beauty', 'something'], ['shoe', 'something', 'else'],
                               ['shoe', 'else', 'some'], ['else'], ['some', 'thing', 'else']]})
dat
   ID     item                   mylist
0   1   beauty      [beauty, something]
1   2   beauty  [shoe, something, else]
2   3     shoe       [shoe, else, some]
3   4     shoe                   [else]
4   5  handbag      [some, thing, else]
I am trying to filter the rows where the string in the item column exists in that row's mylist column, using:
dat[np.where(dat['item'].isin(dat['mylist']), True, False)]
But I am not getting any rows back; all of the above values come out as False.
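For reference, this is what the mask itself evaluates to on my data (shown only to illustrate the failure):

np.where(dat['item'].isin(dat['mylist']), True, False)
array([False, False, False, False, False])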
I could get the required results using:
dat[dat.apply(lambda row: row['item'] in row['mylist'], axis=1)]
   ID    item               mylist
0   1  beauty  [beauty, something]
2   3    shoe   [shoe, else, some]
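To give a sense of why the row-wise apply worries me at this scale, here is a rough timing sketch I ran on synthetic data (the size, values, and variable names below are illustrative, not my real data):

import time

import numpy as np
import pandas as pd

# illustrative synthetic frame, far smaller than the real 100M-row data
n = 1_000_000
big = pd.DataFrame({
    'ID': np.arange(n),
    'item': np.random.choice(['beauty', 'shoe', 'handbag'], size=n),
    'mylist': [['shoe', 'something', 'else']] * n,
})

# time the row-wise apply approach that currently gives the correct result
start = time.perf_counter()
mask = big.apply(lambda row: row['item'] in row['mylist'], axis=1)
print(f"row-wise apply took {time.perf_counter() - start:.2f}s for {n:,} rows")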
But since numpy operations are faster, I am trying to use np.where. Could someone please let me know how to fix the code?
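For comparison, this is the kind of elementwise np.where usage I had in mind; the arrays below are a toy example, not taken from my data:

import numpy as np

a = np.array(['beauty', 'shoe', 'handbag'])
b = np.array(['beauty', 'sock', 'handbag'])

# elementwise comparison is vectorized, which is why I expected np.where to help here
np.where(a == b, True, False)
# array([ True, False,  True])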