
I have a dataframe in which each row shows one transaction and the items within that transaction. Here is what my dataframe looks like:

itemList
A,B,C
B,F
G,A
...

I want to find the frequency of each item (how many times it appears across the transactions). I have defined a dictionary and try to update its values as shown below:

dict = {}
def update(itemList):
    # Update the count of each item in the dict
    for item in itemList.split(','):
        dict[item] = dict.get(item, 0) + 1

df.itemList.apply(lambda x: update(x))

As the apply function gets executed for multiple rows at the same time, multiple rows try to update the values in dict at the same time, and it's causing an issue. How can I make sure multiple updates to dict do not cause any issue?

HHH
    Why do you think *multiple rows try .. at the same time*? `apply` is just a for loop. – Quang Hoang Mar 11 '20 at 20:19
  • As per [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) article, please provide a reproducible sample. By this I mean: a sample dataset we can copy/paste, the output of what you are getting, and a sample of what you want to have as output. – Ukrainian-serge Mar 11 '20 at 20:22
  • You don't need a lambda expression anymore. `df.itemList.apply(update)`. – chepner Mar 11 '20 at 20:28

2 Answers


I think you only need Series.str.get_dummies:

df['itemList'].str.get_dummies(',').sum().to_dict()
#{'A': 2, 'B': 2, 'C': 1, 'F': 1, 'G': 1}

If there are more columns, use:

df.stack().str.get_dummies(',').sum().to_dict()

If you want to count for each row:

df['itemList'].str.get_dummies(',').to_dict('index')
#{0: {'A': 1, 'B': 1, 'C': 1, 'F': 0, 'G': 0},
# 1: {'A': 0, 'B': 1, 'C': 0, 'F': 1, 'G': 0},
# 2: {'A': 1, 'B': 0, 'C': 0, 'F': 0, 'G': 1}}

As @Quang Hoang said in the comments, apply simply applies the function to each row/column in a loop.
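
Because apply is a plain sequential loop, a single shared Counter can also be updated from inside it without any locking; a minimal sketch (the counts name is just for illustration):

from collections import Counter

# apply() calls the function once per row, sequentially, so updating one
# shared Counter is safe; 'counts' is a hypothetical name for illustration.
counts = Counter()
df['itemList'].apply(lambda s: counts.update(s.split(',')))
dict(counts)
#{'A': 2, 'B': 2, 'C': 1, 'F': 1, 'G': 1}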

ansev

You might be better off relying on native Python here:

df = pd.DataFrame({'itemlist':['a,b,c', 'b,f', 'g,a', 'd,g,f,d,s,a,v', 'e,w,d,f,g,h', 's,d,f,e,r,t', 'e,d,f,g,r,r','s,d,f']})

Here is a solution using Counter:

from collections import Counter

df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
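
Note that replace(',','') leaves a plain string of characters, so Counter ends up counting single characters; that works here because every item name is one letter. A split-based sketch (not benchmarked) that would also handle multi-character item names:

from collections import Counter

# Split each row into a list of items, build one Counter per row, then sum them.
df['itemlist'].str.split(',').apply(Counter).sum()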

Some comparisons:

%timeit df['itemlist'].str.split(',', expand = True).stack().value_counts().to_dict()
2.64 ms ± 99.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df['itemlist'].str.get_dummies(',').sum().to_dict()
3.22 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

from collections import Counter
%timeit df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
778 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Vaishali