I am a self-taught Python learner trying to improve, so any help is very welcome. Thanks a lot! I want to compute a Jaccard similarity over one column of my dataframe, matching rows on another column. The df looks like this:
name    bag number  item    quantity
sally   1           BANANA  3
sally   2           BREAD   1
franck  3           BANANA  2
franck  3           ORANGE  1
franck  3           BREAD   4
robert  4           ORANGE  3
jenny   5           BANANA  4
jenny   5           ORANGE  2
There are about 80 item categories. Each bag number (sample) belongs to a single shopper, but a shopper can have more than one bag, and quantities range from 0 to 4. I would like to iterate through the bag numbers and compare the contents of each pair of bags with a Jaccard similarity or distance, if possible with the option of using the quantity as a weight in the comparison. The ideal result would be a dataframe like the one in "Python Pandas Distance matrix using jaccard similarity".
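For reference, here is a minimal sketch of one way the comparison described above could be set up with pandas and NumPy. It assumes the sample data shown earlier, reshapes it to one row per bag and one column per item, and then computes both a plain (presence/absence) Jaccard similarity and a quantity-weighted variant (sum of element-wise minima over sum of element-wise maxima). The variable names (`counts`, `jaccard`, `weighted`) and the choice to score two empty bags as 1.0 are assumptions made for illustration, not something taken from the linked questions.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["sally", "sally", "franck", "franck", "franck", "robert", "jenny", "jenny"],
    "bag number": [1, 2, 3, 3, 3, 4, 5, 5],
    "item": ["BANANA", "BREAD", "BANANA", "ORANGE", "BREAD", "ORANGE", "BANANA", "ORANGE"],
    "quantity": [3, 1, 2, 1, 4, 3, 4, 2],
})

# One row per bag, one column per item, cell = total quantity (0 if absent).
counts = df.pivot_table(index="bag number", columns="item",
                        values="quantity", aggfunc="sum", fill_value=0)

# Unweighted Jaccard: |A & B| / |A | B| on presence/absence.
present = (counts > 0).to_numpy().astype(int)
intersection = present @ present.T            # shared items per pair of bags
sizes = present.sum(axis=1)                    # number of distinct items per bag
union = sizes[:, None] + sizes[None, :] - intersection
# Assumption: two empty bags (union == 0) are treated as identical (1.0).
sim = np.divide(intersection, union, out=np.ones(union.shape), where=union > 0)
jaccard = pd.DataFrame(sim, index=counts.index, columns=counts.index)

# Weighted Jaccard: sum(min(a, b)) / sum(max(a, b)) over item quantities.
values = counts.to_numpy(dtype=float)
mins = np.minimum(values[:, None, :], values[None, :, :]).sum(axis=2)
maxs = np.maximum(values[:, None, :], values[None, :, :]).sum(axis=2)
wsim = np.divide(mins, maxs, out=np.ones_like(mins), where=maxs > 0)
weighted = pd.DataFrame(wsim, index=counts.index, columns=counts.index)

print(jaccard.round(2))
print(weighted.round(2))
```

The corresponding distance is `1 - jaccard` (or `1 - weighted`). The weighted step builds an array of shape (bags, bags, items), so with many bags it may need to be computed in chunks.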
I feel that the solution is somewhere between "How to compute jaccard similarity from a pandas dataframe" and "How to apply a custom function to groups in a dask dataframe, using multiple columns as function input".
I am thinking I should iterate through a mask to set up the two arguments of the Jaccard function. But in every example I have seen, the items to compare are in different columns, so I am kind of lost here. Thanks a lot for helping! Cheers
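To make the "items in one column" situation concrete, here is a small sketch of how the long-format data could be compared directly, without moving items into separate columns: each bag is collected into a Python set, and every pair of bags is passed to a two-argument Jaccard function. The helper `jaccard_similarity` and the output column names `bag_a`, `bag_b` and `jaccard` are hypothetical names chosen for this example, and dropping rows with a quantity of 0 is an assumption about how such rows should be handled.

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["sally", "sally", "franck", "franck", "franck", "robert", "jenny", "jenny"],
    "bag number": [1, 2, 3, 3, 3, 4, 5, 5],
    "item": ["BANANA", "BREAD", "BANANA", "ORANGE", "BREAD", "ORANGE", "BANANA", "ORANGE"],
    "quantity": [3, 1, 2, 1, 4, 3, 4, 2],
})


def jaccard_similarity(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Assumption: a row with quantity 0 means the item is not in the bag.
# Note that a bag with only zero-quantity rows disappears at this step.
bags = df[df["quantity"] > 0].groupby("bag number")["item"].apply(set)

# One row per unordered pair of bags.
rows = [
    (i, j, jaccard_similarity(bags.loc[i], bags.loc[j]))
    for i, j in combinations(bags.index, 2)
]
pairs = pd.DataFrame(rows, columns=["bag_a", "bag_b", "jaccard"])
print(pairs)
```

This pairwise loop is easy to read but grows quadratically with the number of bags, so it is mainly useful for checking results on a small sample.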