I am using pandas to build a DataFrame that looks like this:

import pandas

ratings = pandas.DataFrame({
    'article_a':[1,1,0,0],
    'article_b':[1,0,0,0],
    'article_c':[1,0,0,0],
    'article_d':[0,0,0,1],
    'article_e':[0,0,0,1]
},index=['Alice','Bob','Carol','Dave'])

I would like to compute a second matrix from this one that compares each row against every other row. Suppose, for example, that the computation is the length of the intersection set. I'd like an output DataFrame with len(intersection(Alice,Bob)), len(intersection(Alice,Carol)), and len(intersection(Alice,Dave)) in the first row, with each following row comparing that user against the others in the same way. For this example input, the output matrix would be 4x3:

len(intersection(Alice,Bob)),len(intersection(Alice,Carol)),len(intersection(Alice,Dave))
len(intersection(Bob,Alice)),len(intersection(Bob,Carol)),len(intersection(Bob,Dave))
len(intersection(Carol,Alice)),len(intersection(Carol,Bob)),len(intersection(Carol,Dave))
len(intersection(Dave,Alice)),len(intersection(Dave,Bob)),len(intersection(Dave,Carol))

Is there a named method for this kind of function-based computation in pandas? What would be the most efficient way to accomplish this?

DeaconDesperado

2 Answers

I am not aware of a named method, but I have a one-liner.

In [21]: ratings.apply(lambda row: ratings.apply(
... lambda x: np.equal(row, x), 1).sum(1), 1)
Out[21]: 
       Alice  Bob  Carol  Dave
Alice      5    3      2     0
Bob        3    5      4     2
Carol      2    4      5     3
Dave       0    2      3     5
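
Note that the table above counts the positions where two rows agree, shared zeros included. For the intersection length described in the question, one possible sketch, assuming the ratings only ever hold 0 or 1, is a matrix product of the frame with its own transpose. The name `overlap` is illustrative, and the diagonal (each user against themselves) is left in.

```python
import pandas as pd

ratings = pd.DataFrame({
    'article_a': [1, 1, 0, 0],
    'article_b': [1, 0, 0, 0],
    'article_c': [1, 0, 0, 0],
    'article_d': [0, 0, 0, 1],
    'article_e': [0, 0, 0, 1],
}, index=['Alice', 'Bob', 'Carol', 'Dave'])

# Assumption: every entry is 0 or 1. Under that assumption, the dot product
# of two rows counts the articles that both users rated 1, which is the
# size of the intersection of their rated-article sets.
overlap = ratings.dot(ratings.T)

print(overlap)
```

This avoids the nested `apply` calls, so it should scale better on larger frames, but it only matches the intersection idea for binary data.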
Dan Allan

@Dan Allan's solution is 'right'; here's a slightly different way of approaching the problem:

In [26]: ratings
Out[26]: 
       article_a  article_b  article_c  article_d  article_e
Alice          1          1          1          0          0
Bob            1          0          0          0          0
Carol          0          0          0          0          0
Dave           0          0          0          1          1

In [27]: ratings.apply(lambda x: (ratings.T.sub(x,'index')).sum(),1)
Out[27]: 
       Alice  Bob  Carol  Dave
Alice      0   -2     -3    -1
Bob        2    0     -1     1
Carol      3    1      0     2
Dave       1   -1     -2     0
Jeff
  • Interesting. I replaced my list comprehension with a slightly nicer nested apply. But this is even more compact. I wonder if ``np.equal`` can be worked into it.... – Dan Allan Jun 04 '13 at 18:27
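
On the question of working `np.equal` into a more compact form, one possible sketch is to drop down to NumPy and let broadcasting compare every pair of rows at once. It assumes the frame is small enough that an intermediate array of shape (rows, rows, columns) fits in memory; the names `values`, `matches`, and `agreement` are illustrative.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    'article_a': [1, 1, 0, 0],
    'article_b': [1, 0, 0, 0],
    'article_c': [1, 0, 0, 0],
    'article_d': [0, 0, 0, 1],
    'article_e': [0, 0, 0, 1],
}, index=['Alice', 'Bob', 'Carol', 'Dave'])

values = ratings.to_numpy()

# Broadcasting a (4, 1, 5) view against a (1, 4, 5) view gives a (4, 4, 5)
# boolean array: entry [i, j, k] is True when users i and j agree on article k.
matches = np.equal(values[:, None, :], values[None, :, :]).sum(axis=2)

# Put the user labels back on both axes.
agreement = pd.DataFrame(matches, index=ratings.index, columns=ratings.index)

print(agreement)
```

For the example data this should reproduce the agreement counts from the nested `apply` one-liner, shared zeros included.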