
I have a pandas DataFrame. Here is a sample:

df =
    a1   a2   a3   a4   a5
     0    1    1    1    0    # dict[a3_a4] = 1, dict[a2_a4] = 1, dict[a2_a3] = 1
     1    1    1    0    0    # dict[a1_a2] = 1, dict[a1_a3] = 1, dict[a2_a3] = 1

I need a function that takes a DataFrame as input, counts the rows in which two columns are both 1, and stores the count for each pair of columns in a dictionary.

The output dict should look like this: {'a1_a2': 1, 'a2_a3': 2, 'a3_a4': 1, 'a1_a3': 1, 'a2_a4': 1}

Pseudocode, if needed: pseudo_code

PS: I am new to Stack Overflow, so please forgive any mistakes.

  • What have you tried so far? Please read this https://stackoverflow.com/help/minimal-reproducible-example and this https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples and edit your question accordingly. – alec_djinn Jun 11 '20 at 13:20
  • My data has 2,000 rows and 20k columns, and only 35% of the cells contain the value 1. How can I also reduce the run time? – Sharad Rai Jun 11 '20 at 17:47

1 Answer

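You can use itertools.combinations to get all pairs of columns. Then multiply the two columns of each pair element-wise and sum the products. With 0/1 data the product is 1 only in rows where both columns are 1, so the sum is the number of rows in which the pair appears together.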

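from itertools import combinations

import pandas as pd

# All unordered pairs of column names, e.g. ('a1', 'a2')
cc = list(combinations(df.columns, 2))
# Element-wise product of each pair: 1 only in rows where both columns are 1
df1 = pd.concat([df[c[1]] * df[c[0]] for c in cc], axis=1, keys=cc)
# Flatten the (a, b) column keys into 'a_b' strings
df1.columns = df1.columns.map('_'.join)

# Column sums are the co-occurrence counts
d = df1.sum().to_dict()

print(d)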

Output:

{'a1_a2': 1,
 'a1_a3': 1,
 'a1_a4': 0,
 'a1_a5': 0,
 'a2_a3': 2,
 'a2_a4': 1,
 'a2_a5': 0,
 'a3_a4': 1,
 'a3_a5': 0,
 'a4_a5': 0}
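
To make the snippet reproducible, here is a minimal self-contained sketch that builds the sample frame from the question and wraps the same multiply-and-sum idea in a function. The function name `pair_counts` and the `sep` parameter are illustrative assumptions, and the last step filters out zero counts only because the question's expected output lists non-zero pairs.

```python
from itertools import combinations

import pandas as pd


def pair_counts(df, sep="_"):
    """Count, for every pair of columns, the rows where both are 1.

    Hypothetical helper; assumes every cell is 0 or 1.
    """
    counts = {}
    for a, b in combinations(df.columns, 2):
        # For 0/1 data the product is 1 only where both columns are 1.
        counts[f"{a}{sep}{b}"] = int((df[a] * df[b]).sum())
    return counts


# Sample frame from the question.
df = pd.DataFrame(
    [[0, 1, 1, 1, 0],
     [1, 1, 1, 0, 0]],
    columns=["a1", "a2", "a3", "a4", "a5"],
)

d = pair_counts(df)
# Keep only the pairs that appear together at least once.
nonzero = {k: v for k, v in d.items() if v > 0}
print(nonzero)
```

The unfiltered `d` matches the ten-entry dictionary above; `nonzero` has the same five entries as the dictionary in the question.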
DavideBrex
  • for a, b in combinations(df, 2): dict[ a + b ] = sum([ x == y for x, y in zip(df[a], df[b])]) – Edward Jun 11 '20 at 13:31
  • @Edward This gives a1_a5: 1, which I think is wrong – DavideBrex Jun 11 '20 at 13:48
  • is it? row 0 of both is 0 – Edward Jun 11 '20 at 13:52
  • a1_a5 should be 0 because in neither row 0 nor row 1 are both values 1. Judging from the output dictionary, he wants a pair of columns to count only in rows where the two values are (1,1). – DavideBrex Jun 11 '20 at 14:14
  • My data has 2,000 rows and 20k columns, and only 35% of the cells contain the value 1. How can I reduce the run time? @DavideBrex – Sharad Rai Jun 11 '20 at 17:47
  • You should probably drop all columns that contain only zeros. You don't care about those, right? – DavideBrex Jun 11 '20 at 18:00
  • @SharadRai I am sorry but I don't know what to do to speed up the process. Also I am not sure my code works with more than 2 rows in the dataframe. Please update the question with an example with more rows and columns and the precise expected output. – DavideBrex Jun 11 '20 at 20:22
  • No issue I got the solution – Sharad Rai Jun 12 '20 at 18:33
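
The thread does not record the solution the asker found. One possible way to address the speed question, sketched here under the assumption that every cell is 0 or 1, is to drop the all-zero columns as suggested in the comments and then replace the per-pair loop with a single matrix product: entry (i, j) of XᵀX is the number of rows in which columns i and j are both 1. This is a swapped-in technique, not the answer's method, and the function name `pair_counts_matmul` is hypothetical.

```python
import numpy as np
import pandas as pd


def pair_counts_matmul(df, sep="_"):
    """Co-occurrence counts via one matrix product (assumes 0/1 values)."""
    # An all-zero column can never appear together with another column.
    df = df.loc[:, (df != 0).any(axis=0)]
    x = df.to_numpy(dtype=np.int32)
    co = x.T @ x  # co[i, j] = rows where columns i and j are both 1
    cols = df.columns
    # Strict upper triangle (i < j), keeping only pairs with a non-zero count.
    i_idx, j_idx = np.nonzero(np.triu(co, k=1))
    return {f"{cols[i]}{sep}{cols[j]}": int(co[i, j])
            for i, j in zip(i_idx, j_idx)}


# Sample frame from the question.
df = pd.DataFrame(
    [[0, 1, 1, 1, 0],
     [1, 1, 1, 0, 0]],
    columns=["a1", "a2", "a3", "a4", "a5"],
)

result = pair_counts_matmul(df)
print(result)
```

For 20k columns the count matrix alone has 400 million entries (about 1.6 GB as int32), and the dictionary can hold up to roughly 200 million pairs, so it may be more practical to keep the matrix `co` and look pairs up on demand rather than build the full dictionary.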