My data in Excel looks like this:

[image: Excel table of features by customer ID]

The fi denote features, the IDi denote customer IDs, and the numbers are how many times a given feature appeared for that ID.

I would like to count the pairs of features that appear together in these IDs and come up with something like this:

[image: matrix of pairwise feature counts]

This matrix is to be interpreted in the following way: (f1,f2) appeared three times together (in ID2, ID3, ID4); (f2,f3) appeared once together (in ID3); (f1,f4) appeared three times together (in ID1, ID2, ID3); and so on.

This is my Jupyter notebook:

[image: screenshot of the notebook]

  • Depending on how you have structured the data in your Python code, a for loop with a counter for each pair should do. – p._phidot_ Jun 14 '21 at 16:34
  • The dataframe looks like the first picture. Thank you for your advice, feel free to share code as well. – Xristos Lymperopoulos Jun 14 '21 at 19:17
  • I mean, in your code (please share), how is table 1 defined? Also, what variable or dataframe structure do you use to store the (f1,f2), (f2,f3), etc. pairs? [You only shared how your data (table 1) looks in Excel, not how it is captured in your code (dataframe variable name/numbering).] – p._phidot_ Jun 14 '21 at 19:51
  • I imported this Excel file into a Jupyter notebook using: df=pd.read_excel("data.xlsx", engine="openpyxl") df.fillna(0,inplace=True) df.set_index("Features",inplace=True). The IDs appear as headers and the features f1, f2, ... appear as the index. – Xristos Lymperopoulos Jun 15 '21 at 06:45

1 Answer

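import pandas as pd

# blank cells are read as NaN; treat them as 0, as in the asker's own import
df = pd.read_excel("data.xlsx").fillna(0)
print(df)

# convert counts to binary (1 = feature present for that ID)
# note: df2 = df does not copy; df2 and df are the same DataFrame
df2 = df
for i in range(4):          # assumes 4 feature rows
    for j in range(1, 5):   # assumes 4 ID columns; column 0 holds the feature names
        if df2.iloc[i, j] > 0:
            df2.iloc[i, j] = 1
print(df2)

# extract the binary values as an array; copy so later writes to df3 cannot change it | ref#2
ar = df2.iloc[:, 1:].to_numpy().copy()

df3 = df2  # again the same DataFrame, not a copy
df3.rename(columns={'ID1': 'f1', 'ID2': 'f2', 'ID3': 'f3', 'ID4': 'f4'}, inplace=True)  # ref#1

for i in range(4):
    for j in range(1, 5):
        if i == (j - 1):
            df3.iloc[i, j] = 0  # a feature is not paired with itself
        else:
            # number of IDs in which feature i and feature j-1 both appear
            df3.iloc[i, j] = sum(ar[i] * ar[j - 1])  # found out that df*df didn't work.

print(df3)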
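
The nested loop above can also be written as a single matrix product. This is only a sketch, not the code from this answer. It assumes the features are the row index and the IDs are the columns, as the asker describes after `set_index("Features")`. The values in `counts` are made up (the real data is only in the screenshot), chosen so that the pairs quoted in the question come out the same.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows are features, columns are customer IDs.
counts = pd.DataFrame(
    {
        "ID1": [2, 0, 0, 1],
        "ID2": [1, 3, 0, 1],
        "ID3": [1, 1, 4, 2],
        "ID4": [5, 1, 0, 0],
    },
    index=["f1", "f2", "f3", "f4"],
)

# 1 where a feature appears for an ID, 0 otherwise.
present = (counts > 0).astype(int)

# Entry (i, j) counts the IDs in which features i and j both appear.
pairs = present.dot(present.T)

# Zero the diagonal so a feature is not paired with itself.
arr = pairs.to_numpy().copy()
np.fill_diagonal(arr, 0)
pairs = pd.DataFrame(arr, index=pairs.index, columns=pairs.columns)

print(pairs)
```

Unlike the loop, this does not hard-code the number of features or IDs, so it should work for any table of this layout.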

dfa = dfb is deceiving: it works like a pointer rather than creating a new variable. Print df2 and df and you will see that both have changed. df2 = df just shares the reference; it does not create a new DataFrame (use df.copy() for that). That's why df = sum(df*df) didn't work.
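
A small illustration of the aliasing described above, using a made-up two-row table (the names `alias` and `clone` are only for this example):

```python
import pandas as pd

df = pd.DataFrame({"ID1": [2, 0], "ID2": [0, 3]}, index=["f1", "f2"])

alias = df         # a second name for the same DataFrame
clone = df.copy()  # an independent copy of the data

# A write through the alias is visible through df, because they are one object.
alias.loc["f2", "ID2"] = 7
# A write to the copy leaves df untouched.
clone.loc["f1", "ID1"] = 99

print(alias is df)          # True
print(clone is df)          # False
print(df.loc["f2", "ID2"])  # 7
print(df.loc["f1", "ID1"])  # 2
```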

Please give it a try and let me know whether it works and whether it is understandable.

Ref :

[1] https://www.geeksforgeeks.org/how-to-rename-columns-in-pandas-dataframe/

[2] Pandas dataframe, using iloc to replace last row

p._phidot_