
I have code that generates a CSV file delimited by semicolons, with no spacing and no headers. The CSV contains a mix of strings and float values; the strings are folder names. The data looks like this:

folder_a;folder_b;33.9
folder_b;folder_c;89.4
folder_a;folder_c;90.2

My end goal is to convert this CSV data into an adjacency matrix so that I can feed it into scikit-learn for hierarchical clustering.

Each row of the CSV records two folder names (folder_x and folder_y) and a corresponding value (you can think of it as an edit distance percentage, which means normalization is not needed). In other words, the CSV data provides the values needed to fill in an adjacency matrix (or, to be more specific, a minimum edit distance table):

ID    a      b      c
a     0      33.9   90.2
b     33.9   0      89.4
c     90.2   89.4   0

I am not sure what approach I should take here. How should I convert this CSV data into an adjacency matrix that can be fed into scikit-learn? Note that the diagonal should always be 0 and that corresponding pairs of folders (e.g. (a, b) and (b, a)) should have the same value.
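
For reference, this is roughly what I plan to do with the matrix afterwards (just a sketch of my intent; I am assuming scikit-learn's AgglomerativeClustering with a precomputed distance matrix is the right tool, and the keyword argument differs between versions):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# the symmetric matrix built from the CSV above (diagonal = 0)
dist = np.array([[ 0.0, 33.9, 90.2],
                 [33.9,  0.0, 89.4],
                 [90.2, 89.4,  0.0]])

model = AgglomerativeClustering(
    n_clusters=2,
    metric='precomputed',  # scikit-learn < 1.2 uses affinity='precomputed'
    linkage='average',     # 'ward' does not work with a precomputed matrix
)
labels = model.fit_predict(dist)  # folders a and b group together, c on its own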

I am aware of an existing question (CSV to adjacency matrix), but it seems like its author really wanted to convert the data into a plain array rather than an adjacency matrix.

akaoru

2 Answers


NOTE: This may not be the most efficient way to do this.

Given the dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': {0: 'folder_a', 1: 'folder_b', 2: 'folder_a'},
                   'col2': {0: 'folder_b', 1: 'folder_c', 2: 'folder_c'},
                   'col3': {0: 33.9, 1: 89.4, 2: 90.2}})

       col1      col2  col3
0  folder_a  folder_b  33.9
1  folder_b  folder_c  89.4
2  folder_a  folder_c  90.2

you can build the matrix like this:

# create a dictionary, e.g. {('folder_a', 'folder_b', 'col3'): 33.9, ...}
mapped = df.set_index(['col1', 'col2']).stack().to_dict()

# mat will store the corresponding value for every folder combination
mat = []
unique_values = np.unique(df[['col1', 'col2']])
for i in unique_values:
    temp = []
    for j in unique_values:
        # look up this pair in either order, since the matrix is symmetric
        val = mapped.get((i, j, 'col3'))
        if val is None:
            val = mapped.get((j, i, 'col3'))
        # the diagonal and any missing pair default to 0
        temp.append(0 if val is None else val)
    mat.append(temp)

df2 = pd.DataFrame(mat, columns=unique_values, index=unique_values)

Output:

            folder_a    folder_b    folder_c
folder_a    0.0         33.9        90.2
folder_b    33.9        0.0         89.4
folder_c    90.2        89.4        0.0
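
As an optional sanity check, you can confirm the symmetry and zero diagonal of the df2 built above before handing it to scikit-learn:

assert np.allclose(df2.values, df2.values.T)  # (a, b) equals (b, a)
assert np.allclose(np.diag(df2.values), 0)    # diagonal is all zeros
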
Epsi95

You can use pandas.pivot_table.

Using the same DataFrame as in @Epsi95's answer, though in your case you can read the .csv into pandas with df = pd.read_csv('test2.csv', sep=';', header=None, names=['col1', 'col2', 'col3']) (the names argument labels the columns so the code below can refer to them).

import pandas as pd

df = pd.DataFrame({'col1': {0: 'folder_a', 1: 'folder_b', 2: 'folder_a'},
 'col2': {0: 'folder_b', 1: 'folder_c', 2: 'folder_c'},
 'col3': {0: 33.9, 1: 89.4, 2: 90.2}})

First, create a reflected copy of the DataFrame (with the two folder columns swapped) and append it, so that both (a, b) and (b, a) are present:

reflected_df = pd.DataFrame(df[['col2', 'col1', 'col3']].values, columns=['col1', 'col2', 'col3'])
df = pd.concat([df, reflected_df], ignore_index=True)

Then add the diagonal zeros, one (folder, folder, 0.0) row per unique folder name:

folders = pd.concat([df['col1'], df['col2']]).unique()
diagonal_df = pd.DataFrame({'col1': folders, 'col2': folders, 'col3': 0.0})
df = pd.concat([df, diagonal_df], ignore_index=True)

Finally, you can get the desired result with pivot_table:

df['col3'] = df['col3'].astype(float)

table = pd.pivot_table(df, values='col3', index='col1', columns='col2')
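
If some folder pairs are missing from the CSV, the pivoted table will contain NaN in those cells. A minimal sketch of the final hand-off, assuming 0 is an acceptable filler and that AgglomerativeClustering with a precomputed metric is the intended scikit-learn step:

from sklearn.cluster import AgglomerativeClustering

matrix = table.fillna(0).values  # pairs absent from the CSV become 0 here
labels = AgglomerativeClustering(
    n_clusters=2,
    metric='precomputed',        # affinity='precomputed' on scikit-learn < 1.2
    linkage='average',
).fit_predict(matrix)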