My code generates a semicolon-delimited CSV with no spacing and no headers. The CSV contains a mix of strings and float values; the strings are folder names. The data looks like this:
folder_a;folder_b;33.9
folder_b;folder_c;89.4
folder_a;folder_c;90.2
My end goal is to convert this CSV data into an adjacency matrix so that I can feed it into Scikit for hierarchical clustering.
Each row of the CSV records a pair of folder names (folder_x and folder_y) and a corresponding value (you can think of it as an edit distance percentage, which means normalization is not needed). In other words, the CSV provides the values needed to fill in an adjacency matrix (or, more specifically, a minimum edit distance table):
ID | a | b | c |
---|---|---|---|
a | 0 | 33.9 | 90.2 |
b | 33.9 | 0 | 89.4 |
c | 90.2 | 89.4 | 0 |
I am not sure what approach I should take here. How should I convert this CSV data into an adjacency matrix that can be fed into Scikit? Note that the diagonal entries should always be 0 and the matrix should be symmetric: corresponding pairs of folders (e.g. (a,b) and (b,a)) should have the same value.
I am aware of an existing question (CSV to adjacency matrix), but it seems the author there really wanted to convert the data to a normal array rather than an adjacency matrix.
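To make the target concrete, here is a minimal sketch of the conversion I have in mind, using only the standard library. The function name `build_distance_matrix` and the `SAMPLE` string are made up for illustration, and the sketch assumes every row has exactly three fields and that any pair missing from the CSV should simply stay at 0.

```python
import csv
import io

# Sample rows copied from the data above; in practice this would be an open file.
SAMPLE = "folder_a;folder_b;33.9\nfolder_b;folder_c;89.4\nfolder_a;folder_c;90.2\n"


def build_distance_matrix(lines):
    """Return (labels, matrix) built from rows of the form name;name;value."""
    # Parse each semicolon-delimited row into (name, name, float).
    rows = [(a, b, float(v)) for a, b, v in csv.reader(lines, delimiter=";")]

    # Collect every distinct folder name and give each a fixed row/column index.
    labels = sorted({name for a, b, _ in rows for name in (a, b)})
    index = {name: i for i, name in enumerate(labels)}

    # Start from all zeros so the diagonal is 0 without extra work.
    n = len(labels)
    matrix = [[0.0] * n for _ in range(n)]

    # Write each value to both (i, j) and (j, i) to keep the matrix symmetric.
    for a, b, value in rows:
        i, j = index[a], index[b]
        matrix[i][j] = value
        matrix[j][i] = value
    return labels, matrix


labels, matrix = build_distance_matrix(io.StringIO(SAMPLE))
for name, row in zip(labels, matrix):
    print(name, row)
```

If this is on the right track, I assume the nested list could be wrapped with `numpy.asarray` and passed to a clustering routine that accepts a precomputed distance matrix (as far as I can tell, scikit-learn's `AgglomerativeClustering` takes `metric="precomputed"` in recent versions, with a linkage other than `"ward"`), but I have not verified that this is the recommended route.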