0

I have a two-column dataframe where each row represents a pair.

import pandas as pd
x = pd.DataFrame([['dog', 'cat'], ['fish', 'parrot'], ['dog', 'llama'], ['pig', 'sloth']])

My goal is to convert this into a square matrix, where both the index and column headers are filled with the unique values of the original dataframe, like this:

enter image description here

Using the helpful answer here, I can make a matrix based on the values:

df6 = x.pivot_table(index=0, columns=1, values=1, aggfunc='size', fill_value=0)

This is not quite what I want because it is not square (certain values, like 'dog,' are present in the index but omitted from the columns).

I altered the above to manually type in the items for the columns and rows:

df7 = df6.reindex(index=["cat","fish","pig","llama","parrot","sloth"], columns=["cat","fish","pig","llama","parrot","sloth"], fill_value=0)

Again, this is not quite what I want because it is time-consuming to construct. So I tried adding a line to get the unique list of values:

listOfItems = pd.unique(df.values.ravel('K')

This doesn't work because it gives me 0 and 1, rather than the string values. So I tried obtaining the unique values of the header and column using the following:

listOfColumns = df6.columns
listOfIndex = df6.index
joinedlist = listOfColumns + listOfIndex

but I get an error message: operands could not be broadcast together with shapes (4,) (3,)

Does anyone have a good way to make a square matrix?

oymonk
  • 427
  • 9
  • 27

1 Answers1

1

I have very little experience in pandas so am sure someone will come along with some nifty pandas command to transform the data as you wish however in the mean time i fancied a play of this and came up with the below code.

import pandas as pd
pairs = [['dog', 'cat'], ['fish', 'parrot'], ['dog', 'llama'], ['pig', 'sloth']]
x = pd.DataFrame(pairs)

names = sorted(set([name for name in [*x[0], *x[1]]]))
data = [[0]*len(names) for _ in range(len(names))]
y = pd.DataFrame(data=data, columns=names, index=names)
for pair in pairs:
    y[pair[1]][pair[0]] = 1

print(y)

OUTPUT

        cat  dog  fish  llama  parrot  pig  sloth
cat       0    0     0      0       0    0      0
dog       1    0     0      1       0    0      0
fish      0    0     0      0       1    0      0
llama     0    0     0      0       0    0      0
parrot    0    0     0      0       0    0      0
pig       0    0     0      0       0    0      1
sloth     0    0     0      0       0    0      0
Chris Doyle
  • 10,703
  • 2
  • 23
  • 42
  • Thank you for your answer. It works for the toy dataset, but not for the real one - I get an error on the line y[pair[1]][pair[0]] ('int' object is not subscriptable'). I think this is because the real dataset doesn't have an index (it just has a bunch of zeros). Will try again and get back to you. – oymonk Dec 17 '19 at 21:46
  • It works!! I altered it slightly: x = pd.DataFrame(pairs) names = sorted(set([name for name in [*x[0], *x[1]]])) x = x.pivot_table(index=0, columns=1, values=1, aggfunc='size', fill_value=0) df7 = x.reindex(index=names, columns=names, fill_value=0) df7.to_csv('df7.csv') – oymonk Dec 17 '19 at 23:27
  • awesome glad it helped. Like i say i dont have much experience with pandas so there might be a an easier way to do this but at least this gets you going for now – Chris Doyle Dec 17 '19 at 23:28