Input
I have the following file input.txt
:
D E F G H
a 1 b 1 4
a 1 c 1 5
b 2 c 2 6
Desired output
How can I create a new data frame, that uses columns D and E as an index? I want a triangular matrix that looks something like this:
a1 b1 c1 b2 c2
a1 0 4 5 0 0
b1 0 0 0 0
c1 0 0 0
b2 0 6
c2 0
1st attempt
I am importing the data frame and I am trying to do a pivot like this:
import pandas as pd
df1 = pd.read_csv(
'input.txt', index_col=[0,1], delim_whitespace=True,
usecols=['D','E','F','G','H'])
df2 = df1.pivot(index=['D', 'E'], columns=['F','G'], values='H')
df1
looks like this:
F G H
D E
a 1 b 1 4
1 c 1 5
b 2 c 2 6
df1.index
looks like this:
MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1], [0, 0, 1]],
names=['D', 'E'])
df2
fails to be generated and I get this error message:
`KeyError: "['D' 'E'] not in index"`
2nd attempt
I thought I had solved it like this:
import pandas as pd
df = pd.read_csv(
'input.txt', delim_whitespace=True,
usecols=['D','E','F','G','H'],
dtype={'D':str, 'E':str, 'F':str, 'G':str, 'H':float},
)
pivot = pd.pivot_table(df, values='H', index=['D','E'], columns=['F','G'])
pivot
looks like this:
F b c
G 1 1 2
D E
a 1 4 5 NaN
b 2 NaN NaN 6
But when I try to do this to convert it to a symmetric matrix:
pivot.add(df.T, fill_value=0).fillna(0)
Then I get this error:
ValueError: cannot join with no level specified and no overlapping names
3rd attempt and solution
I found a solution here. It is also what @Moritz suggested, but I'm new to pandas and didn't understand his comment. I did this:
import pandas as pd
df1 = pd.read_csv(
'input.txt', index_col=[0,1], delim_whitespace=True,
usecols=['D','E','F','G','H'],
dtype={'D':str, 'E':str, 'F':str, 'G':str, 'H':float}
)
df1['DE'] = df1['D']+df1['E']
df1['FG'] = df1['F']+df1['G']
df2 = df1.pivot(index='DE', columns='FG', values='H')
This data frame is generated:
FG b1 c1 c2
DE
a1 4 5 NaN
b2 NaN NaN 6
Afterwards I do df3 = df2.add(df2.T, fill_value=0).fillna(0)
to convert the triangular matrix to a symmetric matrix. Is generating new columns really the easiest way to accomplish what I want? My reason for doing all of this is that I want to generate a heat map with matplotlib and hence need the data to be in matrix form. The final matrix/dataframe looks like this:
a1 b1 b2 c1 c2
a1 0 4 0 5 0
b1 4 0 0 0 0
b2 0 0 0 0 6
c1 5 0 0 0 0
c2 0 0 6 0 0