I have a 6000 row dataframe that looks like this:
index  name   title      appearance
0      John   Article 1  1.0
1      John   Article 3  1.0
2      Jane   Article 1  1.0
3      Jane   Article 2  1.0
4      Sarah  Article 2  1.0
I've created an adjacency matrix by building a title-by-name count matrix and multiplying its transpose by it:
import numpy as np
import pandas as pd
covar_df = pd.DataFrame(columns = df.name.unique(), index = df.title.unique())
covar_df = covar_df.fillna(0)
for index, row in df.iterrows():
    person = df.loc[index, 'name']
    appearance = df.loc[index, 'appearance']
    covar_df.loc[df.loc[index, 'title'], person] += appearance
adjacency_df = pd.DataFrame(np.dot(covar_df.T, covar_df), index = df.name.unique(), columns = df.name.unique())
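For reference, here is a loop-free sketch of the same construction on the five sample rows above. It assumes that `pivot_table` with `aggfunc="sum"` accumulates the same totals as my loop; the names `sample_df`, `counts` and `adjacency` are made up for this illustration, and `pivot_table` sorts the labels rather than keeping the order from `unique()`.

```python
import pandas as pd

# The five sample rows shown above (illustrative, not the real data)
sample_df = pd.DataFrame({
    "name": ["John", "John", "Jane", "Jane", "Sarah"],
    "title": ["Article 1", "Article 3", "Article 1", "Article 2", "Article 2"],
    "appearance": [1.0, 1.0, 1.0, 1.0, 1.0],
})

# Title-by-name matrix: each cell is the summed appearance for that pair
counts = sample_df.pivot_table(
    index="title", columns="name", values="appearance",
    aggfunc="sum", fill_value=0,
)

# Name-by-name matrix: transpose of the count matrix times the count matrix
adjacency = counts.T.dot(counts)
print(adjacency)
```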
Most of the nodes in the adjacency matrix are correct, but some are not. For instance, using the real data, if I input:
[In]: covar_df['John'].sum()
[Out]: 626
But the node where John intersects with John in the adjacency matrix is 630.
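One property that may be relevant here: the diagonal of the transpose-times-matrix product is the sum of the squared entries of each column, so it equals the column sum only when every entry is 0 or 1. The sketch below uses made-up data (`toy_df` and the other names are hypothetical) with one repeated (name, title) pair to show the two numbers diverging. Whether my real data contains such repeats is an assumption I have not verified.

```python
import numpy as np
import pandas as pd

# Made-up data: John appears twice for "Article 1"
toy_df = pd.DataFrame({
    "name": ["John", "John", "John", "Jane"],
    "title": ["Article 1", "Article 1", "Article 3", "Article 1"],
    "appearance": [1.0, 1.0, 1.0, 1.0],
})

# Title-by-name counts; the repeated pair produces a cell equal to 2
toy_counts = toy_df.pivot_table(
    index="title", columns="name", values="appearance",
    aggfunc="sum", fill_value=0,
)
toy_adj = pd.DataFrame(
    np.dot(toy_counts.T, toy_counts),
    index=toy_counts.columns, columns=toy_counts.columns,
)

column_sum = toy_counts["John"].sum()    # 2 + 1
diagonal = toy_adj.loc["John", "John"]   # 2**2 + 1**2

# Number of rows that repeat an earlier (name, title) pair
n_repeats = toy_df.duplicated(subset=["name", "title"]).sum()
print(column_sum, diagonal, n_repeats)
```

Running `df.duplicated(subset=['name', 'title']).sum()` on the real dataframe would show whether any pair is repeated there.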
I'm hesitant to share the dataset itself, so I'm wondering whether something about my code in general could be throwing this off.