I'm a beginner in Python and I have the same problem: I want to build a matrix from my CSV data, which I load into a pandas DataFrame in the following format:
disease symptom frequence
0 d1 s1 Very frequent (99-80%)
1 d1 s2 Very frequent (99-80%)
2 d2 s1 Frequent (79-30%)
3 d2 s3 Very frequent (99-80%)
4 d3 s2 Occasional (29-5%)
5 d4 s1 Very frequent (99-80%)
6 d4 s2 Frequent (79-30%)
7 d4 s3 Occasional (29-5%)
8 d5 s3 Occasional (29-5%)
9 d5 s4 Very frequent (99-80%)
(d = disease name, s = symptom name)
I would like to create a matrix between diseases and symptoms in order to predict which diseases are associated with which symptoms. The main purpose of the matrix is to run mathematical tests on it.
I would like it to look like this:
s1 s2 s3 s4 s5 s6
d1 1 1 0 0 0 0
d2 1 0 1 0 0 0
d3 0 1 1 1 1 1
d4 1 0 1 0 0 0
d5 0 0 1 1 0 0
If disease d is associated with symptom s, the cell should be 1; otherwise it should be 0.
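To make the 1/0 rule concrete, here is a minimal sketch on a toy copy of the sample rows above. It uses pd.crosstab, which is not the approach from ysearka's answer, and it assumes the columns are named "Disease" and "Symptom" as in my attempt. Only symptoms that actually appear in the data become columns.

```python
import pandas as pd

# Toy rows copied from the sample above; column names are an assumption
data = pd.DataFrame({
    "Disease": ["d1", "d1", "d2", "d2", "d3", "d4", "d4", "d4", "d5", "d5"],
    "Symptom": ["s1", "s2", "s1", "s3", "s2", "s1", "s2", "s3", "s3", "s4"],
})

# Count each (disease, symptom) pair, then cap counts at 1 so that
# repeated pairs still give a 0/1 indicator
matrix = pd.crosstab(data["Disease"], data["Symptom"]).clip(upper=1)
print(matrix)
```

The same call should work on the full DataFrame read from the CSV, provided the column names match.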
My data is large: 72036 rows × 3 columns.
My attempt, based on the previous answer from ysearka:
import pandas as pd
import numpy as np
import io
data = pd.read_csv("disease_sym_frq_list.csv", sep="[;,]", engine='python')
data
dat_mat= io.StringIO("""\data
""")
mat = pd.read_csv(dat_mat, delim_whitespace=True)
data['norm'] = data.groupby('Disease')['Frequence'].transform('sum')
m = pd.merge(data, mat, left_on='Symptom', right_index=True)
m[mat.index] = m[mat.index].multiply(m['Frequence'] / m['norm'], axis=0)
output = m.groupby('Disease')[mat.index].sum()
output.columns.name = 'Symptom'
print(output)
The output was:
Empty DataFrame
Columns: []
Index: []
How can I resolve this problem?
Any help would be much appreciated. Thanks!