I'm a beginner in Python and I have the same problem: I want to create a matrix from my data. I read a CSV file into a pandas DataFrame that has the following format:

   disease  symptom  frequence
0   d1       s1       Very frequent (99-80%)
1   d1       s2       Very frequent (99-80%)
2   d2       s1       Frequent (79-30%)
3   d2       s3       Very frequent (99-80%)
4   d3       s2       Occasional (29-5%)
5   d4       s1       Very frequent (99-80%)
6   d4       s2       Frequent (79-30%)
7   d4       s3       Occasional (29-5%)
8   d5       s3       Occasional (29-5%)
9   d5       s4       Very frequent (99-80%)

Here d = disease name and s = symptom name.

I would like to create a matrix between diseases and symptoms in order to predict which diseases are associated with which symptoms. The main purpose of the matrix is to run mathematical tests on it.

I would like it to look like this:

    s1  s2  s3  s4  s5 s6
d1  1   1   0   0   0  0
d2  1   0   1   0   0  0
d3  0   1   1   1   1  1
d4  1   0   1   0   0  0
d5  0   0   1   1   0  0

If d is associated with s, the matrix should contain 1; otherwise it should contain 0.

My data is too long to post in full: 72036 rows × 3 columns.

Here is my attempt, based on the previous answer from ysearka:

import pandas as pd
import numpy as np
import io

data = pd.read_csv("disease_sym_frq_list.csv", sep="[;,]", engine='python')
data

dat_mat= io.StringIO("""\data

""")
mat = pd.read_csv(dat_mat, delim_whitespace=True)

data['norm'] = data.groupby('Disease')['Frequence'].transform('sum')

m = pd.merge(data, mat, left_on='Symptom', right_index=True)
m[mat.index] = m[mat.index].multiply(m['Frequence'] / m['norm'], axis=0)

output = m.groupby('Disease')[mat.index].sum()
output.columns.name = 'Symptom'
print(output)

The output was:

Empty DataFrame
Columns: []
Index: []

How can I resolve this problem?

If anyone can help me, I would much appreciate it. Thanks!

Ben Aawf

1 Answer

You can simply use pandas.DataFrame.pivot:

df['value'] = 1
df_pivot = df.pivot(index='disease', columns='symptom', values='value').fillna(0)

print(df_pivot)
symptom   s1   s2   s3   s4
disease                    
d1       1.0  1.0  0.0  0.0
d2       1.0  0.0  1.0  0.0
d3       0.0  1.0  0.0  0.0
d4       1.0  1.0  1.0  0.0
d5       0.0  0.0  1.0  1.0

Note: you didn't provide a complete DataFrame, which is why the output does not contain s5, s6, etc.
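
If the full list of symptoms is known in advance, the missing columns can be added explicitly. The following is only a sketch using the sample rows from the question: it swaps pivot for pandas.crosstab (a different technique from the one above, chosen because it returns integer counts) and assumes a hypothetical list all_symptoms, which is not part of the original data.

```python
import pandas as pd

# Sample rows copied from the question (the frequence column is left out here).
df = pd.DataFrame({
    "disease": ["d1", "d1", "d2", "d2", "d3", "d4", "d4", "d4", "d5", "d5"],
    "symptom": ["s1", "s2", "s1", "s3", "s2", "s1", "s2", "s3", "s3", "s4"],
})

# crosstab counts how often each (disease, symptom) pair occurs;
# clip(upper=1) turns those counts into 0/1 presence flags.
matrix = pd.crosstab(df["disease"], df["symptom"]).clip(upper=1)

# Assumption: the complete symptom list is known. s5 and s6 never occur
# in the sample, so they are added as all-zero columns.
all_symptoms = ["s1", "s2", "s3", "s4", "s5", "s6"]
matrix = matrix.reindex(columns=all_symptoms, fill_value=0)

print(matrix)
```

Because crosstab counts rows rather than reshaping them, this version also tolerates repeated disease/symptom pairs.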

Erfan
Yes, all right. My input CSV file is too long to post, but you get my point. Thanks. When I run your script I get "ValueError: Index contains duplicate entries, cannot reshape". What does that mean? – Ben Aawf Mar 21 '19 at 17:10
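
On the ValueError: DataFrame.pivot needs every (index, columns) pair to be unique, so the message most likely means that at least one disease/symptom combination appears on more than one row of the CSV file. The sketch below uses made-up rows (the real file is not available here) to show how such rows could be located, and two possible ways around them: dropping the repeats before pivoting, or using pivot_table, which aggregates repeats instead of raising.

```python
import pandas as pd

# Hypothetical data in which the (d1, s1) pair appears twice.
df = pd.DataFrame({
    "disease": ["d1", "d1", "d1", "d2"],
    "symptom": ["s1", "s1", "s2", "s3"],
    "frequence": ["Very frequent (99-80%)", "Frequent (79-30%)",
                  "Very frequent (99-80%)", "Occasional (29-5%)"],
})

# List every row whose (disease, symptom) pair occurs more than once.
dupes = df[df.duplicated(subset=["disease", "symptom"], keep=False)]
print(dupes)

# Option 1: keep one row per pair, then pivot as in the answer.
unique_pairs = df.drop_duplicates(subset=["disease", "symptom"]).assign(value=1)
matrix = (
    unique_pairs.pivot(index="disease", columns="symptom", values="value")
    .fillna(0)
    .astype(int)
)

# Option 2: pivot_table aggregates repeated pairs (here with max)
# instead of raising an error.
matrix2 = df.assign(value=1).pivot_table(
    index="disease", columns="symptom", values="value",
    aggfunc="max", fill_value=0,
)

print(matrix)
```

Which option fits depends on why the pairs repeat. If the repeated rows carry different frequence values, it is worth inspecting dupes before discarding anything.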