I have an existing dataframe like this:

     desease      occurences    symptoms

                                s1_d1
1    desease_1    200           s2_d1
                                s3_d1

                                s1_d2
2    desease_2    300           s2_d2
                                s3_d2


I would like to create a new dataframe from it, with one 0/1 column per symptom, like this:

       s1_d1       s2_d1       s3_d1       s1_d2       s2_d2       s3_d2       occurences        desease

1       1           1           1           0           0           0            200           desease_1        

2       0           0           0           1           1           1            300           desease_2 

I tried a lot of methods but didn't get any useful result. Can someone help me with a good trick, please?

LW001
  • Can you show the dataframe by doing `df.to_dict()` and adding that here? – Suraj Shourie Aug 23 '23 at 15:28
  • Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. – itprorh66 Aug 23 '23 at 16:44
  • How did you get that data to begin with? It might be more efficient to process the data BEFORE moving into pandas. – Tim Roberts Aug 23 '23 at 18:24
  • What are the actual contents of the original `symptoms` column? Is that a list within a cell? – Reinderien Aug 23 '23 at 18:42
  • Is [`pandas.DataFrame.pivot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html) along the lines of what you are looking for? – hwhap Aug 27 '23 at 07:40

1 Answer

The `str.get_dummies` method is a helpful trick here: it splits each string in a column on a separator and converts the pieces into binary columns (0s and 1s).

Here is the code that will give you your desired output:

import pandas as pd

data = {
    'desease': ['desease_1', 'desease_2'],
    'occurences': [200, 300],
    'symptoms': ['s1_d1\ns2_d1\ns3_d1', 's1_d2\ns2_d2\ns3_d2']
}

# Create a DataFrame from the sample data
df = pd.DataFrame(data)

# Use Pandas to convert the 'symptoms' column into binary columns
symptom_columns = df['symptoms'].str.get_dummies(sep='\n')

# Concatenate the binary symptom columns with the original DataFrame,
# including 'occurences' and 'desease' columns
result_df = pd.concat([symptom_columns, df[['occurences', 'desease']]], axis=1)

# Define the desired order of columns
columns_order = ['s1_d1', 's2_d1', 's3_d1', 's1_d2', 's2_d2', 's3_d2', 'occurences', 'desease']

# Reorder the columns in the result DataFrame
result_df = result_df[columns_order]

print(result_df)

The following result is printed:

   s1_d1  s2_d1  s3_d1  s1_d2  s2_d2  s3_d2  occurences    desease
0      1      1      1      0      0      0         200  desease_1
1      0      0      0      1      1      1         300  desease_2
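
The comments ask whether each `symptoms` cell might actually hold a list rather than a newline-separated string. The question does not say, so the following is only a sketch under that assumption: the list-valued sample data is hypothetical, and `explode` plus `pd.get_dummies` stands in for `str.get_dummies`. It also derives the column order from first appearance instead of hard-coding it.

```python
import pandas as pd

# Hypothetical variant of the sample data: each 'symptoms' cell holds a list
df = pd.DataFrame({
    'desease': ['desease_1', 'desease_2'],
    'occurences': [200, 300],
    'symptoms': [['s1_d1', 's2_d1', 's3_d1'], ['s1_d2', 's2_d2', 's3_d2']],
})

# One row per (original row, symptom); the original index label is repeated
exploded = df['symptoms'].explode()

# One indicator column per symptom, then collapse back to one row per original index
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

# get_dummies sorts columns alphabetically; restore order of first appearance
indicators = indicators[exploded.unique().tolist()]

# Attach the remaining columns by index
result_df = indicators.join(df[['occurences', 'desease']])

print(result_df)
```

For the sample data this should give the same indicator columns as the string-based version, followed by `occurences` and `desease`.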
Natália