how to use pandas to count occurrence of specific text in excel

Question

First time here, and I have just started learning to code. I am conducting a clinical study on some risk factors of a disease, and I already have an Excel file of patient data. The purpose of the code is to count the number of risk factors (obesity, hypertension, diabetes, hyperlipidemia) for each patient (each row) and write the result to a new column. The last step is to count how many patients have all 4 risk factors, and how many have 3, 2, only 1, or none.

The data frame is something like this (just an example, not breaking confidentiality):

I made up some data to try this part in Python, and I tried the following code:

import pandas as pd
df1=pd.DataFrame({'gender':['male','male','female','female','male'],'age':[49,60,65,20,65],
                  'obesity':['yes','yes','NaN','NaN','yes'],
                  'hypertension':['yes','yes','yes','NaN','yes'],
                  'diabetes':['NaN','yes','NaN','NaN','yes'],
                  'hyperlipidemia':['yes','yes','yes','NaN','NaN']})
factor_count=[] #to be written in the very right column
row=0
column=3
while row<=5:             #5 rows in total for this example
    count=0               #to count the risk factors of each row
    while column<=5:
        if df.iloc[row,column] == 'yes':         #probably my while loop is really stupid
            count+=1
            column+=1
    factor_count.append(count)
    row+=1
print(factor_count)

After I hit run, the kernel never stops. I am learning to program on my own, so I have no idea what happened, and I had to terminate the kernel. Can someone help me with this?

You're only incrementing `column` when the `if` condition evaluates to True, so you get stuck in the inner while loop forever – Gabriela Melo May 07 '20 at 16:15
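A minimal sketch of how the loop from the question could be repaired, assuming the sample data above: `for` loops over `range` advance the indices on every pass, so neither loop can stall. It also uses `df1` (the question's loop refers to an undefined `df`) and covers columns 2 to 5, which is where the four risk factors sit in this example.

```python
import pandas as pd

# Sample data copied from the question; 'NaN' here is a plain string.
df1 = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female', 'male'],
    'age': [49, 60, 65, 20, 65],
    'obesity': ['yes', 'yes', 'NaN', 'NaN', 'yes'],
    'hypertension': ['yes', 'yes', 'yes', 'NaN', 'yes'],
    'diabetes': ['NaN', 'yes', 'NaN', 'NaN', 'yes'],
    'hyperlipidemia': ['yes', 'yes', 'yes', 'NaN', 'NaN'],
})

factor_count = []                      # one entry per patient (row)
for row in range(len(df1)):            # rows 0..4; range stops before len(df1)
    count = 0                          # risk factors found in this row
    for column in range(2, 6):         # columns 2..5 hold the four risk factors
        if df1.iloc[row, column] == 'yes':
            count += 1
    factor_count.append(count)

print(factor_count)  # [3, 4, 2, 0, 3]
```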

score 0 · Accepted Answer · answered May 07 '20 at 16:30


You can replace the 'yes' values in the dataframe with 1 and then use the sum method:

df1.replace('yes',1,inplace=True)
df1.iloc[:,[2,3,4,5]] = df1.iloc[:,[2,3,4,5]].astype(float)
df1["Numbers of factor"] = df1.iloc[:,[2,3,4,5]].sum(axis=1)

Then a histogram of this column should show how many patients have 1, 2, 3, ... risk factors:

df1["Numbers of factor"].hist()
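As a possible alternative to replacing values and converting dtypes, the per-patient count can be sketched by comparing the risk-factor columns with 'yes' and summing the resulting booleans. This is only a sketch on the sample data from the question; the `risk_cols` list is a name introduced here for illustration, and it selects the columns by name rather than by position.

```python
import pandas as pd

# Sample data copied from the question; 'NaN' here is a plain string.
df1 = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female', 'male'],
    'age': [49, 60, 65, 20, 65],
    'obesity': ['yes', 'yes', 'NaN', 'NaN', 'yes'],
    'hypertension': ['yes', 'yes', 'yes', 'NaN', 'yes'],
    'diabetes': ['NaN', 'yes', 'NaN', 'NaN', 'yes'],
    'hyperlipidemia': ['yes', 'yes', 'yes', 'NaN', 'NaN'],
})

# Illustrative name: the four columns that hold the risk factors.
risk_cols = ['obesity', 'hypertension', 'diabetes', 'hyperlipidemia']

# eq('yes') gives True/False per cell; summing along axis=1 counts the
# True values in each row, so anything that is not 'yes' counts as 0.
df1['Numbers of factor'] = df1[risk_cols].eq('yes').sum(axis=1)

print(df1['Numbers of factor'].tolist())  # [3, 4, 2, 0, 3]
```

Because the comparison leaves the original columns untouched, the 'yes' labels stay available for any later analysis.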

STM


You are magic, it worked! One more question: how do I see the exact number of patients who have 4, 3, 2, and only 1 risk factors without checking the histogram? – seraphczx May 08 '20 at 00:28
@seraphczx This could be of some help - https://stackoverflow.com/a/48770057/8505509 – Ganesh Tata May 08 '20 at 05:42
df1.groupby("Numbers of factor").size() – STM May 08 '20 at 13:27
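For the follow-up question, here is a short sketch of the tally on the sample data from the question. `groupby(...).size()` only lists counts that actually occur, so a count with no patients (1 risk factor in this sample) is missing from its output. One option, shown here as an assumption about what is wanted, is `value_counts()` followed by `reindex` over 0 to 4 so that empty groups appear as 0. The names `risk_cols` and `patients_per_count` are illustrative.

```python
import pandas as pd

# Sample data copied from the question; 'NaN' here is a plain string.
df1 = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female', 'male'],
    'age': [49, 60, 65, 20, 65],
    'obesity': ['yes', 'yes', 'NaN', 'NaN', 'yes'],
    'hypertension': ['yes', 'yes', 'yes', 'NaN', 'yes'],
    'diabetes': ['NaN', 'yes', 'NaN', 'NaN', 'yes'],
    'hyperlipidemia': ['yes', 'yes', 'yes', 'NaN', 'NaN'],
})

# Per-patient count of risk factors (illustrative column selection by name).
risk_cols = ['obesity', 'hypertension', 'diabetes', 'hyperlipidemia']
df1['Numbers of factor'] = df1[risk_cols].eq('yes').sum(axis=1)

# Number of patients for each possible count 0..4; counts that never
# occur are filled in with 0 instead of being left out.
patients_per_count = (
    df1['Numbers of factor']
    .value_counts()
    .reindex(range(5), fill_value=0)
)

print(patients_per_count.to_dict())  # {0: 1, 1: 0, 2: 1, 3: 2, 4: 1}
```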
