0

I have a database that looks like this.

ID   Covid_pos  Asymptomatic   Fever   Cough  Rash
1        1          0            1      0      1
2        0          0            0      1      0
3        1          1            0      1      1
4        1          0            1      0      1
5        0          1            1      0      0

From this data my goal is to create an output that looks like this

Symptom          All Tested(5308, 100%)   SARS-COV-2 PCR positive (N,%) 
Asymptomatic        2528(47.63%)                 163(6.45%)
Fever               958(23.85%)                  43(3.53%)
Cough               159(3.95%)                   22(9.72%)
Rash                19(23.05%)                   88(18.40%)

I have written a code that will produce the desired output for one of my my variables; however, I would like to create either a macro or function so I can apply it to all of my symptom variables. Thus, instead of copying and pasting this code 8+ times and changing it whenever the code says "Asymptomatic" to the next symptom, I was curious if there are any other options you suggest I explore. Somewhat new to to Python, so all strategies are welcome!

AsyOdds_Percent = pd.crosstab(df_merged2["Asymptomatic"],df_merged2.Covid_pos)
AsyOdds_Percent = pd.DataFrame(AsyOdds_Percent.to_records()).rename(columns={'Asymptomatic':'Asymptomatic','0':'Neg_%','1':'Pos_%'}).fillna(0)
AsyOdds_Percent["Total_%"] = AsyOdds_Percent.sum(axis=1)

AsyOdds_Count=pd.crosstab(df_merged2["Asymptomatic"],df_merged2.Covid_pos)
AsyOdds_Count1 = pd.DataFrame(AsyOdds_Count.to_records()).rename(columns={'Asymptomatic':'Asymptomatic','0':'Neg_N','1':'Pos_N'}).fillna(0)
AsyOdds_Count1["Total_N"] = AsyOdds_Count1.sum(axis=1)

cols = AsyOdds_Percent.columns[1:4]
AsyOdds_Percent[cols] = AsyOdds_Percent[cols]/AsyOdds_Percent[cols].sum()*100
Merged = pd.merge(AsyOdds_Count1,AsyOdds_Percent, on='Asymptomatic', how='left')
Merged['%_Pos'] = (Merged['Pos_N']/Merged['Total_N'])*100
Merged['%_Pos'] = round(Merged['%_Pos'], 2)
Merged['Total_%'] = round(Merged['Total_%'], 2)
Merged = Merged[['Asymptomatic','Pos_N','Pos_%','Neg_N','Neg_%','Total_N','Total_%','%_Pos']]
Merged = Merged.loc[Merged['Asymptomatic'] == 1]
Merged = Merged[['Asymptomatic','Total_N','Total_%','Pos_N','%_Pos']]
Merged = Merged.rename(columns = {"Asymptomatic": "Symptoms"})

a1 = (Merged["Symptoms"] == 1)
conditions = [a1]
Merged['Symptoms'] = np.select([a1], ['Asymptomatic'])
  
Merged['All Tested (5308, 100%)'] = Merged['Total_N'].map(str) + '(' + Merged['Total_%'].map(str) + '%)'
Merged['SARS-COV-2 PCR positive (N,%)'] = Merged['Pos_N'].map(str) + '(' + Merged['%_Pos'].map(str) + '%)'
Merged=Merged[['Symptoms','All Tested (5308, 100%)','SARS-COV-2 PCR positive (N,%)']]
print(Merged)

OUTPUT:

       Symptoms All Tested (5308, 100%) SARS-COV-2 PCR positive (N,%)
1  Asymptomatic            2528(47.63%)                    163(6.45%)
Raven
  • 849
  • 6
  • 17
  • 2
    1: Show us the input and output without the dot `....`, part. Ideally , Input should relate to the output , 2: Show us the exact problem you are facing instead of a ~20 line code – anky Aug 07 '20 at 18:40
  • Edited the "....". My problem, as I stated above, is that I would like to apply this code to each of my variable without having to copy and paste it, while changing the variable and names of the final output dataframes (since I would like to append the outputs into one table). I just do not have a good idea how to execute a solution to my desired goal. – Raven Aug 07 '20 at 18:54
  • 1
    your question is still unclear, please edit the output per the input and create a well defined short explanation of the logic. please edit the body of the question instead of comments..!! Thanks for responding, you might want to visit [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – anky Aug 07 '20 at 18:57
  • 1
    The percentages you have shown need an explanation as well. Can i request of you just run the above code for only the sample data that you have displayed in the question and show expected output accordingly? Needless to say, it will scale up on larger datasets. – Akshay Sehgal Aug 07 '20 at 19:10

2 Answers2

1

I used the following data sample (df):

   Covid_pos  Asymptomatic  Fever  Cough
0          1             0      1      0
1          0             0      0      1
2          1             1      0      1
3          1             0      1      0
4          0             1      1      0
5          1             0      1      0
6          0             1      1      0
7          1             0      0      1
8          0             0      0      0
9          0             0      0      0

Start from defining 3 functions:

def colSums(col):
    return pd.Series([col.sum(), col.loc[1].sum()], index=['All', 'Pos'])
def withPct(x):
    return f'{x}({x / total * 100}%)'
def colTitle(head, n1):
    return f'{head}({n1}, {n1/total*100}%)'

Then compute required totals:

total = df.index.size
totalPos = df.Covid_pos.sum()

And the whole processing (for all source columns) boils down to just 2 instructions:

res = df.set_index('Covid_pos').apply(colSums).T.applymap(withPct)
res.columns = [colTitle('All Tested', total),
    colTitle('SARS-COV-2 PCR positive', totalPos)]

The result is:

             All Tested(10, 100.0%) SARS-COV-2 PCR positive(5, 50.0%)
Asymptomatic               3(30.0%)                          1(10.0%)
Fever                      5(50.0%)                          3(30.0%)
Cough                      3(30.0%)                          2(20.0%)

Edit

To compute percentage in the "positive" column relative to the number of positive cases, proceed as follows:

  1. Compute the result in absolute numbers:

     res = df.set_index('Covid_pos').apply(colSums).T
    
  2. Compute percentages dividing each column by respective divisor:

     wrk = res / [total, totalPos] * 100; wrk
    
  3. Overwrite each column in res with concatenation of "original" values and percentage surrounded in parentheses.

     res.All = res.All.astype(str) + '(' + wrk.All.astype(str) + '%)'
     res.Pos = res.Pos.astype(str) + '(' + wrk.Pos.astype(str) + '%)'
    

Now the result is:

             All Tested(10, 100.0%) SARS-COV-2 PCR positive(5, 50.0%)
Asymptomatic               3(30.0%)                          1(20.0%)
Fever                      5(50.0%)                          3(60.0%)
Cough                      3(30.0%)                          2(40.0%)

withPct function is now not needed.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
  • This is genius, I just need to tinker with it so the percentages are rounded to the hundredths. Would love to learn how your functions work! – Raven Aug 07 '20 at 20:14
  • actually I noticed one thing that needs to be fixed, so the SARS-COV 'N' needs to be a percentage of the N for All Tested, so for example in your example, the % in the SARS columns needs to 1/3 (33%) - not 1/5. – Raven Aug 07 '20 at 20:32
  • 1
    Sorry for delayed correction of the initial solution. I was "out" for last recent days. – Valdi_Bo Aug 16 '20 at 12:40
1

Maybe this works for you -

df = pd.DataFrame({'Covid_pos':[1,0,1,1,0], 'Asymptomatic':[0,0,1,0,1], 'Fever':[1,0,0,1,1], 'Cough':[0,1,1,0,0],'Rash':[1,0,1,1,0]})
df = df.rename(columns = {'Covid_pos':'SARS-COV-2 PCR positive'})
df['All Tested'] = 1   #Adding a dummy column with all values as 1 for ALL TESTED

symptoms = ['Asymptomatic','Fever','Cough', 'Rash']
targets = ['SARS-COV-2 PCR positive', 'All Tested']

df2 = df.set_index(targets).stack().reset_index().set_axis(targets+['symptoms','flg'], axis=1)
df3 = df2.groupby(['symptoms','flg'])[targets].sum().reset_index()
df4 = df3[df3['flg']==1].drop('flg', axis=1)
df4.columns = ['symptoms']+targets
df4[[i+' %' for i in targets]] = df4[targets].apply(lambda x : round(x/x.sum()*100,ndigits=2))
df4
       symptoms  SARS-COV-2 PCR positive  All Tested  \
1  Asymptomatic                        1           2   
3         Cough                        1           2   
5         Fever                        2           3   
7          Rash                        3           3   

   SARS-COV-2 PCR positive %  All Tested %  
1                      14.29          20.0  
3                      14.29          20.0  
5                      28.57          30.0  
7                      42.86          30.0  
Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51
  • @thanks, trying to work through this but what is the purpose of the 'flg'. Im asking because I'm getting the following error ----> 2 df3 = df2.groupby(['symptoms','flg'])[targets].sum().reset_index() AttributeError: 'NoneType' object has no attribute 'groupby' – Raven Aug 07 '20 at 20:01
  • 1
    Since the data is boolean of sorts (0 means True and 1 means False), when you group by Symptoms and the boolean value you get one row for (Symptom1, True) and another for (Symptom1, False). Similarly you get 2 rows for each symptom. Now in your output you only care about the count of rows where the boolean is True. The flag is the variable which holds (TRUE, FALSE or 0,1) for each symptom. After grouping it, i am only selecting the rows with (symptom,True) and dropping the flg completely since its not needed.. If you print df2, it will make it clearer – Akshay Sehgal Aug 07 '20 at 23:22
  • 1
    regarding NoneType, does your data have Nan values? – Akshay Sehgal Aug 07 '20 at 23:23