0

I am currently working on a big data set that looks like this:

Here you can see the number of daily COVID-19 cases reported.

The problem that I have is that the cases that are reported daily are divided in 'Negative', 'Positive' and 'Inconclusive' cases. My goal is to sum up the number of cases that are reported daily but at the same time I also want to create separate columns for each kind of cases (a column for the Negative cases per day, one for the Positive ones and another one for the Inconclusive ones).

To reach my goal all I need to do is to somewhat filter the data set by creating a condition using the overall_outcome column and new_results_reported column. I tried it with the negative cases:

america3 = pd.DataFrame(data, columns = ['overall_outcome', 'new_results_reported']) contain_values = america3[america3['overall_outcome'].str.contains('Negative')] contain_values.head(20)

and I got this

I just don't know if I am doing this correctly.If what I did is somewhat correct then I still can't figure out how to create a new column using the negative cases numbers only. And if it's not correct then I do not know what step to take next. I guess the problem is that the overall_outcome is an object and the new_results_reported is an int64.

I hope that I am making sense.

Kraigolas
  • 5,121
  • 3
  • 12
  • 37
dalsoos
  • 3
  • 2
  • 2
    Welcome to StackOverflow! Please see how to create a [mre]. Specifically, we need an example of your expected output. [How to make good reproducible pandas examples](https://stackoverflow.com/a/20159305/11659881) is also a very relevant discussion to this question. – Kraigolas Jun 26 '21 at 19:56

1 Answers1

0

Panda's groupby function is what you want, I think. Create a new dataframe with only the columns you want, with the tally of all negative, positive, or inconclusive cases summed by outcome and date. See this toy example:

import pandas as pd

row1list = ['Negative', 100, '2020/03/01', 'abc']
row2list = ['Negative', 200, '2020/03/01', 'abc']
row3list = ['Negative', 110, '2020/03/02', 'abc']
row4list = ['Negative', 220, '2020/03/02', 'abc']
row5list = ['Inconclusive', 10, '2020/03/01', 'bcd']
row6list = ['Inconclusive', 20, '2020/03/01', 'bcd']
row7list = ['Positive', 3, '2020/03/01', 'tuv']
row8list = ['Positive', 5, '2020/03/01', 'tuv']
america3 = pd.DataFrame([row1list, row2list, row3list, row4list, row5list, row6list, row7list, row8list],
                  columns=['overall_outcome', 'new_results_reported', 'date', 'col_xyz'])

df_dailies = america3.groupby(['overall_outcome', 'date'], as_index=False).agg({
    'new_results_reported': sum})

Gives you this dataframe


  overall_outcome        date  new_results_reported
0    Inconclusive  2020/03/01                    30
1        Negative  2020/03/01                   300
2        Negative  2020/03/02                   330
3        Positive  2020/03/01                     8
Hammurabi
  • 1,141
  • 1
  • 4
  • 7