1

I am messing around in the NYT covid dataset which has total covid cases for each county, per day.

I would like to find out the difference of cases between each day, so theoretically I could get the number of new cases per day instead of total cases. Taking a rolling mean, or resampling every 2 days using a mean/sum/etc all work just fine. It's just subtracting that is giving me such a headache.

Tried methods:

  • df.resample('2d').diff()
    • 'DatetimeIndexResampler' object has no attribute 'diff'

  • df.resample('1d').agg(np.subtract)
    • ufunc() missing 1 of 2required positional argument(s)

  • df.rolling(2).diff()
    • 'Rolling' object has no attribute 'diff'

  • df.rolling('2').agg(np.subtract)
    • ufunc() missing 1 of 2required positional argument(s)

Sample data:

pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
               'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
               'covid_cases':[1.2,2.0,2.9,3.6,3.9]
              })

enter image description here

Desired sample output:

pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
               'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
               'new_covid_cases':[np.nan,0.8,0.9,0.7,0.3]
              })

enter image description here

Recreate sample data from original NYT dataset:

df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',parse_dates=['date'])
df.groupby(['state','date'])[['cases']].mean().reset_index()

Any help would be greatly appreciated! Would like to learn how to do this manually/via function rather than finding a "new cases" dataset as I will be working with timeseries a lot in the very near future.

sc4s2cg
  • 153
  • 9

2 Answers2

3

Let's try this bit of complete code:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')

df['date'] = pd.to_datetime(df['date'])

df_daily_state = df.groupby(['date','state'])['cases'].sum().unstack()

daily_new_cases_AL = df_daily_state.diff()['Alabama']

ax = daily_new_cases_AL.iloc[-30:].plot.bar(title='Last 30 days Alabama New Cases')

Output:

enter image description here

Details:

  • Download the historical case records from NYTimes github using the raw URL
  • Convert the dtype of the 'date' column to datetime dtype
  • Groupby 'date' and 'state' columns sum 'cases' and unstack the state level of the index to get dates of rows and states for columns.
  • Take the difference by columns and select only the Alabama column
  • Plot the last 30 days
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
2

The diff function is correct, but if you look at your error message:

'DatetimeIndexResampler' object has no attribute 'diff'

in your first tried methods, it's because diff is a function available for DataFrames, not for Resamplers, so turn it back into a DataFrame by specifying how you want to resample it.

If you have the total number of COVID cases for each day and want to resample it to 2 days, you probably only want to keep the latest update out of the two days, in which case something like df.resample('2d').last().diff() should work.

Fantailed
  • 98
  • 5
  • Hmm, that does work. I will have to mess around a bit to convince myself, but it seems like if I do a `df.resample('2d').sum().diff` then I should get back new cases over 4 days. Since I'm summing up `day1 + day2 = new_day2`, and then `day3 + day4 = new_day4`, and then subtracting them `new_day4 - new_day2 = new_day4`. Right? – sc4s2cg Aug 19 '20 at 14:26
  • Sorry, I brainfarted there. If you have the cases per day and you want to resample to every 2 days, you probably only want to keep the latest update though, right? In which case, `last()` is the function you are looking at. Will update my answer. Sorry for the confusion! – Fantailed Aug 19 '20 at 14:40
  • I _think_ you're right. The dataset is of total cases each day (so 0,0,1. Then if the fourth day we find 3 new cases it becomes 0,0,1,4). So with your method of `day1|day2` it will keep `day2`. Then of `day3|day4` it will keep `day4`. Finally, `.diff()` will take the difference so that `day4-day2 = new_day4`. Is that the basic logic of it? Just want to double and triple check, because I went ahead and graphed mean.diff, sum.diff, and last.diff methods and while the first two are pretty much identical to each other, the last one is wildly different. – sc4s2cg Aug 19 '20 at 14:51
  • Maybe it's easier to think of things not in terms of 1 day vs. 2 day frequency, but in hourly vs. daily. If you had to squash hourly COVID case numbers into a single number, you would probably report the number at the end of the day, which is what many sites do, so we can assume it's an accurate way of resampling data. Once you know you have accurate data, the diff function just does its job. – Fantailed Aug 19 '20 at 15:07
  • Aha, that makes a ton of sense. Lets say we want to squish 7 days worth of case numbers into one. If we take the last day's number, that represents everything before it too (since its the sum of new cases). Then just use diff to subtract week1 from week 2 and you have the number of new cases between weeks. I think I got it, thanks for all your help! – sc4s2cg Aug 19 '20 at 15:16