2

I am new to Python. How do I sum data based on date and plot the result?

I have a Series object with data like:

2017-11-03 07:30:00      NaN
2017-11-03 09:18:00      NaN
2017-11-03 10:00:00      NaN
2017-11-03 11:08:00      NaN
2017-11-03 14:39:00      NaN
2017-11-03 14:53:00      NaN
2017-11-03 15:00:00      NaN
2017-11-03 16:00:00      NaN
2017-11-03 17:03:00      NaN
2017-11-03 17:42:00    800.0
2017-11-04 07:27:00    600.0
2017-11-04 10:10:00      NaN
2017-11-04 11:48:00      NaN
2017-11-04 12:58:00    500.0
2017-11-04 13:40:00      NaN
2017-11-04 15:15:00      NaN
2017-11-04 16:21:00      NaN
2017-11-04 17:37:00    500.0
2017-11-04 21:37:00      NaN
2017-11-05 03:00:00      NaN
2017-11-05 06:30:00      NaN
2017-11-05 07:19:00      NaN
2017-11-05 08:31:00    200.0
2017-11-05 09:31:00    500.0
2017-11-05 12:03:00      NaN
2017-11-05 12:25:00    200.0
2017-11-05 13:11:00    500.0
2017-11-05 16:31:00      NaN
2017-11-05 19:00:00    500.0
2017-11-06 08:08:00      NaN

I have the following code:

# load packages
import pandas as pd
import matplotlib.pyplot as plt

# import painkiller data
df = pd.read_csv('/Users/user/Documents/health/PainOverTime.csv',delimiter=',')

# plot bar graph of date and painkiller amount
times = pd.to_datetime(df.loc[:,'Time'])

ts = pd.Series(df.loc[:,'acetaminophen'].values, index = times,
               name = 'Painkiller over Time')
ts.plot()

This gives me the following line(?) graph:

line graph of raw data

It's a start; now I want to sum the doses by date. However, this code fails to effect any change: The resulting plot is the same. What is wrong?

ts.resample('D',closed='left', label='right').sum()
ts.plot()

I have also tried ts.resample('D').sum(), ts.resample('1d').sum(), ts.resample('1D').sum(), but there is no change in the plot.

Is .resample even the correct function? I understand resampling to be sampling from the data, e.g. randomly taking one point per day, whereas I want to sum each day's values.

Namely, I'm hoping for some result (based on the above data) like:

2017-11-03 800
2017-11-04 1600
2017-11-05 1900
2017-11-06 NaN
DBinJP
  • 247
  • 5
  • 13

3 Answers3

3

Use pandas groupby function.

import io
import pandas as pd

data = io.StringIO('''
2017-11-03 07:30:00,NaN
2017-11-03 09:18:00,NaN
2017-11-03 10:00:00,NaN
2017-11-03 11:08:00,NaN
2017-11-03 14:39:00,NaN
2017-11-03 14:53:00,NaN
2017-11-03 15:00:00,NaN
2017-11-03 16:00:00,NaN
2017-11-03 17:03:00,NaN
2017-11-03 17:42:00,800.0
2017-11-04 07:27:00,600.0
2017-11-04 10:10:00,NaN
2017-11-04 11:48:00,NaN
2017-11-04 12:58:00,500.0
2017-11-04 13:40:00,NaN
2017-11-04 15:15:00,NaN
2017-11-04 16:21:00,NaN
2017-11-04 17:37:00,500.0
2017-11-04 21:37:00,NaN
2017-11-05 03:00:00,NaN
2017-11-05 06:30:00,NaN
2017-11-05 07:19:00,NaN
2017-11-05 08:31:00,200.0
2017-11-05 09:31:00,500.0
2017-11-05 12:03:00,NaN
2017-11-05 12:25:00,200.0
2017-11-05 13:11:00,500.0
2017-11-05 16:31:00,NaN
2017-11-05 19:00:00,500.0
2017-11-06 08:08:00,NaN
''')
column_names = ['date', 'val']
df = pd.read_csv(data, sep=',', header = None, names = column_names)
df['date'] = pd.to_datetime(df['date'])
df = df.groupby(df['date'].dt.date)[['val']].sum()
df.plot()
chifu lin
  • 86
  • 5
0

This answer helped me see that I needed to assign it to a new object (if that's the right terminology):

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/Users/user/Documents/health/PainOverTime.csv',delimiter=',')
# plot bar graph of date and painkiller amount
times = pd.to_datetime(df.loc[:,'Time'])

# raw plot of data
ts = pd.Series(df.loc[:,'acetaminophen'].values, index = times,
               name = 'Painkiller over Time')
fig1 = ts.plot()

# combine data by day
test2 = ts.resample('D').sum()
fig2 = test2.plot()

That produces the following plots:

first plot

second plot

Is this method not better than the 'groupby' function?

Now how do I make a scatter or bar plot instead of this line plot...?

DBinJP
  • 247
  • 5
  • 13
  • tried googling _Now how do I make a scatter or bar plot instead of this line plot...?_ – Evgeny Jun 22 '18 at 08:45
  • `resample` will have the same result and has shorter notation, but `groupby`is not explicit and very traditional (it is similar to SQL queary syntax), good for learnign to manipulate data. – Evgeny Jun 22 '18 at 08:54
  • Why is it better to use `groupby` than `resample`? Is the only reason to obtain familiarity with SQL syntax? – DBinJP Jun 25 '18 at 03:39
0

Short answer: you need .groupby(), not .resample(), as in this answer

Longer code:

import pandas as pd
from io import StringIO

doc = StringIO("""2017-11-03 07:30:00      NaN
2017-11-03 09:18:00      NaN
2017-11-03 10:00:00      NaN
2017-11-03 11:08:00      NaN
2017-11-03 14:39:00      NaN
2017-11-03 14:53:00      NaN
2017-11-03 15:00:00      NaN
2017-11-03 16:00:00      NaN
2017-11-03 17:03:00      NaN
2017-11-03 17:42:00    800.0
2017-11-04 07:27:00    600.0
2017-11-04 10:10:00      NaN
2017-11-04 11:48:00      NaN
2017-11-04 12:58:00    500.0
2017-11-04 13:40:00      NaN
2017-11-04 15:15:00      NaN
2017-11-04 16:21:00      NaN
2017-11-04 17:37:00    500.0
2017-11-04 21:37:00      NaN
2017-11-05 03:00:00      NaN
2017-11-05 06:30:00      NaN
2017-11-05 07:19:00      NaN
2017-11-05 08:31:00    200.0
2017-11-05 09:31:00    500.0
2017-11-05 12:03:00      NaN
2017-11-05 12:25:00    200.0
2017-11-05 13:11:00    500.0
2017-11-05 16:31:00      NaN
2017-11-05 19:00:00    500.0
2017-11-06 08:08:00      NaN""")

df = pd.read_csv(doc, sep='\\s{2,}', 
                 header=None, 
                 converters={'timestamp': pd.to_datetime}, 
                 names = ['timestamp', 'acetaminophen'],
                 engine='python')
df = df.set_index('timestamp')

#true, but rather ugly x axis line
df.plot.bar()

df1 = df.groupby(by=[df.index.date]).sum()
df1.plot.bar()

If you dates are not continious, you can create an empty dataframe with full timeindex and merge df1 with it.

Evgeny
  • 4,173
  • 2
  • 19
  • 39
  • Why do I need `groupby` instead of `resample`? – DBinJP Jun 25 '18 at 03:42
  • You can use either. `groupby` is bit more expressive (to me), also resample is just a wrapper around groudbyas [seen here](https://github.com/pandas-dev/pandas/blob/v0.23.1/pandas/core/generic.py#L6652-L6659). However your original error was not related to it. – Evgeny Jun 25 '18 at 05:19