0

I have a really long list (10 years) of hourly values and I would like to average column 3, per day. Such that each date will have an average value derived from 3rd column.

My data loooks like this:

>     1/1/2005,16:00:00,83.3971,-3.8950
>     1/1/2005,17:00:00,0.0000,-3.9146
>     1/1/2005,18:00:00,0.0000,-3.9337
>     1/1/2005,19:00:00,0.0000,-3.9532
>     1/1/2005,20:00:00,0.0000,-3.9727
>     1/1/2005,21:00:00,0.0000,-3.9920
>     1/1/2005,22:00:00,0.0000,-4.0116
>     1/1/2005,23:00:00,0.0000,-4.0311
>     1/2/2005,0:00:00,0.0000,-4.0503
>     1/2/2005,1:00:00,0.0000,-4.0697
>     1/2/2005,2:00:00,0.0000,-4.0891
>     1/2/2005,3:00:00,0.0000,-4.1083
>     1/2/2005,4:00:00,0.0000,-4.1279
>     1/2/2005,5:00:00,0.0000,-4.1472
>     1/2/2005,6:00:00,0.0000,-4.1662
>     1/2/2005,7:00:00,0.0000,-4.1858
>     1/2/2005,8:00:00,0.0000,-4.2053
>     1/2/2005,9:00:00,152.7058,-4.2242
>     1/2/2005,10:00:00,302.6400,-4.2436
>     1/2/2005,11:00:00,405.2218,-4.2630
>     1/2/2005,12:00:00,452.6208,-4.2821
>     1/2/2005,13:00:00,441.4662,-4.3016
>     1/2/2005,14:00:00,372.5459,-4.3208
>     1/2/2005,15:00:00,250.8291,-4.3398
>     1/2/2005,16:00:00,86.6172,-4.3592
>     1/2/2005,17:00:00,0.0000,-4.3785
>     1/2/2005,18:00:00,0.0000,-4.3973
>     1/2/2005,19:00:00,0.0000,-4.4167
>...
12/30/2014,23:00:00,0.0000,0.7601
12/31/2014,0:00:00,0.0000,0.7601
12/31/2014,1:00:00,0.0000,0.7601
12/31/2014,2:00:00,0.0000,0.7601
12/31/2014,3:00:00,0.0000,0.7601
12/31/2014,4:00:00,0.0000,0.7601
12/31/2014,5:00:00,0.0000,0.7601
12/31/2014,6:00:00,0.0000,0.7601
12/31/2014,7:00:00,0.0000,0.7601
12/31/2014,8:00:00,0.0000,2.6808
12/31/2014,9:00:00,153.8084,1.6338
12/31/2014,10:00:00,301.9711,1.3491
12/31/2014,11:00:00,402.5888,1.2512
12/31/2014,12:00:00,447.9860,1.2191
12/31/2014,13:00:00,434.9283,1.2277

...

This may be an excellent opportunity to highlight the "Split, Apply, Combine" premise and with a simple case use?

Perhaps Read csv into pandas, index as a datetime object, then groupby day, aggregate sum/divide by count (aka average)?

QUESTION: I need the average daily value and I am starting with the above 10-year, hourly time series. As in, I have an hourly dataset going from Jan 1 2005 to Dec 31 2014, and I want the average daily value based on the 10 years of daily averages from that dataset. You dig?

I have already gone from hourly to daily using:

df = pd.read_csv('file.csv', parse_dates='datetime':0,1]},index_col='datetime', header=True, usecols=[0,1,2])

day_avgs = df.groupby(pd.TimeGrouper('D'))

This returns average daily values, indeed, see below:

date  

2005-01-01  106.307291
2005-01-02  102.578729
2005-01-03  103.332883
2005-01-04  104.139979
2005-01-05  104.999592
... ...
2014-12-02  108.292092
2014-12-03  107.189729
2014-12-04  106.142721
2014-12-05  105.151696

However, I am stumped as to how I can group these daily values in "day_avgs", so group on each date (10 of them) and then average to give one daily average that is the average of all of those individual dates over the 10 year dataset. Capiche?

ie, I would like to have the average of every day (365) in a year, based on the 10 years of daily averages.

geokrowding
  • 621
  • 1
  • 6
  • 13
  • Why divide by 8? you have an additonal 8 observations, are you wanting to discount the 0.0000 values – EdChum Feb 27 '15 at 13:07
  • 1
    Also your question has lots of questions, this is something that is discouraged on SO, ideally 1 question per post so you'll need to edit your question – EdChum Feb 27 '15 at 13:12
  • My question is only one, but there are steps, no doubt. I can handle an average including the zeros for the day, or without, whatever is easiest for respondents. The -8 was only an example of an average calculation. I think this question (again, only 1) is worth not-editing, as I am sure answers will go a long way to helping others. Thanks – geokrowding Feb 27 '15 at 20:45

1 Answers1

0

Find average for each day of the year

#!/usr/bin/env python
from datetime import datetime
import pandas

def same_day(date_string): # remove year
    return datetime.strptime(date_string, "%m/%d/%Y").strftime('%m-%d')

df = pandas.read_csv('input.csv', index_col=0,
                     usecols=[0,2], names=['date', 'value'],
                     converters={'date': same_day})
print(df.groupby(level=0).mean())

Output

            value
date             
01-01  143.991035
01-02  123.232340
12-30    0.000000
12-31  100.981233

It assumes that all hourly values have the same weight in different years.

Find average for each date

pandas allows duplicate values in index.

To group the data by date (1st column) and to find the mean of the 3rd column:

#!/usr/bin/env python
import pandas

df = pandas.read_csv('input.csv', parse_dates=True, index_col=0,
                     usecols=[0,2], names=['date', 'value'])
print(df.groupby(level=0).mean())

Output

                 value
date                  
2005-01-01  143.991035
2005-01-02  123.232340

[2 rows x 1 columns]

The code that uses itertools.groupby() produces the same result:

#!/usr/bin/env python
import csv
from collections import OrderedDict
from datetime import datetime
from itertools import groupby
from operator import itemgetter
from pprint import pprint

def groupby_mean(file):
    mean = OrderedDict()
    for day, same_day_rows in groupby(csv.reader(file), key=itemgetter(0)):
        L = [float(row[2]) for row in same_day_rows]
        mean[datetime.strptime(day, '%m/%d/%Y')] = sum(L) / len(L)
    return mean

with open('input.csv') as file:
    pprint(groupby_mean(file))

Output

{datetime.datetime(2005, 1, 1, 0, 0): 143.99103529411764,
 datetime.datetime(2005, 1, 2, 0, 0): 123.23234}

math.fsum(L) leads to the same result as sum(L) with your input.

jfs
  • 399,953
  • 195
  • 994
  • 1,670
  • Close, but not quite, but it was my fault as I did not explain clearly enough. My question has been updated. – geokrowding Mar 05 '15 at 07:40
  • @geokrowding: let's simplify. My code uses the data from your question. It produces two rows. What result do you expect instead for that data? – jfs Mar 05 '15 at 08:00
  • I think the question was clear. The data is for 10 years, not one. Your method does not provide an answer to this, sorry. Also, level=0 did not work using my data, so not sure how come it worked for you. I accomplished the same thing doing the code I edited above. Thanks for your attempt, and please feel free to take it to the one step further, simply average all the average days...that's the question, simple. – geokrowding Mar 05 '15 at 08:10
  • @geokrowding: your input data spans only two days. You've changed the question. All the code is tested. It works on both Python 2 and 3. `pandas.__version__` is `0.13.1`. – jfs Mar 05 '15 at 08:12
  • I have not changed the question, but I did better qualify it. I assumed it was not advisable to include 10 years worth of hourly values in this post. I better qualified my statement of "very long list" by actually stating its length - 10 years and adding "..." at the end of it. I hope this clears up any misunderstanding. If you are still unclear, please refer to the latter sample of data which includes the head and tail of the dataset. – geokrowding Mar 05 '15 at 08:15
  • @geokrowding: no. The piece that changes the question is that you want to combine all values for the same month/day (leap year?) in different years. You don't need to include 10 years of data: include a couple of lines from several years for couple of dates in each year – jfs Mar 05 '15 at 08:23
  • no, I do not want to "combine all values for the same month/day (leap year?) in different years". Please re-read the updated question as it is put quite clearly there. Put simply: **I need every 10 daily averages to be averaged. ** so output dataset will be 365 values, each a daily average. – geokrowding Mar 05 '15 at 08:26
  • @geokrowding: you misunderstood e.g., if you have hourly values 2005-12-01 and 2011-12-01 you want to combine them into a single average because both 12-01 (same month/day -- it would be wrong to say "day of the year" because 12-01 may be different day due to leap years -- perhaps you could say "for the same Julian day of the year (not counting 02/29)") – jfs Mar 05 '15 at 08:42
  • no I understood correct, and that is why I suggest a review of my question. I cannot word it any more simply for ya. I never mentioned leap year or Feb 29, as I am not worried about them. You use the term Julian Day incorrectly, but that is commonly done. In fact, pandas is guilty of doing that too as if you want to reference "day of year" using strftime you use %j (implying julian, but incorrectly.) For example, Julian day for Jan 1, 2000 is 2,451,545. Not 001. See here [link](http://en.wikipedia.org/wiki/Julian_day) – geokrowding Mar 05 '15 at 08:49
  • @geokrowding: [Julian day may mean different things -- I've used the meaning near `Jn` in this post (as it is used in Python `time` module documentation](http://stackoverflow.com/a/25831416/4279). The link in that post points to IERS that I consider enough of the authority to define what JDN or JD mean (more than wikipedia). – jfs Mar 05 '15 at 08:55
  • sure, point taken, but I never even used the term, and have no need to. If you really want to use the term, if that what will help convey the question, then I am after "Julian Day" or "Day of Year" average based on a 10-year dataset of hourly values. That may help? – geokrowding Mar 05 '15 at 08:59
  • @geokrowding: that is what I meant to say [here](http://stackoverflow.com/questions/28765563/average-values-from-a-column-on-an-hourly-timeseries/28816124?noredirect=1#comment46008859_28816124): combine all values for the same day of the year (not counting leap days) in different years e.g., `sum(data['12-01'])/len(data['12-01'])` will give you average (assuming each day has all hours otherwise you have to decide whether to find 2005-12-01 average first and then to combine it with 2011-12-01 average or just use the same weight for all hours values. – jfs Mar 05 '15 at 09:06
  • looks like I am going to have to burn this question...down voted and terribly misunderstood. Sorry for any confusion and thanks for your input. – geokrowding Mar 05 '15 at 10:12
  • @geokrowding: I've updated the question to show how to get the average for the same "day of year" (same m/d). Your question is unclear even after I've answered it. – jfs Mar 05 '15 at 11:05
  • @geokrowding: my code works for the input in your question (if you remove '...' in the middle). It uses `'%m/%d/Y'` - month, day, year format. Your traceback shows `'%d/%m/%y'` - day, month, year. – jfs Mar 05 '15 at 11:51
  • @geokrowding: you've changed the input format without an apology. Good luck. – jfs Mar 05 '15 at 11:56
  • I actually had it the right way, but pasted the wrong traceback. It was the fact that I have a header and forgot to = True :) I have reworked my question in hopes of making it more clear and salvage some of this as I can't delete the question after having it answered, apparently. Yes, I tried. – geokrowding Mar 05 '15 at 12:01
  • Sorry? What are you referring to? You have been very impatient and quite unclear and misinterpreting with me. Not very nice to have down voted either, this is equally (arguably more) due to your convolution of the question. Is there a language barrier at play here, by chance? Otherwise, I am not sure what's going on here. Thanks for the attempt at helping & effort. – geokrowding Mar 05 '15 at 12:04