8

I have a text file with four columns: year, month, day and snow depth. This is daily data for a 30-year period, 1979-2009.

I would like to calculate 360 (30yrs X 12 months) individual monthly averages using pandas (i.e. isolating all the values for Jan-1979, Feb-1979,... Dec-2009 and averaging each). Can anyone help me out with some example code?

1979    1   1   3
1979    1   2   3
1979    1   3   3
1979    1   4   3
1979    1   5   3
1979    1   6   3
1979    1   7   4
1979    1   8   5
1979    1   9   7
1979    1   10  8
1979    1   11  16
1979    1   12  16
1979    1   13  16
1979    1   14  18
1979    1   15  18
1979    1   16  18
1979    1   17  18
1979    1   18  20
1979    1   19  20
1979    1   20  20
1979    1   21  20
1979    1   22  20
1979    1   23  18
1979    1   24  18
1979    1   25  18
1979    1   26  18
1979    1   27  18
1979    1   28  18
1979    1   29  18
1979    1   30  18
1979    1   31  19
1979    2   1   19
1979    2   2   19
1979    2   3   19
1979    2   4   19
1979    2   5   19
1979    2   6   22
1979    2   7   24
1979    2   8   27
1979    2   9   29
1979    2   10  32
1979    2   11  32
1979    2   12  32
1979    2   13  32
1979    2   14  33
1979    2   15  33
1979    2   16  33
1979    2   17  34
1979    2   18  36
1979    2   19  36
1979    2   20  36
1979    2   21  36
1979    2   22  36
1979    2   23  36
1979    2   24  31
1979    2   25  29
1979    2   26  27
1979    2   27  27
1979    2   28  27
R_Dax
  • 706
  • 3
  • 10
  • 25
DJV
  • 863
  • 3
  • 15
  • 30
  • Can you add example data? – dawg Apr 21 '15 at 03:49
  • possible duplicate of [Compute daily climatology using pandas python](http://stackoverflow.com/questions/20973527/compute-daily-climatology-using-pandas-python) – cphlewis Apr 21 '15 at 03:49
  • @dawg Added two months worth of data. Like I said, though, I have it going through Dec. 31, 2009. – DJV Apr 21 '15 at 04:01

1 Answers1

11

You'll want to group your data by year and month, and then calculate the mean of each group. Pseudo-code:

import numpy as np
import pandas as pd

# Read in your file as a pandas.DataFrame
# using 'any number of whitespace' as the seperator
df = pd.read_csv("snow.txt", sep='\s*', names=["year", "month", "day", "snow_depth"])

# Show the first 5 rows of the DataFrame
print df.head()

# Group data first by year, then by month
g = df.groupby(["year", "month"])

# For each group, calculate the average of only the snow_depth column
monthly_averages = g.aggregate({"snow_depth":np.mean})

For more, about the split-apply-combine approach in Pandas, read here.

A DataFrame is a:

"Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)."

For your purposes, the difference between a numpy ndarray and a DataFrame are not too significant, but DataFrames have a bunch of functions that will make your life easier, so I'd suggest doing some reading on them.

Zachary Cross
  • 2,298
  • 1
  • 15
  • 22
  • Could you provide a little more information on the data frame? I'm used to np.genfromtxt which reads in each column individually, but every tutorial I see has the year, month and day combined into one variable from the onset. – DJV Apr 21 '15 at 04:18
  • I've updated my answer to include a working example, and added a little bit of information regarding DataFrames. Suffice it to say, if you end up using Pandas for data analysis, you will almost certainly end up using DataFrames at some point. – Zachary Cross Apr 21 '15 at 04:49
  • Awesome. This helps immensely. And yeah, I do plan on doing more with Pandas so this is a great jumping-off point. – DJV Apr 21 '15 at 04:54
  • 1
    what is the index is `datetime64` timestamps instead of having year and month as separate columns? – endolith Jul 03 '16 at 18:13