Data sliced incorrectly on year level?

Question

I am supposed to calculate the monthly average temperatures for the entire data (i.e. for each year separately). My data contains daily logs of temperature from 1952 to 2017:

fp = "data/1091402.txt"
data = pd.read_csv(fp, skiprows= [1], sep='\s+', na_values=['-9999'] )

data['DATE_str'] = data['DATE'].astype(str)

data['DATE_month'] = data['DATE_str'].str.slice(start=0, stop=6)

data['DATE_month'] = data['DATE_month'].astype(int)

grouped_month = data.groupby('DATE_month')

I think the expected amount of months should be lower then 780 ( 65 years times 12 months) but it gives me 790 month (which for sure is not true, because my data ends in April). Problem actually starts already with years which after slicing and grouping suppose to be 65 and it gives me 66. Where did I make a mistake?

does your data include logs in 2017? because there are 66 years from 1952 to 2017, if you include 2017. — Kevin Wang, Mar 26 '19 at 18:27
Yups.. I forgot that in need to include 2017 as well :) Thanks :) — Anna Jakubowska, Mar 26 '19 at 18:43

score 0 · Answer 1 · edited Dec 05 '21 at 02:24

Pandas provides powerful data support and built in date rules. Relying on the input data from a spreadsheet (you do not provide an example of yours) is risky. So trying to manipulate the index by concatenating years-month as a String may not be the optimal solution. For example, you can have erroneous date or month values such as a month 0 or year 11999. So getting the date-month into a date type will be helpful.

When your data is loaded, go after it with the proper index type and use a Pandas grouper (as you do above) to which you can apply the desired statistical function. Here is an MCVE you can run on your side to see how things work:

import pandas as pd
import random as r

daterange = pd.date_range('1952-01-01', '2018-01-01', freq='MS')
df = pd.DataFrame(index=daterange, data={'values': [r.randint(-10,110) for i in range(len(daterange))]})
print(type(df.index))
grouper = df.groupby([df.index.year, df.index.month])
print(grouper.mean())

The grouper can then be used to ask questions about how many years there are, as well as months. This is what you do above, but without aid of the DateTimeIndex.

Getting your csv loaded can also involve converting the dates that are within it. A good example of that strategy can be found here: Reading a csv with a timestamp column, with pandas

Data sliced incorrectly on year level?

1 Answers1