I'm currently writing code that reads a .csv file that looks like this:

724070 93730 19800101   0   330 1.5 22000   -1.7    -5      1013.6  78
724070 93730 19800101   100 230 1.5 22000   -2.7    -5.5    1013.7  81
724070 93730 19800101   200 0   0   22000   -3.8    -4.9    1013.9  92
724070 93730 19800101   300 340 1.5 22000   -5.6    -6.1    1013.6  96
724070 93730 19800101   400 0   0   22000   -6.6    -7.7    1013.6  92
724070 93730 19800101   500 330 1.5 22000   -7.1    -8.8    1013.6  88

The first two columns are identifiers, the third column is the date, the fourth column is the hour, and the last seven columns are the values of interest. My end goal is to have daily averaged values of the last seven columns for every day of the year.

I first tried manipulating the data using only arrays, but I was persuaded to use pandas instead, so my code is at an early stage. So far I have:

import pandas as pd

csv = raw_input('What is the name of your file? ') 

cols = ['USAF','NCDC','DATE','HR','WND DIR','WND SPD', 'SKY CVR','TMPC','TMDC','PRES','RH']
data = pd.read_csv(csv, header = None, parse_dates = [['DATE', 'HR']],  names = cols)

I'm having trouble moving on from here since I'm just learning pandas, and I would appreciate some help. The other questions that I looked at have not been helpful so far.

1st) There are three unique "USAF" identifiers in this .csv file. Is there any way I can separate this data frame into three data frames, one per value of the USAF column?

2nd) pandas is having a hard time recognizing my date and time format, which prevents me from moving on to calculating the averages. How do I fix this?

Thanks in advance

climatefreak

1 Answer

Computing mean values over groups of observations is fairly simple. Note that this is not a concept specific to dates: you basically want to compute mean values using the values of some column as the group identifier. The standard code for this is

df = pd.DataFrame(data)
means = df.groupby('DATE').mean()
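
To make the grouping step concrete, here is a minimal sketch on a tiny made-up frame. The rows and the choice of the TMPC and RH columns are only illustrative assumptions modelled on the question's data, not the real file.

```python
import pandas as pd

# Hypothetical sample rows modelled on the question's data (two days).
df = pd.DataFrame({
    "DATE": [19800101, 19800101, 19800102, 19800102],
    "TMPC": [-1.7, -2.7, 0.5, 1.5],
    "RH": [78, 81, 60, 70],
})

# One row per distinct DATE; each selected column is averaged over that day's rows.
daily_means = df.groupby("DATE")[["TMPC", "RH"]].mean()
```

Selecting the value columns before calling mean() keeps identifier columns such as USAF out of the average.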

If you want to separate your data based on three values 'a1', 'a2', 'a3' of a column called 'A', one way to proceed would be

data1 = df[df['A'] == 'a1']
data2 = df[df['A'] == 'a2']
data3 = df[df['A'] == 'a3']
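
If you do not want to type the identifier values by hand, one possible variation is to let groupby produce the sub-frames. This is a sketch under assumptions: the station numbers other than 724070 are invented for illustration, and frames_by_station is just a name chosen here.

```python
import pandas as pd

# Hypothetical frame with three stations; only 724070 appears in the question.
df = pd.DataFrame({
    "USAF": [724070, 724070, 725030, 725030, 744860],
    "TMPC": [-1.7, -2.7, 0.3, 0.1, 1.2],
})

# Iterating over a groupby yields (key, sub-frame) pairs, one per unique USAF value.
frames_by_station = {usaf: group for usaf, group in df.groupby("USAF")}

# Look up a single station's rows by its identifier.
station_rows = frames_by_station[724070]
```

This gives one dataframe per station without knowing the identifiers in advance.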

You can do this on any dataframe, including the one I called means above. However, if the calculations you want to do are the same for all stations, it does not make sense to separate the data sets. I would rather keep the dataset together, do all the operations, and split only when looking at the results and/or plotting. That is cleaner, imo.
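
Keeping the data together could look roughly like the sketch below, which groups by station and date at once. The sample values and the second station number are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical rows: two stations observed on the same day.
df = pd.DataFrame({
    "USAF": [724070, 724070, 725030, 725030],
    "DATE": [19800101, 19800101, 19800101, 19800101],
    "TMPC": [-1.7, -2.7, 0.5, 1.5],
})

# Group on both keys: the result is indexed by (USAF, DATE).
station_daily = df.groupby(["USAF", "DATE"])["TMPC"].mean()

# Split only at the end, e.g. pull out one station's daily series for plotting.
one_station = station_daily.loc[724070]
```

The split then happens only when you need it, on the already averaged result.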

As for getting pandas to recognize columns as dates, I believe this question has been asked (and answered) quite often here.
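
For what it is worth, one way it might be done is to parse the date with an explicit format and add the hour separately. This is only a sketch: it assumes DATE is YYYYMMDD and that HR is encoded as HHMM (so 100 means 01:00), which the sample rows suggest but do not confirm, and TIMESTAMP is a column name invented here.

```python
import pandas as pd

# Hypothetical rows using the question's DATE and HR layout.
df = pd.DataFrame({
    "DATE": [19800101, 19800101, 19800102],
    "HR": [0, 100, 2300],
})

# Parse the YYYYMMDD integers with an explicit format instead of relying on inference.
day = pd.to_datetime(df["DATE"].astype(str), format="%Y%m%d")

# Assumption: HR is HHMM, so split it into whole hours and leftover minutes.
hours = pd.to_timedelta(df["HR"] // 100, unit="h")
minutes = pd.to_timedelta(df["HR"] % 100, unit="min")

# Combine into a single datetime column.
df["TIMESTAMP"] = day + hours + minutes
```

With a real datetime column in place, the daily grouping can use the calendar day of TIMESTAMP as the key.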

FooBar