
I was wondering if someone could help me out with the following problem :

I'm given a .csv file which holds wave height measurements from the past few years (2010 - 2016); see the image below.

Dominant wave heights:

[image: preview of the dominant wave height measurements]

Now the problem states that we should compute the annual mean from the given time series using a double loop: the so-called 'outer loop' runs over the years of record, and the 'inner loop' runs over the data and adds up all the maximum wave heights for the year under consideration (while keeping track of the number of values for that year).

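In outline, that double-loop structure might look roughly like this (just a sketch, using a tiny hand-made frame with illustrative values; the real data would of course come from the .csv file):

    import pandas as pd

    # tiny illustrative frame standing in for the real .csv data
    df = pd.DataFrame({
        'Time': pd.to_datetime(['2010-07-13 11:00', '2010-07-13 12:00', '2011-07-13 15:00']),
        'max_wave_height': [1.60, 1.41, 3.22],
    })

    for year in range(2010, 2017):            # outer loop: the years of record
        total = 0.0                           # running sum of max wave heights for this year
        count = 0                             # number of measurements seen for this year
        for i in range(len(df)):              # inner loop: every measurement in the record
            if df['Time'].iloc[i].year == year:
                total += df['max_wave_height'].iloc[i]
                count += 1
        if count:                             # skip years with no data
            print(year, total / count)        # annual mean of the maximum wave heights
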
At this point I was able to create the DataFrame from the .csv file using read_csv and drop the unnecessary columns with the `del` statement:

    df = read_csv('PR1_waves_triaxys_historical.csv', parse_dates=[0], delimiter=',', skipinitialspace=True)

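The column removal itself isn't shown in the snippet above; with the column names that appear in the self-answer at the bottom, it would look roughly like this:

    # drop the columns that are not needed for the annual-mean exercise
    # (column names taken from the self-answer below)
    del df['dominant_wave_period']
    del df['mean_wave_direction']
    del df['significant_wave_height']
    del df['mean_wave_direction_spread']
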
Since I'm quite new to Python (and to programming in general), I would really appreciate any form of help!

UPDATE: the DataFrame is now in the following form:

                     Time  max_wave_height
    0 2010-07-13 11:00:00             1.60
    1 2010-07-13 12:00:00             1.41
    2 2010-07-13 13:00:00             1.78
    3 2010-07-13 14:00:00             1.98
    4 2010-07-13 15:00:00             2.22
    5 2010-07-13 16:00:00             1.96
  • Could you provide a sample (3-5 rows) DF in text/CSV form and desired data set? Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU - stand with Ukraine Mar 10 '17 at 10:28
  • you can probably do this by just `df.groupby(df['Time'].dt.year)['max_wave_height'].max()` but not sure if it needs more – EdChum Mar 10 '17 at 10:35

2 Answers


IIUC you can do it this way:

    In [180]: df
    Out[180]:
                     Time  max_wave_height
    0 2010-07-13 11:00:00             1.60
    1 2010-07-13 12:00:00             1.41
    2 2010-07-13 13:00:00             1.78
    3 2010-07-13 14:00:00             1.98
    4 2010-07-13 15:00:00             2.22
    5 2010-07-13 16:00:00             3.96
    6 2011-07-13 15:00:00             3.22
    7 2011-07-13 16:00:00             5.55

    In [181]: df.groupby(df.Time.dt.year)['max_wave_height'].agg(['count','max'])
    Out[181]:
          count   max
    Time
    2010      6  3.96
    2011      2  5.55
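
Since the assignment ultimately asks for the annual mean, the same groupby could also aggregate `'mean'`; this is an extension of the snippet above, not part of the original answer:

    df.groupby(df.Time.dt.year)['max_wave_height'].agg(['count', 'mean', 'max'])
    # for the sample frame above: 2010 -> mean ≈ 2.158, 2011 -> mean ≈ 4.385
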
MaxU - stand with Ukraine
  • Thanks for your quick response! But in your solution you're not using any form of the double 'for loop', right? Correct me if I'm wrong! – Thijs Mar 10 '17 at 10:42
  • @T_R, the whole idea of pandas/numpy/scipy is to use fast vectorized solutions instead of loops. It's usually orders of magnitude faster... – MaxU - stand with Ukraine Mar 10 '17 at 10:44
  • I totally get that! But unfortunately I'm asked to do it with the use of a double loop, which is where I got stuck. – Thijs Mar 10 '17 at 10:47

The solution to the problem was finally computed in the following form:

    from pandas import read_csv

    df = read_csv('PR1_waves_triaxys_historical.csv', parse_dates=[0], delimiter=',', skipinitialspace=True)

    # drop the columns that are not needed for the annual mean
    del (df['dominant_wave_period'], df['mean_wave_direction'],
         df['significant_wave_height'], df['mean_wave_direction_spread'])

    # outer loop: the years of record
    for j in range(2010, 2017):
        print(j)
        A = 0    # running sum of the max wave heights for year j
        B = 0    # number of measurements found for year j

        # inner loop: every row of the data
        for i in range(len(df)):
            if df.iloc[i, 0].year == j:
                A = df.loc[i, 'max_wave_height'] + A
                B = B + 1

        print(A)
        print(B)
        if B > 0:            # guard against years with no measurements
            mean = A / B
            print(mean)
Thijs