Datetime dataframe need to be grouped by station ID and then use resampling method for each station ID

Question

I have a data frame in Datetime format which looks as follows:

def read_station_data( fileName ):
    '''Read the contents of the soil moisture data file into a Pandas DataFrame
    where the index is the observation date.  Return the dataframe.  '''
    data = pd.read_csv(fileName, sep='\t', index_col=0, skiprows=2)
    data.index = pd.to_datetime(data.index)
    return data

Outputof this look like:

            Sta  0-10cm  10-30  30-50  ...  130-150  150-170  170-190  190-200
Date                                   ...                                    
1981-02-19    1   37.10  74.15  79.53  ...    73.07    71.67    58.49    27.99
1981-02-24    1   33.28  69.96  76.91  ...    71.74    70.15    57.41    28.33
1981-03-02    1   32.37  66.66  73.27  ...    74.85    73.16    59.72    29.18
1981-03-09    1   31.97  64.64  71.31  ...    72.09    71.84    57.64    28.86
1981-03-17    1   26.23  63.04  70.06  ...    72.89    72.13    58.10    28.71
        ...     ...    ...    ...  ...      ...      ...      ...      ...
2004-06-30    5   31.72  69.89  73.18  ...    60.34    56.52    54.19    27.04
2004-06-30   11   33.35  58.07  62.65  ...    78.06    77.20    74.69    38.24
2004-06-30   13   27.16  52.77  59.70  ...    86.54    81.86    74.03    39.80
2004-06-30   15   23.94  60.69  76.37  ...    67.09    70.22    81.64    41.20
2004-06-30   82   23.66  41.70  67.54  ...    72.18    73.12    78.96    41.20

[8068 rows x 12 columns]

Now I added few more columns to it:

def compute_total_moisture( DataDF ):
    '''Sum the soil moisture per soil column, which has been measured as
    depth of water, so can simply be added together.  Also compute the 
    volumetric water content of the total soil column, by dividing by the 
    total depth (2000 mm) and multiplying by 100%.  Return the original 
    dataframe with two additional columns called 'Total Water Depth (mm)' 
    and 'Total VWC (%)'.'''
    DataDF['Total Water Depth (mm)'] = DataDF.iloc[:,1:12].sum(axis=1)
    DataDF['Total VWC (%)'] = (DataDF['Total Water Depth (mm)']/2000)*100
    return DataDF

Now I want to compute from this data annual average values for Total Water Depth grouping by "Sta" and then resample the data annually and sum "Total water depth (mm)".

def compute_average_moisture_by_station( DataDF, MetaDF ):
    '''Compute the annual average total soil moisture as a depth and as VWC
    for each station.  Add as columns to a copy of the station info dataframe.
    Also compute the annual seasonal average VWC for each station and add to
    the same new dataframe.  Returned dataframe has all of the original columns 
    from the station information file, plus two columns for annual average total
    soil moisture, and four columns for annual average seasonal VWC.'''
    metaDF_copy = MetaDF.copy()
    newDF = DataDF.copy()                    # copy dataframe 
    newDF = newDF.groupby('Sta')                    # group dataframe elements by station
    newDF.index = pd.to_datetime(newDF.index)       <<<---- ERROR IS COMING FROM THIS LINE
 
    # annual total water depth
    metaDF_copy['Annual Total Water Depth (mm)'] = newDF.resample("A(S)-SEP")['Date'].sum(['Total 
    Water Depth (mm)'])

I am getting an error of:

error: 'DataFrameGroupBy' object has no attribute 'index',

For reference how the metaDF dataframe looks:

                    Name Code    Lat    Lon  Altitude
No.                                                  
1              Bondville  BVL  40.05 -88.22       213
2     Dixon Springs-Bare  DXB  37.45 -88.67       165
3             Brownstown  BRW  38.95 -88.95       177
4     Orr Center (Perry)  ORR  39.80 -90.83       206
5                De Kalb  DEK  41.85 -88.85       265
6               Monmouth  MON  40.92 -90.73       229
8                 Peoria  ICC  40.70 -89.52       207
9            Springfield  LLC  39.52 -89.62       177
10            Belleville  FRM  38.52 -89.88       133
11            Carbondale  SIU  37.72 -89.23       137
12                 Olney  OLN  38.73 -88.10       134
13              Freeport  FRE  42.28 -89.67       265
14       Rend Lake (Ina)  RND  38.13 -88.92       130
15                Stelle  STE  40.95 -88.17       213
16                Topeka  MTF  40.30 -89.90       152
17               Oak Run  OAK  40.97 -90.15       229
34             Fairfield  FAI  38.38 -88.38       136
81             Champaign  CMI  40.08 -88.23       219
82   Dixon Springs-Grass  DXG  37.45 -88.67       165

I think there is alreadu DatetimeIndex, so you can remove `newDF.index = pd.to_datetime(newDF.index)`, if not use it after `newDF = DataDF.copy() ` — jezrael, Apr 08 '22 at 05:35
Even when I remove that line, I got this `error 'SeriesGroupBy' object has no attribute 'Total Water Depth (mm)` — Sukrati gautam, Apr 08 '22 at 05:39
`newDF = newDF.groupby('Sta')` doesnt return a dataframe but rather a groupby object that contains information about the groups. See documentation. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html — le_camerone, Apr 08 '22 at 05:41
@jezrael I tried what you suggested but got an type error `TypeError: 'DataFrameGroupBy' object does not support item assignment` and @le_camerone thanks I understand that but then how can group by station and then resample? — Sukrati gautam, Apr 08 '22 at 05:47
@jezrael I tried the solution you suggested and I got this error `ValueError: cannot reindex from a duplicate axis` — Sukrati gautam, Apr 08 '22 at 05:53
@Sukratigautam - testing and need `newDF['Annual Total Water Depth (mm)'] = newDF.groupby('Sta').resample("A(S)-SEP")['Total Water Depth (mm)'].transform('sum')` — jezrael, Apr 08 '22 at 06:00
It again gave the same error `raise ValueError("cannot reindex from a duplicate axis")` — Sukrati gautam, Apr 08 '22 at 06:08
@jezrael I think I am reading another post of this error [https://stackoverflow.com/questions/51953869/convenient-way-to-deal-with-valueerror-cannot-reindex-from-a-duplicate-axis] there is mention of `reset_index` I belive something similar might require in this case as well! but I am not sure — Sukrati gautam, Apr 08 '22 at 06:13
@Sukratigautam - is created column in `newDF` ? like `newDF['Annual Total Water Depth (mm)']` ? Because in your solution is assigned wrongly `metaDF_copy['Annual Total Water Depth (mm)']` — jezrael, Apr 08 '22 at 06:15
@jezrael The solution is assigned correctly, I need to add the column in `metaDF_copy` i will edit my question and add how the `metaDF` dataframe look for your refrence — Sukrati gautam, Apr 08 '22 at 06:18
@Sukratigautam - Sorry, but how is possible match both DataFrames? You cannot use assign `metaDF_copy['Annual Total Water Depth (mm)'] = newDF...` until same index values in both DataFrames. Here in `newDF` is `DatetimeIndex`, in `metaDF_copy` is `RanegIndex`, so it is expected failed assignmenet. — jezrael, Apr 08 '22 at 06:22
So do I need to reset the index? I want to do the annual average sum for each station from`newDF` and sum them up thereafter append it to for each station in `metaDF_copy` — Sukrati gautam, Apr 08 '22 at 06:26
@Sukratigautam - Is in both DataFrames same number of rows? `8068` ? and first row from `newDF` match first row from `metaDF_copy`, second row match second,..., `8068.row` is matching `8068.row`? — jezrael, Apr 08 '22 at 06:58
@Sukratigautam - If not, what is logic form match both DataFrames? Because if need `groupby+resample` need first merge both DataFrames, only mising logic how do it - because not same column for match, not same index for match... — jezrael, Apr 08 '22 at 07:00

jezrael · Answer 1 · 2022-04-08T06:00:22.943

There is already DatetimeIndex, so converting to datetimes is not necessary. Also newDF = newDF.groupby('Sta') return groupby object, so assign new column failed. You can chain groupby + resample and because create new column use GroupBy.transform with sum:

def compute_average_moisture_by_station( DataDF, MetaDF ):
    '''Compute the annual average total soil moisture as a depth and as VWC
    for each station.  Add as columns to a copy of the station info dataframe.
    Also compute the annual seasonal average VWC for each station and add to
    the same new dataframe.  Returned dataframe has all of the original columns 
    from the station information file, plus two columns for annual average total
    soil moisture, and four columns for annual average seasonal VWC.'''

    newDF = DataDF.copy()  # copy dataframe 
 
    # annual total water depth
    newDF['Annual Total Water Depth (mm)'] = newDF.groupby('Sta').resample("A(S)-SEP")['Total Water Depth (mm)'].transform('sum')
    return newDF

Datetime dataframe need to be grouped by station ID and then use resampling method for each station ID

1 Answers1