3

I have a pandas dataframe in the following format that I would like to reduce.

Index                CITY   YEAR   MONTH   HOUR   TEMP   DP   CC   ASTR   BSTR
2018-04-10 00:00:00  AAA    2018   4       1      20     10   0    hd     pv
2018-04-10 01:00:00  AAA    2018   4       2      10     10   10   fg     pv
...
2018-04-10 00:00:00  BBB    2018   4       1      30     20   5    cv     er
2018-04-10 01:00:00  BBB    2018   4       2      20     20   5    hd     fd
...
2018-04-10 00:00:00  CCC    2018   4       1      20     30   10   cv     fd
2018-04-10 01:00:00  CCC    2018   4       2      10     10   5    df     ee
...

Then the output should be something like:

Index                 YEAR   MONTH   HOUR   TEMP   DP   CC   ASTR   BSTR
2018-04-10 00:00:00   2018   4       1      24     20   5    cv     er
2018-04-10 01:00:00   2018   4       2      14     14   6.5  hd     fd
...

Where the TEMP, DP, and CC columns are a weighted average of each CITY's value for each unique index (the index is a datetime), and ASTR and BSTR is just the value of BBB. YEAR, MONTH, and HOUR should stay the same. The weights for the above example could be given by the following dictionary: weights = {"AAA" : 0.3, "BBB" : 0.4, "CCC" : 0.3}

My actual data has over 4 years worth of hourly data for multiple cities, so I would like some help coming up with an efficient solution, thanks!

Davie88
  • 93
  • 6
  • 1
    Welcome to Stack Overflow! Could you please post a [Minimal, Complete, and Verifiable](https://stackoverflow.com/help/mcve) example of your attempt and say specifically where you're stuck. – brass monkey Aug 21 '18 at 15:59
  • Possible duplicate: https://stackoverflow.com/questions/31521027/groupby-weighted-average-and-sum-in-pandas-dataframe – DYZ Aug 21 '18 at 16:09
  • Hi, I'm pretty new to using pandas, so I'm not exactly sure where to start with something like this. My first thought would be to do something more "manual" by breaking up the dataframe into number of cities dateframes, extracting each of the columns, putting them into a matrix, multiplying each of the columns of the matrix by the corresponding weight, and lastly adding each value in a row together. I'm wondering if there is a better way to do this? – Davie88 Aug 21 '18 at 16:10

1 Answers1

1

Given the following dataframe:

df = pd.DataFrame([['2018-04-10 00:00:00',  'AAA',    2018,   4,       1,      20,     10,   0,    'hd',     'pv'],
                ['2018-04-10 01:00:00',  'AAA',    2018,   4,       2,      10,     10,   10,   'fg',     'pv'],
                ['2018-04-10 00:00:00',  'BBB',    2018,   4,       1,      30,     20,   5,    'cv',     'er'],
                ['2018-04-10 01:00:00',  'BBB',    2018,   4,       2,      20,     20,   5,    'hd',     'fd'],
                ['2018-04-10 00:00:00',  'CCC',    2018,   4,       1,      20,     30,   10,   'cv',     'fd'],
                ['2018-04-10 01:00:00',  'CCC',    2018,   4,       2,      10,     10,   5,    'df',     'ee']],
                columns = ['Index','CITY','YEAR','MONTH','HOUR','TEMP','DP','CC','ASTR','BSTR'])

And weights:

weights = {"AAA" : 0.3, "BBB" : 0.4, "CCC" : 0.3}

You can do the following:

w_a = lambda x: np.average(x, weights=list(weights.values()))
new = df.groupby('Index').agg({'TEMP': w_a, 'DP': w_a, 'CC': w_a})
new = new.join(df[df['CITY']=='BBB'][['Index','YEAR','MONTH','HOUR','ASTR','BSTR']].set_index('Index'), on='Index')

Which gives:

                     TEMP  DP   CC  YEAR  MONTH  HOUR ASTR BSTR
Index                                                          
2018-04-10 00:00:00    24  20  5.0  2018      4     1   cv   er
2018-04-10 01:00:00    14  14  6.5  2018      4     2   hd   fd

Note that this solution assumes that the keys in your weights dictionary are provided in the same order that the cities appear in the CITY column of your original df (i.e. sorted alphabetically).

rahlf23
  • 8,869
  • 4
  • 24
  • 54
  • 1
    Thanks for the solution, after adding a column called Index (I guess I was unclear in my post that Index meant the index of the dataframe) and then sorting it by City and the Index, this solution worked perfectly! The code I used to the above two steps I mentioned: df['Index'] = df.index df = df.sort_values(by = ['CITY', 'Index' ], axis = 0) – Davie88 Aug 21 '18 at 17:46