
My data contains 700,000 rows.

I have tried using a for-loop, which took 30 hours. Please let me know a faster way to get the result.

I am attaching the sample data set. Each row is unique with respect to the columns [period, dimname, facility, serv, cpt]. I want to find the rolling-months average of column (gcr) for each combination of columns [dimname, facility, cpt], ordered by period. (The last column (avg6month) contains the desired result.) For better understanding, I have attached the filtered result set in JPEG format.

# Sort by month, then loop over every (facility, cpt) pair under dimname == 'fac_cpt'
data.sort_values(by='period', inplace=True, ascending=True)
for fa in data.loc[data.dimname == 'fac_cpt'].facility.dropna().unique():
    for pr in data.loc[(data.dimname == 'fac_cpt') & (data.facility == fa)].cpt.dropna().unique():
        mask = (data.dimname == 'fac_cpt') & (data.facility == fa) & (data.cpt == pr)
        data.loc[mask, 'avg6monthgcr'] = round(data.loc[mask].gcr.rolling(6, min_periods=1).mean(), 4)
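
For reference, the same per-group rolling mean can be computed without explicit loops. A minimal sketch using groupby().transform() (assuming the column names shown above; this is an illustration, not the answer's exact code):

# Sketch only: vectorized per-group rolling 6-period mean of gcr.
# transform() returns a result aligned with the original index, so it can be
# assigned straight back as a new column.
mask = data.dimname == 'fac_cpt'
data.loc[mask, 'avg6monthgcr'] = (
    data.loc[mask]
    .sort_values('period')
    .groupby(['facility', 'cpt'])['gcr']
    .transform(lambda s: s.rolling(6, min_periods=1).mean())
    .round(4)
)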

Sample_Data:

(image: sample_data)

Samples_Results:

(images: result1, result2, result3, result4)

1 Answer


I managed to get what you need with vectorized operations, so it should be close to the fastest way possible.

import pandas as pd

data = pd.DataFrame({
    "period": [
        '3/1/2017', '3/1/2017', '3/1/2017', '3/1/2017', '3/1/2017', '3/1/2017', '3/1/2017',
        '4/1/2017', '4/1/2017', '4/1/2017', '4/1/2017', '4/1/2017', '4/1/2017', '4/1/2017'
    ],
    "dimname": [
        'fac_cpt', 'fac_cpt', 'fac_cpt', 'fac_cpt', 'fac_cpt', 'ser_cpt', 'ser_cpt',
        'fac_cpt', 'fac_cpt', 'fac_cpt', 'fac_cpt', 'fac_cpt', 'ser_cpt', 'ser_cpt'
    ],
    "facility": ['a', 'a', 'a', 'b', 'b', None, None, 'a', 'a', 'a', 'b', 'b', None, None],
    "cpt": ['p1', 'p2', 'p3', 'p1', 'p2', 'p1', 'p2', 'p1', 'p2', 'p3', 'p1', 'p2', 'p1', 'p1'],
    "ser": [None, None, None, None, None, 'c', 'c', None, None, None, None, None, 'd', 'd'],
    "gcr": [1, 10, 2, 3, 8, 12, 4, 4, 10, 2, 4, 11, 6, 2]
})
# Parse the period column as datetimes
data.period = pd.to_datetime(data.period)

# Per (dimname, facility, cpt) group, rolling mean over the last 6 rows (months) of gcr, ordered by period
data[["period", "dimname", "facility", "cpt", "gcr"]].groupby(
    ['dimname', 'facility', 'cpt']
).rolling(6, min_periods=1, on='period').mean().reset_index(
    level=3, drop=True
).reset_index().rename(columns={'gcr': 'avg6monthgcr'})
# Output:
  | dimname | facility | cpt | avg6monthgcr | period
----------------------------------------------------
0 | fac_cpt |        a |  p1 |          1.0 | 2017-03-01
1 | fac_cpt |        a |  p1 |          2.5 | 2017-04-01
2 | fac_cpt |        a |  p2 |         10.0 | 2017-03-01
3 | fac_cpt |        a |  p2 |         10.0 | 2017-04-01
4 | fac_cpt |        a |  p3 |          2.0 | 2017-03-01
5 | fac_cpt |        a |  p3 |          2.0 | 2017-04-01
6 | fac_cpt |        b |  p1 |          3.0 | 2017-03-01
7 | fac_cpt |        b |  p1 |          3.5 | 2017-04-01
8 | fac_cpt |        b |  p2 |          8.0 | 2017-03-01
9 | fac_cpt |        b |  p2 |          9.5 | 2017-04-01

I timed it on your sample data, but there was only a marginal gain, probably because initialization took most of the time rather than the calculation, so you should give it a try on the full dataset.

# your method:
27.6 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# my method:
24.9 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
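
These figures look like IPython %timeit output; a comparable measurement with the standard library (compute_rolling is a hypothetical wrapper around the expression above, used only for illustration) could look like this:

import timeit

# Hypothetical helper wrapping the vectorized expression from above
def compute_rolling(df):
    return df[["period", "dimname", "facility", "cpt", "gcr"]].groupby(
        ['dimname', 'facility', 'cpt']
    ).rolling(6, min_periods=1, on='period').mean()

# Mean time per run over 10 runs, in seconds
print(timeit.timeit(lambda: compute_rolling(data), number=10) / 10)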

If you need to merge it back into the original dataframe, you should modify the code to keep the original index, because merging on the index is faster; it would look like this:

avg_data = (
    data[["period", "dimname", "facility", "cpt", "gcr"]]
    .groupby(['dimname', 'facility', 'cpt'])
    .rolling(6, min_periods=1, on='period').mean()
    .reset_index(level=3)          # original row index becomes column 'level_3'
    .reset_index(drop=True)
    .set_index('level_3')
    .rename(columns={'gcr': 'avg6monthgcr'})
    .drop('period', axis=1)
)

data.merge(avg_data, left_index=True, right_index=True, how='left')
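
As a usage note (not part of the original answer): the merge returns a new DataFrame rather than modifying data in place, so the result has to be kept explicitly, for example:

# 'merged' is just an illustrative name; data itself is left unchanged by merge()
merged = data.merge(avg_data, left_index=True, right_index=True, how='left')
print(merged[['period', 'dimname', 'facility', 'cpt', 'gcr', 'avg6monthgcr']].head())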
Hodossy Szabolcs
  • Thank you for the response. Sorry, it is not giving any results. Please reread my problem; I have rewritten my problem statement. – Raghavendra S Jan 15 '19 at 02:36
  • I added the results I get when I run the above code, and it gives exactly what you need. Could you explain why it is not giving results for you? Is there an exception? – Hodossy Szabolcs Jan 15 '19 at 07:26
  • I have columns [period, dimname, facility, ser, cpt, gcr] and I need a column [avg6month], the 6-month average of gcr. To get avg6month I used a for-loop, which takes a huge amount of time. Now I need another method that can do it much faster. Thank you – Raghavendra S Jan 16 '19 at 04:46
  • I have not bothered with the rename, but I have the `gcr` column containing the averages. I have updated the answer. – Hodossy Szabolcs Jan 16 '19 at 09:13
  • Thank you so much, I got it. Could you please help me append the same (the 6-month average) to the dataframe as a new column? – Raghavendra S Jan 18 '19 at 13:51
  • I have updated my answer, please mark it as an answer if it solved your issue. – Hodossy Szabolcs Jan 21 '19 at 08:25
  • Thank you, it is working great. One more question: if I want to get the average gcr for (dimname == 'ser_cpt'), how do I do that? If I follow the same code, it will be difficult to overwrite one column over another. If that is the only way, please show the easiest way to merge one column over the other. I have updated my results in Sample_Data, Result1, Result2, Result3, Result4 to show what I mean. – Raghavendra S Jan 21 '19 at 12:26
  • Actually it should work for all `dimname` values without any modification. – Hodossy Szabolcs Jan 21 '19 at 14:09
  • No, it is not working for all dimensions at once: it gives null results if I run it with both dimensions (groupby(['dimname', 'facility', 'ser', 'cpt'])), but gives good results if I run each dimension separately (1. groupby(['dimname', 'facility', 'cpt']), 2. groupby(['dimname', 'ser', 'cpt'])). If I do them individually, it is very difficult to merge the two columns afterwards. Please let me know a solution for this. – Raghavendra S Jan 22 '19 at 05:50
  • That is because you have `NaN`s in your data, namely where `facility` is not given, and those rows are excluded by `groupby`. See [groupby columns with NaN (missing) values](https://stackoverflow.com/questions/18429491/groupby-columns-with-nan-missing-values). You can replace those values with a neutral one to make this work; see the sketch below. – Hodossy Szabolcs Jan 22 '19 at 10:22
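
A minimal sketch of that workaround, assuming the sample columns above (it groups on both facility and ser after filling the missing keys with a placeholder; this is an illustration built on the answer's code, not the answer itself):

# Sketch only: fill NaN group keys so groupby does not drop those rows
keys = ['dimname', 'facility', 'ser', 'cpt']
filled = data.copy()
filled[['facility', 'ser']] = filled[['facility', 'ser']].fillna('none')

avg_all = (
    filled[keys + ['period', 'gcr']]
    .groupby(keys)
    .rolling(6, min_periods=1, on='period').mean()
    .reset_index(level=4)          # original row index becomes column 'level_4'
    .reset_index(drop=True)
    .set_index('level_4')
    .rename(columns={'gcr': 'avg6monthgcr'})
    .drop('period', axis=1)
)
data.merge(avg_all, left_index=True, right_index=True, how='left')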