0

I have one pandas Dataframe like below:

import pandas as pd

df = pd.DataFrame({'name': ['AAA','BBB','CCC'],
        '2017-01-06': ['3','3','4'],
        '2017-01-13': ['2','1','5'],
        '2017-01-20': ['1','3','4'],
        '2017-01-27': ['8','3','5'],
        'average_count': ['4','3','5']})
df = df.reindex_axis(['name','2017-01-06','2017-01-13','2017-01-20','2017-01-27','average_count'], axis=1)
print df

  name 2017-01-06 2017-01-13 2017-01-20 2017-01-27 average_count
0  AAA          3          2          1          8             4
1  BBB          3          1          3          3             3
2  CCC          4          5          4          5             5

I want to one output dataframe with four columns : name,date,count,average_count.

  • name column contains name from the above dataframe.
  • date column contains four different dates per single name.
  • count column contains count values for respective date.
  • average_count contains four different average count values.

If the months first week it is then average count need to calculate with (count of first week) / 1.

For 2nd week, (count of first week+count of first week) / 2.

For 3rd week, (count of first week+count of second week+count of third week) / 3.

For 4th week, (count of first week+count of second week+count of third week+count of fourth week) / 4.

In one month maximum five weeks are available (Need to handle five week scenario as well).

Edit1: Average count value calculation

This average count value is truncated like if the value <= 2.49 i.e. 2 and value >= 2.50 i.e. 3.

Output Dataframe looks like below:

name     date       count   average_count
0   AAA  2017-01-06     3       3
1   AAA  2017-01-13     2       2
3   AAA  2017-01-20     1       2
3   AAA  2017-01-27     8       4
4   BBB  2017-01-06     3       3
5   BBB  2017-01-13     1       2
6   BBB  2017-01-20     3       3
7   BBB  2017-01-27     3       3
8   CCC  2017-01-06     4       4
9   CCC  2017-01-13     5       5
10  CCC  2017-01-20     4       3
11  CCC  2017-01-27     5       5
cs95
  • 379,657
  • 97
  • 704
  • 746
ketan
  • 2,732
  • 11
  • 34
  • 80

2 Answers2

3

You can stack the values and reset_index to get the dataframe of 4 columns i.e

def round_next(x):
    if x%1 == 0.5:
        return x+0.5
    else :
        return np.round(x)

ndf = df.set_index(['name','average_count']).stack().reset_index().rename(columns = {'level_2':'date',0:'count'})
ndf['date'] = pd.to_datetime(ndf['date'])

ndf['count'] =ndf['count'].astype(int) # Since they are in string format

#Thank you @Zero. Since they are dates appearing to be taken weekly once groupby cumcount() + 1  will do that work. 
#Incase you have missing weeks then I would suggest dt.week i.e ndf.groupby('name')['date'].dt.week

ndf['average_count'] = (ndf.groupby('name')['count'].cumsum()/(ndf.groupby('name')['count'].cumcount()+1)).apply(round_next)
   name  average_count       date  count
0   AAA            3.0 2017-01-06      3
1   AAA            3.0 2017-01-13      2
2   AAA            2.0 2017-01-20      1
3   AAA            4.0 2017-01-27      8
4   BBB            3.0 2017-01-06      3
5   BBB            2.0 2017-01-13      1
6   BBB            2.0 2017-01-20      3
7   BBB            3.0 2017-01-27      3
8   CCC            4.0 2017-01-06      4
9   CCC            5.0 2017-01-13      5
10  CCC            4.0 2017-01-20      4
11  CCC            5.0 2017-01-27      5

Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
  • 1
    Do `g=ndf.groupby('name')`; `(g['count'].cumsum()/(g['count'].cumcount()+1))`, since dates are sorted anyway, and looks to be week wise. – Zero Sep 18 '17 at 10:17
  • Looks good. Add in the last column and you get my upvote. This was quite tough, I'll admit. – cs95 Sep 18 '17 at 10:23
  • (Add the rounding for the last column) - `np.round` – cs95 Sep 18 '17 at 10:25
  • I hate my internet connection – Bharath M Shetty Sep 18 '17 at 10:26
  • I'm sorry, I meant `np.round` but +1 anyway. – cs95 Sep 18 '17 at 10:26
  • @Bharathshetty- average_counts are wrong or right? For 'AAA' name, 2nd row, 3 + 2 = 5 / 2 = 2.5 you are rounding it to 2. ok. For 'AAA' name, 4th row, 3+2+1+8 / 4 = 14 / 4 = 3.5 your rounding it to 4. One time you getting previous value and another time next value? – ketan Sep 18 '17 at 10:47
  • If you want to apply ceil you can do that too. numpy provides a various rounding functions. Apply the one that suits. And its bit complicated to understand the problem. – Bharath M Shetty Sep 18 '17 at 10:49
  • @kit np.around might help you to get that. But as numpy documentation say it rounds off .5 to to nearest even value. :(. – Bharath M Shetty Sep 18 '17 at 10:54
  • @Bharathshetty- I want value <= 2.49 i.e. 2 and value >= 2.50 i.e. 3 this codition. Which pandas condition is suitable for this? – ketan Sep 18 '17 at 11:03
  • @kit I updated the answer. I think its better we define a function to overcome that. Hope it helps – Bharath M Shetty Sep 18 '17 at 11:44
2

Use df.melt, df.sort_values and df.reset_index for the first bit.

df2 = df.iloc[:, :-1].melt('name', var_name=['date'], value_name='count')\
                                        .sort_values('name').reset_index(drop=True)

# cleaning up OP's data 
df2['count'] = pd.to_numeric(df2['count'])
df2['date'] = pd.to_datetime(df2.date)

df2

   name       date  count
0   AAA 2017-01-06      3
1   AAA 2017-01-13      2
2   AAA 2017-01-20      1
3   AAA 2017-01-27      8
4   BBB 2017-01-06      3
5   BBB 2017-01-13      1
6   BBB 2017-01-20      3
7   BBB 2017-01-27      3
8   CCC 2017-01-06      4
9   CCC 2017-01-13      5
10  CCC 2017-01-20      4
11  CCC 2017-01-27      5

Now, you'll need to groupby name, get the cumsum of count and divide by the week number, which you can access by dt.week.

df2['average_count'] = np.round(df2.groupby('name')\
                   ['count'].cumsum() / df2.date.dt.week).astype(int)  
df2

   name       date  count  average_count
0   AAA 2017-01-06      3              3
1   AAA 2017-01-13      2              2
2   AAA 2017-01-20      1              2
3   AAA 2017-01-27      8              4
4   BBB 2017-01-06      3              3
5   BBB 2017-01-13      1              2
6   BBB 2017-01-20      3              2
7   BBB 2017-01-27      3              2
8   CCC 2017-01-06      4              4
9   CCC 2017-01-13      5              4
10  CCC 2017-01-20      4              4
11  CCC 2017-01-27      5              4
cs95
  • 379,657
  • 97
  • 704
  • 746
  • @Bharathshetty Typo, sorry. – cs95 Sep 18 '17 at 10:27
  • @Bharathshetty Yeah, you can see the OP provided the data as a string. – cs95 Sep 18 '17 at 10:30
  • I meant this .date.dt.week? without converting to datetime ? – Bharath M Shetty Sep 18 '17 at 10:31
  • @ cᴏʟᴅsᴘᴇᴇᴅ I was in such a hurry thinking you would answer first haha – Bharath M Shetty Sep 18 '17 at 10:38
  • @cᴏʟᴅsᴘᴇᴇᴅ- I want average_count logic like value <= 2.49 i.e. 2 and value >= 2.50 i.e. 3 this codition. Which pandas condition is suitable for this? – ketan Sep 18 '17 at 11:12
  • @kit It sounds like np.round, but you have to understand that there might be inaccuracies due to imprecision. For example, 2.5 on the console might actually be 2.4999999987 in memory which is rounded down, so you see 2 instead of 3. Those things cannot be helped. – cs95 Sep 18 '17 at 11:16
  • @cᴏʟᴅsᴘᴇᴇᴅ No Read this https://stackoverflow.com/questions/10825926/python-3-x-rounding-behavior and https://stackoverflow.com/questions/28617841/rounding-to-nearest-int-with-numpy-rint-not-consistent-for-5. numpy rounds 0.5 values to nearest even number – Bharath M Shetty Sep 18 '17 at 11:32
  • @Bharathshetty If I use the builtin `round` function it'll give me the same answer. – cs95 Sep 18 '17 at 11:35
  • They say that its IEEE standard to round off 0.5 to nearby even number – Bharath M Shetty Sep 18 '17 at 11:37