1

I have a dataframe consisting of columns Name (names), value (the week in 2016 that an event occurred), binary (an indication that the event occurred, "1"), for example:

df 
    Name      value      binary
    apple     2016 W16   1
    orange    2016 W17   1
    melon     2016 W20   1
    berry     2016 W17   1
    lime      2016 W19   1

I am interested in adding rows to this dataframe so that each Name (apple, orange, etc.) has an item in the value column for the weeks before an event occurred. Again, the week that the event occurred is stated as the value column in df. The time period of interest is the weeks between 2016 W16 and 2016 W19, e.g.:

start_end_weeks
     week
     2016 W16
     2016 W17
     2016 W18
     2016 W19

My problem is that I need to in-fill the rows with weeks between 2016 W16 and 2016 W19 that are not represented in df. Here's what I mean:

df_result 
    Name      value      binary
    apple     2016 W16   1
    orange    2016 W16   0
    orange    2016 W17   1
    melon     2016 W16   0
    melon     2016 W17   0
    melon     2016 W18   0
    melon     2016 W19   0
    melon     2016 W20   1
    berry     2016 W17   1
    lime      2016 W19   1
    ...

But since value isn't a traditional date time object, I'm not sure how to get python to recognize that 2016 W16 occurs before 2016 W17, and then to only in-fill the values before the week stated in value in df.

I am having trouble with where to start, so if someone could help me convert value to a date time object that would be great, and I can go from there. Any other insight appreciated.

I found this stack overflow question which is all I have so far: Match rows in one Pandas dataframe to another based on three columns.

JAG2024
  • 3,987
  • 7
  • 29
  • 58
  • Does this help you get started? https://stackoverflow.com/questions/17087314/get-date-from-week-number – wkzhu Dec 07 '17 at 00:15
  • 1
    Given that the string structure of your week value is sorted in order of higher significance, any lexicographical comparison will work (i.e. `"2016 W15" < "2016 W28"` evaluates to `True`). I'm not sure I understand what your exact problem is, tho. I'll have to re-read your question a few times... – zwer Dec 07 '17 at 00:19

2 Answers2

2

You can get a valid datetime date as follows:

from datetime import datetime

df['year'] = df.value.str.split(' W').str[0]
df['week'] = df.value.str.split(' W').str[1]
df['date'] = df.apply(lambda x: datetime.strptime(
    x.year + '-' + x.week + '-0', '%Y-%W-%w'), axis=1)

And the result will be:

    Name    value  binary   year    week    date
0   apple   2016 W16    1   2016    16      2016-04-24
1   orange  2016 W17    1   2016    17      2016-05-01
2   melon   2016 W20    1   2016    20      2016-05-22
3   berry   2016 W17    1   2016    17      2016-05-01
4   lime    2016 W19    1   2016    19      2016-05-15
wkzhu
  • 1,616
  • 13
  • 23
1

If you need convert to date

import datetime
df.value.apply(lambda x : datetime.datetime.strptime(x + '-0', "%Y W%W-%w"))
Out[133]: 
0   2016-04-24
1   2016-05-01
2   2016-05-22
3   2016-05-01
4   2016-05-15
Name: value, dtype: datetime64[ns]

without change the date I am going to use reindex with groupby

df1=df.groupby('Name')['binary','value'].apply(lambda x : x.set_index('value').\
           reindex(['2016 W16','2016 W17','2016 W18','2016 W19','2016 W20']))

df1.loc[df1.groupby(level=0).binary.bfill().dropna().index].fillna(0).reset_index()
Out[65]: 
      Name     value  binary
0    apple  2016 W16     1.0
1    berry  2016 W16     0.0
2    berry  2016 W17     1.0
3     lime  2016 W16     0.0
4     lime  2016 W17     0.0
5     lime  2016 W18     0.0
6     lime  2016 W19     1.0
7    melon  2016 W16     0.0
8    melon  2016 W17     0.0
9    melon  2016 W18     0.0
10   melon  2016 W19     0.0
11   melon  2016 W20     1.0
12  orange  2016 W16     0.0
13  orange  2016 W17     1.0
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I'm getting the error `ValueError: cannot reindex from a duplicate axis`. Do you know what that's from? – JAG2024 Dec 07 '17 at 05:06
  • @JAG2024 check your `start_end_weeks`, does it have duplicate ? – BENY Dec 07 '17 at 05:07
  • There are only duplicate values in the `value` column of `df`. – JAG2024 Dec 07 '17 at 05:13
  • The actual values in `reindex(['2016...` not linear. Like they jump from `2016 W 19` to `2016 W 25` if that makes a difference. @Wen – JAG2024 Dec 07 '17 at 05:15
  • @JAG2024 if my second part does work for you , you can using the first part convert it to the datetime – BENY Dec 07 '17 at 05:17
  • Thanks @Wen. I got it to mostly work. I will post a new question for the part I can not get to work. – JAG2024 Dec 07 '17 at 05:33
  • Perhaps you can help with the next version of this problem: https://stackoverflow.com/questions/47699464/reindex-pandas-dataframe-based-on-uneven-dates-and-then-groupby-and-blank-fill-c @Wen – JAG2024 Dec 07 '17 at 16:25