2

I have a puzzle. This is easy in excel. But, in pandas, with the dataframe df:

   |  EventID  |  PictureID  |  Date
0  |  1        |  A          |  2010-01-01
1  |  2        |  A          |  2010-02-01
2  |  3        |  A          |  2010-02-15
3  |  4        |  B          |  2010-01-01
4  |  5        |  C          |  2010-02-01
5  |  6        |  C          |  2010-02-15

Is there a way to add a new column which counts the number of times the same PictureID has a recorded event in the previous 6 months? In other words, the number of rows in the dataframe with the same PictureID as a given row and with a Date within the six months prior to the date of the given row.

df['PastSix'] = ???

So the output looks something like:

   |  EventID  |  PictureID  |  Date        |  PastSix
0  |  1        |  A          |  2010-01-01  |  0
1  |  2        |  A          |  2010-02-01  |  1
2  |  3        |  A          |  2010-02-15  |  2
3  |  4        |  B          |  2010-01-01  |  0
4  |  5        |  C          |  2010-02-01  |  0
5  |  6        |  C          |  2010-02-15  |  1
user1893148
  • 1,990
  • 3
  • 24
  • 34
  • Do you have occurrences of `PictureID` that are older than six months? – Phillip Cloud Sep 15 '13 at 01:22
  • In reality, I'm working with 5 years of data, so yes. I mean, for each row, to count the number of events/rows with the same PictureID, with a date that falls within the six month period previous to the current row's date. – user1893148 Sep 15 '13 at 01:27
  • Because month is not equal length, what is the define of previous 6 months? 180 days? – HYRY Sep 15 '13 at 12:07

1 Answers1

2

I don't know how to define 6 months, so I use prev 183 days instead, the basic idea is use asof() method:

import pandas as pd
import numpy as np
import io

txt = u"""EventID  |  PictureID  |  Date
0        |  A          |  2009-07-01
1        |  A          |  2010-01-01
2        |  A          |  2010-02-01
3        |  A          |  2010-02-15
4        |  B          |  2010-01-01
5        |  C          |  2010-02-01
6        |  C          |  2010-02-15
7        |  A          |  2010-08-01
"""

df = pd.read_csv(io.StringIO(txt), sep=r"\s*\|\s*", parse_dates=["Date"])

def f(df):
    count = pd.Series(np.arange(1, len(df)+1), index=df["Date"])
    prev1day = count.index.shift(-1, freq="D")
    prev6month = count.index.shift(-183, freq="D")
    result = count.asof(prev1day).fillna(0).values - count.asof(prev6month).fillna(0).values
    return pd.Series(result, df.index)

df["PastSix"] = df.groupby("PictureID").apply(f)
print df

output:

   EventID PictureID                Date  PastSix
0        0         A 2009-07-01 00:00:00        0
1        1         A 2010-01-01 00:00:00        0
2        2         A 2010-02-01 00:00:00        1
3        3         A 2010-02-15 00:00:00        2
4        4         B 2010-01-01 00:00:00        0
5        5         C 2010-02-01 00:00:00        0
6        6         C 2010-02-15 00:00:00        1
7        7         A 2010-08-01 00:00:00        2
HYRY
  • 94,853
  • 25
  • 187
  • 187
  • @HYRY- Can you please give me a hint for this related question to this question https://stackoverflow.com/questions/46196090/how-to-keep-track-of-previous-date-record-column-in-pandas-dataframe – ketan Sep 13 '17 at 12:51