0

I have a pandas data frame with the next columns:

User_id  (numeric)|  Day  (DateTime)|  Data  (numeric)

What I want is to group by user_id so that I keep just those users for who I have Data in a period of 15 consecutive days.

Say, if I had data from 01-05 (dd-mm) to 16-05 (dd-mm) the rows referring to that user would be kept.

Ex.:

df1 = pd.DataFrame(['13-01-2018',1], ['14-01-2018',2],['15-01-2018',3],
        ['13-02-2018',1], ['14-02-2018',2],['15-02-2018',3])

#Apply solution to extract data of first N consecutive dates with N = 3

result.head()

    0                1  
0   13-01-2018       1
1   14-01-2018       2
2   15-01-2018       3

Don't be afraid to ask for further details! Sorry I couldn't be more specific.

Alpha Beta
  • 31
  • 1
  • 4
  • 4
    But you can be more specific, please see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). You can give your example input and the expected output. It doesn't have to be your real data. – roganjosh May 18 '18 at 19:32

1 Answers1

0

I finally managed to solved it.

At first I thought my solutions wouldn't be optimal and would take too much time, but it turned out to work pretty good.

I defined a function that, given a n_days time window traverses the data looking for a delta fitting the windows size (n_days). Thus, it looks for n_days consecutive dates.

def consecutive(x,n_days):
     result = None   
     if len(x) >= n_days:
         stop = False
         i = n_days
         while(stop == False and i < len(x)+1):
             window = lista[i-n_days:i]
             delta = (window[len(window)-1]-window[0]).n_days
             if delta == n_days-1:
                 stop = True
                 result = window        
             i=i+1
     return result

And then call it by using apply.

b=a.groupby('user_id')['day'].unique().apply(lambda x: consecutive(x,15))
df=b.loc[b.apply(lambda x: x is not None)].reset_index()

Next steps imply to transform the dataframe into a one with a row per returned date.

import itertools
import pandas as pd
import numpy as np

def melt_series(s):
    lengths = s.str.len().values
    flat = [i for i in itertools.chain.from_iterable(s.values.tolist())]
    idx = np.repeat(s.index.values, lengths)
    return pd.Series(flat, idx, name=s.name)
df=melt_series(df.day).to_frame().join(df.drop('day', 1)).reindex_axis(df.columns, 1)

And merging with the actual dataframe.

final =pd.merge(a,df[['user_id','day']],on=['user_id','day'])
Alpha Beta
  • 31
  • 1
  • 4