
I need to analyse the final 60 days up to the last date each user was active.

My dataframe contains the dates ('CalendarDate') on which each user ('DataSourceId') was active, along with an activity value ('Activity') - one row per date. I have pivoted the dataframe by DataSourceId so that the dates are in the columns, and I have grabbed the last day each user was active as 'max_date':

df['max_date'] = df.groupby('DataSourceId')['CalendarDate'].transform('max')

The data look something like this, although 'CalendarDate' and 'max_date' are actually in datetime64[ns] format (the Activity values are float64):

ID    Jan1    Jan2    Jan3    Jan4    Jan5...  max_date
1               8              15      10        Jan5
2       2              13                        Jan3
3       6      11                                Jan2
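For reference, a minimal sketch of how the wide layout above can be built from the long format (dates and values simplified here; in my real data the dates are datetime64[ns]):

```python
import pandas as pd

# long format: one row per (DataSourceId, CalendarDate) pair
df_long = pd.DataFrame({
    'DataSourceId': [1, 1, 1, 2, 2, 3, 3],
    'CalendarDate': pd.to_datetime(['2018-01-02', '2018-01-04', '2018-01-05',
                                    '2018-01-01', '2018-01-03',
                                    '2018-01-01', '2018-01-02']),
    'Activity': [8.0, 15.0, 10.0, 2.0, 13.0, 6.0, 11.0],
})

# last active day per user, broadcast back to every row
df_long['max_date'] = df_long.groupby('DataSourceId')['CalendarDate'].transform('max')

# pivot to the wide layout: users as rows, calendar dates as columns
data_wide = df_long.pivot(index='DataSourceId',
                          columns='CalendarDate',
                          values='Activity')
```

Each row of data_wide then has NaN on the dates the user was not active, matching the table above.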

Now, I want to realign the columns from calendar dates to "last x days" for each row. Like this:

ID    Last    Last-1    Last-2    Last-3  ...  Last-x
1      10       15                   8  
2      13                  2           
3      11        6

I have not been able to find any examples of similar transformations and am really stranded here.

EDITED: After adapting jezrael's solution I noticed it failed on occasion.

I think the problem is related to this code in jezrael's solution: r = data_wide.bfill().isna().sum(axis=1).values

Example: This data fails (and r = [0 3]):

CalendarDate                         2017-07-02 2017-07-03 2017-07-06 2017-07-07 2017-07-08 2017-07-09
DataSourceId                                                                                          
1000648                                     NaN     188.37     178.37        NaN     128.37      18.37
1004507                                   51.19        NaN      52.19      53.19        NaN        NaN

Specifically the realigned dataframe looks like this:

              Last-0  Last-1  Last-2  Last-3  Last-4  Last-5
DataSourceId                                                
1000648        18.37  128.37     NaN  178.37  188.37     NaN
1004507        52.19     NaN   51.19     NaN     NaN   53.19

If I change the order in the dataframe by changing ID 1000648 to 1100648 (so that it becomes the second row) this is the result (r = [0 2]):

              Last-0  Last-1  Last-2  Last-3  Last-4  Last-5
DataSourceId                                                
1004507          NaN     NaN   53.19   52.19     NaN   51.19
1100648          NaN  178.37  188.37     NaN   18.37  128.37

3 Answers


Please try the below code and let me know if this helps.

df = df.iloc[:,list(range(len(df.columns)-1,0,-1))]
print(df)
Adrish
  • Thanks Adrish, but this seems to only reverse the calendardate order globally - not taking into account last active date for each user. – Mads Stenbjerre Sep 13 '18 at 12:02

You can use this code: first find the count of trailing null values in each row, then shift each series by that count - it will work.

df1 = df[df.columns.difference(['ID'])]  # all columns except ID
# per row: count trailing NaNs, shift right by that count, then reverse
df1 = df1.apply(lambda x: x.shift(x[::-1].isnull().cumprod().sum())[::-1], axis=1)
df1.columns = ['Last-'+str(i) for i in range(df1.columns.shape[0])]
df1['ID'] = df['ID']

Out:

   Last-0   Last-1  Last-2  Last-3  Last-4  ID
0   10.0    15.0    NaN     8.0     NaN     1
1   13.0    NaN     2.0     NaN     NaN     2
2   11.0    6.0     NaN     NaN     NaN     3
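To unpack the lambda above: x[::-1].isnull().cumprod() is 1 only for the run of NaNs at the end of the row, so its sum is the trailing-NaN count; shifting by that count and then reversing pulls the last value into the first column. Step by step on the row for ID 2 (a small sketch):

```python
import numpy as np
import pandas as pd

row = pd.Series([2.0, np.nan, 13.0, np.nan, np.nan])

# cumprod over the reversed null-mask stays 1 only while the NaNs
# run from the end, so the sum is the trailing-NaN count
n_trailing = row[::-1].isnull().cumprod().sum()
print(n_trailing)  # 2

# shift right by that count, then reverse: the last value lands first
realigned = row.shift(n_trailing)[::-1].reset_index(drop=True)
print(realigned.tolist())  # [13.0, nan, 2.0, nan, nan]
```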
Naga kiran
  • I get a "keyerror" on the last line. Any suggestions? – Mads Stenbjerre Sep 13 '18 at 12:03
  • last line is to assign the ID column to the second dataframe; you might have assigned the first dataframe to "df" (which leads to deletion of the column). Please check the execution of the first line and assign it to df1 (or check for any column name mismatch). – Naga kiran Sep 13 '18 at 12:12
  • I get that. Here's my code (my original df is called "data_wide"): `df = data_wide[data_wide.columns.difference(['DataSourceId'])]` `df = df.apply(lambda x:x.shift(x[::-1].isnull().cumprod().sum())[::-1],axis=1)` `df.columns = ['Last-'+str(i) for i in range(df.columns.shape[0])]` `df['DataSourceId'] = data_wide['DataSourceId']` – Mads Stenbjerre Sep 13 '18 at 12:21

If performance is important, use a slightly modified numpy solution:

#select all columns except the first (ID) and the last (max_date)
A = df.iloc[:, 1:-1].values
print (A)
[[nan  8. nan 15. 10.]
 [ 2. nan 13. nan nan]
 [ 6. 11. nan nan nan]]

#count trailing NaN values per row
r = df.bfill(axis=1).isna().sum(axis=1).values
#older pandas versions
#r = df.bfill(axis=1).isnull().sum(axis=1).values
#boost solution by https://stackoverflow.com/a/30428192
#r = A.shape[1] - (~np.isnan(A)).cumsum(axis=1).argmax(axis=1) - 1
print (r)
[0 2 3]

rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]

# Use always a negative shift, so that column_indices are valid.
# (could also use a modulo operation)
r[r < 0] += A.shape[1]
column_indices = np.flip(column_indices - r[:,np.newaxis], axis=1)
print (column_indices)
[[ 4  3  2  1  0]
 [ 2  1  0 -1 -2]
 [ 1  0 -1 -2 -3]]

result = A[rows, column_indices]
#https://stackoverflow.com/a/51613442
#result = strided_indexing_roll(A,r)
print (result)
[[10. 15. nan  8. nan]
 [13. nan  2. nan nan]
 [11.  6. nan nan nan]]

c = [f'Last-{x}' for x in np.arange(result.shape[1])]
df1 = pd.DataFrame(result, columns=c)
df1.insert(0, 'ID', df['ID'])
print (df1)
   ID  Last-0  Last-1  Last-2  Last-3  Last-4
0   1    10.0    15.0     NaN     8.0     NaN
1   2    13.0     NaN     2.0     NaN     NaN
2   3    11.0     6.0     NaN     NaN     NaN
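As a sanity check on the problematic data from the question's edit: the per-row backfill (bfill(axis=1)) counts each row's trailing NaNs correctly, while a plain df.bfill() backfills down the columns and produces the order-dependent r the question describes:

```python
import numpy as np
import pandas as pd

data_wide = pd.DataFrame(
    [[np.nan, 188.37, 178.37, np.nan, 128.37, 18.37],
     [51.19, np.nan, 52.19, 53.19, np.nan, np.nan]],
    index=pd.Index([1000648, 1004507], name='DataSourceId'),
    columns=['2017-07-02', '2017-07-03', '2017-07-06',
             '2017-07-07', '2017-07-08', '2017-07-09'])

# per-row backfill: r is the trailing-NaN count of each row
r = data_wide.bfill(axis=1).isna().sum(axis=1).values
print(r)  # [0 2]

# column-wise backfill (no axis) mixes rows and depends on row order
r_bad = data_wide.bfill().isna().sum(axis=1).values
print(r_bad)  # [0 3]
```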

EDIT:

If ID is the index then the solution changes a bit - don't remove the first column (only the last one, via .iloc[:, :-1]) and at the end use only the DataFrame constructor:

A = df.iloc[:, :-1].values
print (A)
[[nan  8. nan 15. 10.]
 [ 2. nan 13. nan nan]
 [ 6. 11. nan nan nan]]

r = df.bfill(axis=1).isna().sum(axis=1).values
print (r)
[0 2 3]

rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]

# Use always a negative shift, so that column_indices are valid.
# (could also use a modulo operation)
r[r < 0] += A.shape[1]
column_indices = np.flip(column_indices - r[:,np.newaxis], axis=1)
print (column_indices)
[[ 4  3  2  1  0]
 [ 2  1  0 -1 -2]
 [ 1  0 -1 -2 -3]]

result = A[rows, column_indices]
print (result)
[[10. 15. nan  8. nan]
 [13. nan  2. nan nan]
 [11.  6. nan nan nan]]

c = [f'Last-{x}' for x in np.arange(result.shape[1])]
#use DataFrame constructor
df1 = pd.DataFrame(result, columns=c, index=df.index)
print (df1)
    Last-0  Last-1  Last-2  Last-3  Last-4
ID                                        
1     10.0    15.0     NaN     8.0     NaN
2     13.0     NaN     2.0     NaN     NaN
3     11.0     6.0     NaN     NaN     NaN
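For reuse, the steps above can be collected into a small helper (the function name realign_last_days is mine, not part of the answer):

```python
import numpy as np
import pandas as pd

def realign_last_days(data_wide):
    """Roll each row left so its last non-NaN value lands in 'Last-0'."""
    A = data_wide.values
    # trailing-NaN count per row, via per-row backfill
    r = data_wide.bfill(axis=1).isna().sum(axis=1).values
    rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
    # negative indices wrap around, which is safe here because the
    # cells that wrap in held only trailing NaNs
    column_indices = np.flip(column_indices - r[:, np.newaxis], axis=1)
    result = A[rows, column_indices]
    cols = [f'Last-{x}' for x in range(A.shape[1])]
    return pd.DataFrame(result, columns=cols, index=data_wide.index)
```

Applied to the question's example (with ID as index and no max_date column), this reproduces the Last-0…Last-4 frame above.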
jezrael
  • @MadsStenbjerre - Do you mean `df1.insert(0, 'ID', df['ID'])`? – jezrael Sep 13 '18 at 12:21
  • Yes. Sorry. See my similar comment to Naga above. – Mads Stenbjerre Sep 13 '18 at 12:25
  • If you're still there: After running the solution for a bit I have noticed that if there are varying numbers of columns with data the method does not work (unless the row with the lowest ID has the max number of columns with data). I think the issue is this line `r = data_wide.bfill().isna().sum(axis=1).values`. Do you have any suggestion how to alter this code to reflect `[ 0 max]` rather than the dimensions of the first row? – Mads Stenbjerre Sep 20 '18 at 09:46
  • @MadsStenbjerre - Can you create sample data with problematic rows? – jezrael Sep 20 '18 at 10:23
  • Not sure I am using the right terms, but I think the resulting vector r is ordered by counts of NA values [0 1 2 2 2 3 3 4 etc ] and not aligned to the df index (IDs). So the output is only correct if the df IDs happen to be ordered from least NA values to most - otherwise it is misaligned. – Mads Stenbjerre Sep 20 '18 at 11:35
  • @MadsStenbjerre - Sorry, I dont understand your data in edited answer. Why it depends of ID? Can you explain more? Why is second row order different? – jezrael Sep 20 '18 at 11:40
  • Not sure. But there is something going on dependent on row order (see new example data and result - closer to my real data - in original question). The result for `r` changes but first value is always 0. BTW I no longer have a last column with `max_date` so I changed A to `A = data_wide.iloc[:, :].values` – Mads Stenbjerre Sep 20 '18 at 12:01
  • 1
    @MadsStenbjerre - You are absolutely rigrt, there was copy error. Neech change `bfill` to `bfill(axis=1)` because need replace backfiling per columns. Sorry ;( – jezrael Sep 20 '18 at 12:34
  • 1
    Perfect. Works now. Thanks. – Mads Stenbjerre Sep 20 '18 at 12:43