1

This is part of a larger project, but I've broken my problem down into steps, so here's the first step. Take a Pandas dataframe, like this:

index | user   time     
---------------------
 0      F       0   
 1      T       0   
 2      T       0   
 3      T       1   
 4      B       1 
 5      K       2 
 6      J       2 
 7      T       3 
 8      J       4 
 9      B       4 

For each unique user, can I extract the difference between the values in column "time," but with some conditions?

So, for example, there are two instances of user J, and the "time" difference between these two instances is 2. Can I extract the difference, 2, between these two rows? Then if that user appears again, extract the difference between that row and the previous appearance of that user in the dataframe?

TJE
  • 570
  • 1
  • 5
  • 20

2 Answers2

5

I believe need DataFrameGroupBy.diff:

df['new'] = df.groupby('user')['time'].diff()
print (df)
  user  time  new
0    F     0  NaN
1    T     0  NaN
2    T     0  0.0
3    T     1  1.0
4    B     1  NaN
5    K     2  NaN
6    J     2  NaN
7    T     3  2.0
8    J     4  2.0
9    B     4  3.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hi @jezrael. I wonder if you could answer a follow-up to this question. I'm trying to do this: if the 'new' column = NaN, that counts as a starting time column for that 'user' id. Then that user 'T' appears 3 more times including on row 2 and 3 and 7. So these 4 rows for 'T' should be grouped into a single row with user = T, a new column 'start_time' = 0 for the earliest time, another new column 'end_time' = 3 (from row 7), and a third new column 'count' = 4 to indicate that this user appears 4 times. So the row for this user would be 'user' = T, 'start_time' = 0, 'end_time' = 3, 'count' = 4. – TJE Jun 10 '18 at 04:52
  • 1
    Use `df = df.groupby('user')['time'].agg([('start_time','first'),('end_time','last'), ('count','size')]).reset_index()` – jezrael Jun 10 '18 at 05:04
  • Yes, that works. What about this then: perform the above for a 'user' only if the difference between each time is less than or equal to 2. Else -- for example, user B -- if the time difference between one appearance of the user (row 4) and the next appearance of that user (row 9) is > 2, do not group those rows together. Instead, because the difference between row 4 and row 9 for 'user' B is 2, row 4 would produce its own row as 'user' = B, 'start_time' = 1, 'end_time' = 1, 'count' = 1 -- and then row 9 would produce 'user' = B, 'start_time' = 4, 'end_time' = 4, 'count' = 1. – TJE Jun 10 '18 at 05:35
  • 1
    @TJE - I think need add new `Series` - `df = df.groupby(['user', df['new'].gt(2).cumsum()])['time'].agg([('start_time','first'),('end_time','last'), ('count','size')]).reset_index()` – jezrael Jun 10 '18 at 05:46
  • Any advice for how to better learn Pandas? Do you just study Pandas documentation or is there a book or class you recommend? – TJE Jun 10 '18 at 05:55
  • 1
    @TJE - Hard question, For me help code a lot and answering in StackOveflow. Also I think in docs are nice tutorials, especially [modern-pandas](http://pandas.pydata.org/pandas-docs/stable/tutorials.html#modern-pandas) – jezrael Jun 10 '18 at 05:58
  • 1
    Very much appreciated. I will check out modern-pandas. – TJE Jun 10 '18 at 06:06
0

I think np.where and pandas shifts does this This subtract between two consecutive Time, only if the users are same

df1 = np.where (df['users'] == df['users'].shifts(-1), df['time'] - df['time'].shifts(-1), 'NaN')
user96564
  • 1,578
  • 5
  • 24
  • 42