1

I have a time-series dataset and I need to find the difference in my field for every tick.

In excel, I can accomplish this easily by writing a formula B2=A2-A1 and copying this formula down the entire B column

example:

enter image description here

Is there a straightforward way to do this with pandas?

I have considered apply() but looks like it can only work on 1 element of a series

My other option will be to create an Numpy array out of my column A and run a loop through it. However, that seems like a roundabout way and gets even more complicated if I need to refer elements from different

The actual Excel formula that I need to translate is =IF((A3-A2)>0, (A3-A2), A3+(4294967296-A2))

KSp
  • 1,199
  • 1
  • 11
  • 29
user2251346
  • 620
  • 5
  • 14
  • Apply will work. See this answer: https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns – Cole Howard Apr 26 '18 at 04:00
  • This answer explains a way to look at multiple columns one element at a time. I want to look at the same single column but two elements at a time. Am I missing something? – user2251346 Apr 26 '18 at 06:44

3 Answers3

2

This may help. I haven't tested: for this, please provide some data (in text form) with desired output.

df[1] = np.where(df[0].diff() > 0, df[0].diff(), df[0].shift() + 2**32 - df[0])
jpp
  • 159,742
  • 34
  • 281
  • 339
1

To obtain the lagged difference of a series use

df['my_column'].diff()

You can specify the lag as well, if you wanted to do something different than 1.

See the documentation for more info


EDIT Here is a possible way to address your formula:

df = pd.DataFrame({'A': [20, 22, 25, 43, 23, 45, 67, 50, 70]})
df['result'] = df['A'].diff()
df['result'][ df['result'] < 0 ] += 2**32  # you can disregard the pandas warning

Based on your logic you can simply take the diff and if the result is less than zero then add the constant 2**32. And it should be a little more efficient than taking the diff a few times (although probably this is not a big concern).

WillMonge
  • 1,005
  • 8
  • 19
  • That's great! However my real need is a little more complicated. I am trying to count bytes sent over a tcp connection every minute by calculating the difference in the seq number. The sequence number could roll over. So my excel formula really looked like =IF((A3-A2)>0, (A3-A2), A3+(4294967296-A2)) – user2251346 Apr 26 '18 at 06:03
  • 1
    @user2251346 see my edit for addressing your following formula. I believe it should be fairly more readable and efficient than the `np.where` approach – WillMonge Apr 26 '18 at 18:02
  • Can you explain why you think it will be more efficient? – user2251346 Apr 27 '18 at 15:51
  • 1
    @user2251346: Sure, jpp's answer is a good one liner, however it performs 2 diff's and one shift, since it will basically create both arrays for the `np.where` (the when true's and the when false's). If the diff and shift operations were costly, due to for example the Series being too long, then the one-liner would be less efficient. I hope my two lines of code are also more readable, but that is more subjective. Hope that was helpful! – WillMonge Apr 27 '18 at 18:40
  • Got it. Thanks! – user2251346 Apr 27 '18 at 21:45
0

This is how I solved my problem for now using numpy

(translating Excel formula =IF((A3-A2)>0, (A3-A2), A3+(2**32-A2)))

s1=np.array(df[0])
s=s1[1:]-s1[0:-1]
s[s<0]=s[s<0]+(2**32)

I do feel there can be a more elegant solution for this natively in pandas. May be if df.diff() could take a lambda parameter?

user2251346
  • 620
  • 5
  • 14