
Let's assume I have a DataFrame with one single column of data. For example:

import numpy as np
import pandas as pd

# Cumulative sum of random steps in {0, 1, 2}; randint replaces the deprecated random_integers
data = np.cumsum(np.random.randint(0, 3, 1000))
idx = pd.date_range('1-1-2001', freq='D', periods=1000)
df = pd.DataFrame(data, idx)

Instead of working with the complete DataFrame I want to return only those rows which differ from the previous row.

Hence, this

2001-01-20   21
2001-01-21   21
2001-01-22   21
2001-01-23   23
2001-01-24   24
2001-01-25   24

would result in this

2001-01-20   21
2001-01-23   23
2001-01-24   24

Right now I would do it this way

dff = df.diff()  # Differences between consecutive rows
dff.iloc[0] = df.iloc[0]  # Replace the leading NaN with the first row of df
df['diff'] = dff  # Add as a column in df
df = df[df['diff'] >= 1]  # Keep only rows where the value increased
df = df.iloc[:, :-1]  # Drop the helper column

This seems awfully complicated. I feel like I am missing something. Any ideas how to make it more pythonic and panda-esque?

Alex Riley
Joachim
  • What's wrong with `df.drop_duplicates()`? Also, your code doesn't run: where is `cumsum` defined? – EdChum Aug 04 '15 at 09:18
  • Okay, right: in this example `df.drop_duplicates()` would work, but with a periodic signal (a sine wave, for example) I would miss changes. – Joachim Aug 04 '15 at 09:23
  • Sorry, can you explain what you mean with sample code and desired output? It's really unclear to me. – EdChum Aug 04 '15 at 09:24
  • So you want to filter rows which differ by more than 1? – EdChum Aug 04 '15 at 09:26
  • Is your question the same as this one: http://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates ? – EdChum Aug 04 '15 at 09:30
  • `df = pd.DataFrame([.0,.0,.1,.1,.0,.2,.0], pd.date_range(start='2001-1-1', freq='D', periods=7))` and `df.drop_duplicates()` would result in a 3-row DataFrame, but there are 5 changes ... – Joachim Aug 04 '15 at 09:30

1 Answer


You could compare the previous and current rows using .shift() and then index the DataFrame using the corresponding boolean Series:

df.loc[df['a'] != df['a'].shift()]

(I've assumed that your column is called 'a').

.shift() just moves the values in a column/Series up or down by a specified number of places (the default is 1 down).
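As a rough illustration, here is the comparison applied to the seven-value sample from the comment thread. The column name 'a' is the same assumption as above, and the variable name `changed` is only for this sketch. The first row is compared against the NaN that .shift() introduces, and NaN never compares equal to anything, so the first row is always kept.

```python
import pandas as pd

# Sample values from the comment thread; the column name 'a' is an assumption.
idx = pd.date_range(start='2001-01-01', freq='D', periods=7)
df = pd.DataFrame({'a': [.0, .0, .1, .1, .0, .2, .0]}, index=idx)

# shift() moves the values down one row, so each row is compared
# with the row before it. The first row is compared with NaN and is kept.
changed = df.loc[df['a'] != df['a'].shift()]
print(changed)

# For contrast: drop_duplicates() removes repeated values anywhere in the
# column, so a value that returns later in the series is lost.
print(df.drop_duplicates())
```

With this data the shift-based filter should keep five rows, whereas drop_duplicates() keeps only three, which is the difference described in the comments.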

Alex Riley