
This post has been really helpful for getting the basics of what I want to do; however, I'm stuck on how to get to the finish line.

I have a large dataframe (approx. 10k rows), with the first few rows looking like what I'll call df_a:

zone | value
0    | 12
1    | 12
2    | 99
3    | 12
0    | 12
1    | 12
2    | 12
3    | 99

I am looking to drop consecutive duplicates within 'value', but conditioned on 'zone'. For example, in the snippet above I would want the second '12' to be dropped for zone = 0 and for zone = 1, so that I end up with:

zone | value
0    | 12
1    | 12
2    | 99
3    | 12
2    | 12
3    | 99

My initial idea was to loop across a list of zones, automatically create a new variable for each zone based on the zone name, and then run my drop-duplicates code (based on this answer). However, this doesn't work:

data_category_range = df_a['zone'].unique().tolist()

# store the per-zone sub-dataframes in a dict (a list can't take string keys)
zone_dfs = {}
for i, value in enumerate(data_category_range):
    zone_dfs['zone_{}'.format(i)] = df_a[df_a['zone'] == value]

# de-duplicate: keep rows where 'zone' or 'value' differs from the previous row
cols = ["zone", "value"]
de_dup = df_a[cols].loc[(df_a[cols].shift() != df_a[cols]).any(axis=1)]

(This loop sits inside another loop that iterates across dataframes with different 'zone' values, so the variable names need to be dynamic - I'm open to alternatives, as I understand this isn't best practice; one such alternative is sketched below.)
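
For reference, a minimal sketch of one such alternative, assuming the example frame above: keep the per-zone sub-dataframes in a dict keyed by the zone label (via `groupby`) instead of generating variable names at runtime. `zone_frames` is just an illustrative name, not something from the thread.

import pandas as pd

# the example frame from the question
df_a = pd.DataFrame(
    {"zone": [0, 1, 2, 3, 0, 1, 2, 3], "value": [12, 12, 99, 12, 12, 12, 12, 99]}
)

# one sub-dataframe per zone, keyed by the zone label itself
zone_frames = {zone: frame for zone, frame in df_a.groupby("zone")}

print(zone_frames[1])

A dict like this scales to any number of zones without needing dynamically named variables.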

Thanks!

danwri

1 Answer


You can use `drop_duplicates`:

import pandas as pd

data = pd.DataFrame(
    {"zone": [0, 1, 2, 3, 0, 1, 2, 3], "value": [12, 12, 99, 12, 12, 12, 12, 99]}
)
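# keep only the first occurrence of each (zone, value) pair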
data.drop_duplicates(["zone", "value"])

This will give you:

|    |   zone |   value |
|---:|-------:|--------:|
|  0 |      0 |      12 |
|  1 |      1 |      12 |
|  2 |      2 |      99 |
|  3 |      3 |      12 |
|  6 |      2 |      12 |
|  7 |      3 |      99 |
ignoring_gravity

  • Apologies, a poorly worded question by me, as the actual datasets are much larger (10k+ rows), and I understand the proposed `drop_duplicates` will only keep unique (zone, value) pairs, while I am aiming to remove **consecutive** duplicates (in my dataset a duplicate represents no change, but it is still logged). Thanks! – danwri Dec 19 '19 at 17:14
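
Given that clarification, a sketch of one way to keep a row only when its 'value' differs from the previous value within the same zone, using `groupby` and `shift` (illustrative only, not an answer from the original thread; it assumes the df_a layout shown in the question):

import pandas as pd

# the example frame from the question
df_a = pd.DataFrame(
    {"zone": [0, 1, 2, 3, 0, 1, 2, 3], "value": [12, 12, 99, 12, 12, 12, 12, 99]}
)

# compare each value to the previous value *within the same zone*;
# the first row of each zone shifts to NaN, so the comparison is True
# and that row is always kept
changed = df_a.groupby("zone")["value"].shift() != df_a["value"]
de_dup = df_a[changed]

print(de_dup)

On the example data this reproduces the expected output in the question (rows 4 and 5 are dropped), and it avoids both the per-zone loop and dynamically named variables.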