
I have a dataframe:

id     event_path
111    google.com
111    yandex.ru
111    vk.com
222    twitter.com
222    twitter.com
333    twitter.com
333    facebook.com
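
For reference, the frame can be reconstructed like this (a minimal sketch of the sample data above):

import pandas as pd

df = pd.DataFrame({
    'id': [111, 111, 111, 222, 222, 333, 333],
    'event_path': ['google.com', 'yandex.ru', 'vk.com',
                   'twitter.com', 'twitter.com',
                   'twitter.com', 'facebook.com']
})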

Desired output:

id     event_path
111    google.com
111    yandex.ru
111    vk.com
222    twitter.com
333    twitter.com
333    facebook.com

I tried to use shift on the column:

df.loc[(df.event_path != df.event_path.shift()) & \
       (df.id == df.id.shift())]

and it returns:

id     event_path
111    google.com
111    yandex.ru
111    vk.com
222    twitter.com
333    facebook.com

How can I fix that?

Petr Petrov
  • What are you trying to achieve here, dropping duplicates or consecutive duplicates? If the latter then this is a dupe of this: https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates – EdChum Nov 16 '17 at 10:47
  • @EdChum I need to get data like 111 -> google.com, yandex.ru, vk.com; 222 -> twitter.com; 333 -> twitter.com -> facebook.com. I need to merge duplicate URLs in the user's path – Petr Petrov Nov 16 '17 at 10:49
  • Your question is unclear, can you post a better explanation in your question as everyone is confused here – EdChum Nov 16 '17 at 10:51

2 Answers


Use pd.DataFrame.drop_duplicates

df.drop_duplicates()

    id    event_path
0  111    google.com
1  111     yandex.ru
2  111        vk.com
3  222   twitter.com
5  333   twitter.com
6  333  facebook.com
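
If you also want a fresh RangeIndex, drop_duplicates accepts ignore_index (assuming pandas >= 1.0, which added that parameter):

df.drop_duplicates(ignore_index=True)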

IIUC: the OP wants to remove a row only when the duplicate is adjacent.

df[df.eq(df.shift().bfill()).any(axis=1)]

    id    event_path
0  111    google.com
1  111     yandex.ru
2  111        vk.com
4  222   twitter.com
5  333   twitter.com
6  333  facebook.com
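
For comparison, the usual idiom for dropping adjacent full-row duplicates (a sketch along the lines of the question EdChum linked) keeps the first row of each consecutive run instead of the last:

# keep a row if it differs from the previous row in at least one column
df[df.ne(df.shift()).any(axis=1)]

On this data it yields the same rows as drop_duplicates above (index 3 for id 222 rather than index 4).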
piRSquared

You can create a helper Series of consecutive-value groups with shift and cumsum, join it with the id column, and call duplicated. Finally, filter the duplicates out by boolean indexing:

df1 = df[~df[['id']].join(df['event_path'].ne(df['event_path'].shift()).cumsum()).duplicated()]
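
The same logic broken into steps (a sketch; the helper names g and mask are only for illustration):

# g increments every time event_path changes, so equal consecutive
# values share one group number
g = df['event_path'].ne(df['event_path'].shift()).cumsum()

# a row is a consecutive duplicate if its (id, g) pair was already seen
mask = df[['id']].join(g).duplicated()

df1 = df[~mask]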
jezrael