
I have a dataframe:

id     event_path
111    google.com
111    yandex.ru
111    vk.com
222    twitter.com
222    twitter.com
333    twitter.com
333    facebook.com
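
For reference, the frame can be reconstructed like this (a minimal sketch of the sample data above):

import pandas as pd

df = pd.DataFrame({
    'id': [111, 111, 111, 222, 222, 333, 333],
    'event_path': ['google.com', 'yandex.ru', 'vk.com',
                   'twitter.com', 'twitter.com',
                   'twitter.com', 'facebook.com']
})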

Desired output:

id     event_path
111    google.com
111    yandex.ru
111    vk.com
222    twitter.com
333    twitter.com
333    facebook.com

I tried to use shift on the column:

df.loc[(df.event_path != df.event_path.shift()) & \
       (df.id == df.id.shift())]

and it returns:

id     event_path
111    google.com
111    yandex.ru
111    vk.com
222    twitter.com
333    facebook.com

How can I fix that?

Petr Petrov
  • What are you trying to achieve here, dropping duplicates or consecutive duplicates? If the latter then this is a dupe of this: https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates – EdChum Nov 16 '17 at 10:47
  • @EdChum I need to get data like 111 -> google.com, yandex.ru, vk.com; 222 -> twitter.com; 333 -> twitter.com -> facebook.com. I need to merge duplicate URLs in the user's path – Petr Petrov Nov 16 '17 at 10:49
  • Your question is unclear, can you post a better explanation in your question as everyone is confused here – EdChum Nov 16 '17 at 10:51

2 Answers


Use pd.DataFrame.drop_duplicates

df.drop_duplicates()

    id    event_path
0  111    google.com
1  111     yandex.ru
2  111        vk.com
3  222   twitter.com
5  333   twitter.com
6  333  facebook.com
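
If you also want a fresh RangeIndex, drop_duplicates accepts ignore_index (assuming pandas >= 1.0, which added that parameter):

df.drop_duplicates(ignore_index=True)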

IIUC: the OP wants to remove a row only when the duplicate is adjacent.

df[df.eq(df.shift().bfill()).any(axis=1)]

    id    event_path
0  111    google.com
1  111     yandex.ru
2  111        vk.com
4  222   twitter.com
5  333   twitter.com
6  333  facebook.com
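
For comparison, the usual idiom for dropping adjacent full-row duplicates (a sketch along the lines of the question EdChum linked) keeps the first row of each consecutive run instead of the last:

# keep a row if it differs from the previous row in at least one column
df[df.ne(df.shift()).any(axis=1)]

On this data it yields the same rows as drop_duplicates above (index 3 for id 222 rather than index 4).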
piRSquared

You can create a helper Series of consecutive-value groups with shift and cumsum, join it with the id column, and call duplicated. Finally, filter the duplicates out by boolean indexing:

df1 = df[~df[['id']].join(df['event_path'].ne(df['event_path'].shift()).cumsum()).duplicated()]
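
The same logic broken into steps (a sketch; the helper names g and mask are only for illustration):

# g increments every time event_path changes, so equal consecutive
# values share one group number
g = df['event_path'].ne(df['event_path'].shift()).cumsum()

# a row is a consecutive duplicate if its (id, g) pair was already seen
mask = df[['id']].join(g).duplicated()

df1 = df[~mask]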
jezrael