
Set-up

I have two pandas data frames df1 and df2, each containing two columns with observations for id and its respective url,

| id | url |          | id | url | 
------------          ------------
| 1  | url |          | 2  | url |
| 2  | url |          | 4  | url |
| 3  | url |          | 3  | url |
| 4  | url |          | 5  | url |
                      | 6  | url |

Some observations are in both dfs, which is clear from the id column, e.g. observation 2 and its url are in both dfs.

The position of those 'double' observations is not necessarily the same in both dfs, e.g. observation 2 is in the second row of df1 but the first row of df2.

Lastly, the dfs do not necessarily have the same number of observations, e.g. df1 has four observations while df2 has five.
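For reproducibility, the two frames above could be built roughly as follows. This is only a sketch: the tables show a literal "url" in every cell, so the url_1 ... url_6 strings are made-up placeholders, not real data.

```python
import pandas as pd

# Placeholder URLs (assumption): the tables above only show "url",
# so these strings are invented to tell the rows apart.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "url": ["url_1", "url_2", "url_3", "url_4"],
})

df2 = pd.DataFrame({
    "id": [2, 4, 3, 5, 6],
    "url": ["url_2", "url_4", "url_3", "url_5", "url_6"],
})

print(df1)
print(df2)
```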


Problem

I want to extract all observations that appear only in df2 (i.e. not in df1) and insert them in a new df (df3), i.e. I want to obtain,

| id | url |
------------
| 5  | url |
| 6  | url |

How do I go about this?

I've tried this answer but cannot get it to work for my two-column dataframes.

I've also tried this other answer, but this gives me an empty common dataframe.

Bonifacio2
LucSpan
  • Are you after this: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe? if so then it's a dupe – EdChum Jul 12 '17 at 08:30
  • Thank you, but I cannot get that to work, see my question. – LucSpan Jul 12 '17 at 08:35
  • Please edit your post with your attempts from that question, stating that it doesn't work is not informative. I believe it would work but you need to prove it doesn't, also the answer from Greg should work, if it does then it's a dupe, if it doesn't then demonstrate this – EdChum Jul 12 '17 at 08:59

2 Answers


Possibly something like this:

df3 = df2[~df2.id.isin(df1.id.tolist())]
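As a rough end-to-end check of the isin approach, here is a self-contained sketch. The sample URLs are invented placeholders, and the reset_index call is an optional extra that the answer itself does not use. Series.isin also accepts a Series directly, so the tolist() call is not strictly needed.

```python
import pandas as pd

# Sample frames mirroring the question; the URL strings are placeholders.
df1 = pd.DataFrame({"id": [1, 2, 3, 4],
                    "url": ["url_1", "url_2", "url_3", "url_4"]})
df2 = pd.DataFrame({"id": [2, 4, 3, 5, 6],
                    "url": ["url_2", "url_4", "url_3", "url_5", "url_6"]})

# Boolean mask: True where a df2 id also occurs somewhere in df1.
in_df1 = df2["id"].isin(df1["id"])

# Keep only the df2 rows whose id is absent from df1.
# reset_index(drop=True) renumbers the rows from 0 (optional).
df3 = df2[~in_df1].reset_index(drop=True)

print(df3)
```

Note that this compares the id column only; rows with the same id but a different url would still count as duplicates.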

Greg

ID numbers make good index labels:

df1.index = df1.id
df2.index = df2.id

Then use the very straightforward index.difference:

diff_index = df2.index.difference(df1.index)
df3 = df2.loc[diff_index]
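A self-contained sketch of the same idea, assuming ids are unique within each frame. It uses set_index instead of assigning to .index, which is a small variation on the answer: id is moved out of the columns so the frame does not carry "id" both as an index name and as a column label, and reset_index puts it back at the end. The URL strings are placeholders.

```python
import pandas as pd

# Sample frames mirroring the question; the URL strings are placeholders.
df1 = pd.DataFrame({"id": [1, 2, 3, 4],
                    "url": ["url_1", "url_2", "url_3", "url_4"]})
df2 = pd.DataFrame({"id": [2, 4, 3, 5, 6],
                    "url": ["url_2", "url_4", "url_3", "url_5", "url_6"]})

# Index both frames by id (set_index returns new frames; originals are untouched).
df1_by_id = df1.set_index("id")
df2_by_id = df2.set_index("id")

# Ids present in df2's index but not in df1's.
diff_index = df2_by_id.index.difference(df1_by_id.index)

# Select those rows and turn id back into an ordinary column.
df3 = df2_by_id.loc[diff_index].reset_index()

print(df3)
```

Index.difference returns the labels sorted where possible, so the row order of df3 may differ from the order in df2.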