-1

I have two dataframes as per below:

df1 = [['tom', 10],['nick',15], ['juli',14]]
df2= [['juli', 14],['daniel',15], ['tom',10], ['tom',10]]

Please note that the dataframes might not have the same index order, but I would like to receive the differences between them as an output.

So I would expect the output to be:

['nick',15]
['daniel',15]
['tom',10]

If possible, I would like to know which data frame contains each value. In the example provided above, the first item (['nick',15]) belongs to df1 and the others to df2.

Bonus: Is there a way to export the output to Excel?

danielssl
  • 29
  • 4
  • Are these lists of lists or DataFrames? – not_speshal Oct 11 '21 at 19:06
  • 1
    Why is tom 10 in this result? Do duplicate rows in the same table count as separate? Also related [Python Pandas - Find difference between two data frames](https://stackoverflow.com/q/48647534/15497888) – Henry Ecker Oct 11 '21 at 19:07
  • ['tom', 10] should be on the result because there are 2x ['tom', 10] on df2 vs. only one of df1 – danielssl Oct 11 '21 at 19:09
  • @danielssl is [this](https://stackoverflow.com/a/69531328/16343464) what you're looking for? – mozway Oct 11 '21 at 19:15
  • @mozway Tried your answer, but not working properly... It returns an error to me – danielssl Oct 11 '21 at 19:18
  • @ mozway output = (df1.assign(idx=df1.groupby(cols).cumcount()).merge(df2.assign(idx=df1.groupby(cols).cumcount()),on=list(df1.columns+['idx'],indicator=True, how='outer').drop('idx',axis=1).query('_merge != "both"'))) TypeError: list() takes no keyword arguments – danielssl Oct 11 '21 at 19:19
  • Well you need to use dataframes as input, you said those were dataframes, not lists. Check the update – mozway Oct 11 '21 at 19:19

1 Answers1

-1

Assuming this input:

df1 = pd.DataFrame([['tom', 10],['nick',15], ['juli',14]])
df2 = pd.DataFrame([['juli', 14],['daniel',15], ['tom',10], ['tom',10]])

You could use merge with the indicator option.

The rationale here is to create an additional column with an index per group to identify the duplicates.

cols = list(df1.columns)
(df1.assign(idx=df1.groupby(cols).cumcount())
    .merge(df2.assign(idx=df2.groupby(cols).cumcount()),
           on=list(df1.columns)+['idx'],
           indicator=True,
           how='outer')
    .drop('idx', axis=1)
    .query('_merge != "both"')
   #.to_excel('output.xlsx')  ## uncomment to export as xlsx
)

output:

        0   1      _merge
1    nick  15   left_only
3  daniel  15  right_only
4     tom  10  right_only
mozway
  • 194,879
  • 13
  • 39
  • 75