How to outer merge 3 or more datasets based on an id and compare them using one column?

Question

Let's say we have three datasets with three different years:

ID	Text	Year
101	abc	1990
102	abd	1990
103	a	1990

And the second dataset that could (or not) contain the IDs from the first year:

ID	Text	Year
104	bc	1991
101	abc	1991
102	abe	1991

And the third dataset:

ID	Text	Year
104	bc	1992
105	a	1992

I want somehow to merge these three dataframes + add a new column to see if the text between the common IDs (and uncommon IDs) from those consecutive years is the same or not + also update the year if we notice there's similarity between text from Year1 and Year2.

Here is the result I want:

ID	Text	Year	Similar
101	abc	1991	true
102	abe	1991	false
103	a	1990	false
104	bc	1992	true
105	a	1992	false

So I also want to include those new IDs corresponding to the new years, but also to keep the IDs of the previous year, but without having a match in terms of ID + that similar column. The merge should not be inner (because we also want to integrate those IDs that are not present in the second/third dataframe) and the year should be updated if the Similar column is true (let's say if text from ID 104, year 1991 is the same as the text from ID 104, year 1992).

Thanks

https://stackoverflow.com/a/52223263/18286914 – krisskad Dec 15 '22 at 09:09 — krisskad, Dec 15 '22 at 09:09

mozway · Accepted Answer · 2022-12-15T09:35:09.817

2

You can use a groupby.agg with lambda g: g.nunique()!=len(g) (if there at least one non duplicate) or lambda g: g.duplicated(keep=False).all() for the column Similar:

dfs = [df1, df2, df3]

out = (pd.concat(dfs)
         .groupby('ID', as_index=False)
         .agg(**{'Text': ('Text', 'last'),
                 'Year': ('Year', 'last'),
                 'Similar': ('Text', lambda g: g.duplicated(keep=False).all())
                })
      )

Output:

    ID Text  Year  Similar
0  101  abc  1991     True
1  102  abe  1991    False
2  103    a  1990    False
3  104   bc  1992     True
4  105    a  1992    False

edited Dec 15 '22 at 09:35

answered Dec 15 '22 at 09:17

mozway

194,879
13
39
75

What if we have another dataset with ID 104 and different text ("bd")? The similar should become False, not stay True. – stephsmith Dec 15 '22 at 09:28
1

@stephsmith I believe it does with `lambda g: g.duplicated(keep=False).all()` – mozway Dec 15 '22 at 09:34
Yes, it works when using that condition with g.duplicated(keep=False).all(). Many thanks. And what if I want to add another column for "summing" the duplicates found? Let's say for 101, we have 1 duplicate, for 104, 1 duplicate again.. And if we have another dataframe with ID 104 and "bc", to have 2 duplicates instead of 1. What to add to the existing aggregation? – stephsmith Dec 15 '22 at 09:36
1

add `'count': ('Text', lambda g: g.duplicated().sum())` – mozway Dec 15 '22 at 09:39

R. Baraiya · Answer 2 · 2022-12-15T09:21:10.350

0

So, here I am first merging the all df using pandas concat function, in second stage group the id and third applying the lambda function and filtering the last position by -1.

Try following code:

pd.concat([df1,df2,df3]).groupby('id').apply(lambda x: x.iloc[[-1]]).reset_index(drop=True)

edited Dec 15 '22 at 09:21

answered Dec 15 '22 at 09:10

R. Baraiya

1,490
1
4
17

Could you also help by telling me how to add a new column for "Similar" to see if the text is similar by years? As in the example result – stephsmith Dec 15 '22 at 09:18

How to outer merge 3 or more datasets based on an id and compare them using one column?

2 Answers2