3

Let's say we have three datasets with three different years:

ID Text Year
101 abc 1990
102 abd 1990
103 a 1990

And the second dataset that could (or not) contain the IDs from the first year:

ID Text Year
104 bc 1991
101 abc 1991
102 abe 1991

And the third dataset:

ID Text Year
104 bc 1992
105 a 1992

I want somehow to merge these three dataframes + add a new column to see if the text between the common IDs (and uncommon IDs) from those consecutive years is the same or not + also update the year if we notice there's similarity between text from Year1 and Year2.

Here is the result I want:

ID Text Year Similar
101 abc 1991 true
102 abe 1991 false
103 a 1990 false
104 bc 1992 true
105 a 1992 false

So I also want to include those new IDs corresponding to the new years, but also to keep the IDs of the previous year, but without having a match in terms of ID + that similar column. The merge should not be inner (because we also want to integrate those IDs that are not present in the second/third dataframe) and the year should be updated if the Similar column is true (let's say if text from ID 104, year 1991 is the same as the text from ID 104, year 1992).

Thanks

stephsmith
  • 171
  • 5

2 Answers2

2

You can use a groupby.agg with lambda g: g.nunique()!=len(g) (if there at least one non duplicate) or lambda g: g.duplicated(keep=False).all() for the column Similar:

dfs = [df1, df2, df3]

out = (pd.concat(dfs)
         .groupby('ID', as_index=False)
         .agg(**{'Text': ('Text', 'last'),
                 'Year': ('Year', 'last'),
                 'Similar': ('Text', lambda g: g.duplicated(keep=False).all())
                })
      )

Output:

    ID Text  Year  Similar
0  101  abc  1991     True
1  102  abe  1991    False
2  103    a  1990    False
3  104   bc  1992     True
4  105    a  1992    False
mozway
  • 194,879
  • 13
  • 39
  • 75
  • What if we have another dataset with ID 104 and different text ("bd")? The similar should become False, not stay True. – stephsmith Dec 15 '22 at 09:28
  • 1
    @stephsmith I believe it does with `lambda g: g.duplicated(keep=False).all()` – mozway Dec 15 '22 at 09:34
  • Yes, it works when using that condition with g.duplicated(keep=False).all(). Many thanks. And what if I want to add another column for "summing" the duplicates found? Let's say for 101, we have 1 duplicate, for 104, 1 duplicate again.. And if we have another dataframe with ID 104 and "bc", to have 2 duplicates instead of 1. What to add to the existing aggregation? – stephsmith Dec 15 '22 at 09:36
  • 1
    add `'count': ('Text', lambda g: g.duplicated().sum())` – mozway Dec 15 '22 at 09:39
0

So, here I am first merging the all df using pandas concat function, in second stage group the id and third applying the lambda function and filtering the last position by -1.

Try following code:

pd.concat([df1,df2,df3]).groupby('id').apply(lambda x: x.iloc[[-1]]).reset_index(drop=True)
R. Baraiya
  • 1,490
  • 1
  • 4
  • 17
  • Could you also help by telling me how to add a new column for "Similar" to see if the text is similar by years? As in the example result – stephsmith Dec 15 '22 at 09:18