1

I have three datasets having the same columns name but different data.

Dataset 1

Date              Full Name           Texts         Country            Read
02/05/2017      Jack Smith       Grey ranks...        UK               True  
02/05/2017      Mary Jane        Once upon a time    France            False
...
24/06/2019     Luke Martyn       Bottled...          USA               True

Similarly for the other dataset 2

Date              Full Name           Texts         Country            Read
02/05/2017      Jack Smith       The story ...        UK               True  
04/05/2017      Mary Jane        She saw ...         France            False
...
12/03/2019     Chris Martyn       From Mexico ...     Mexico           False

and the dataset 3.

Date              Full Name           Texts         Country            Read
23/02/2017      Jack Smith          They said  ...        UK               True  
01/05/2017      Martin O'Connell    When I was young ...  Ireland          True
...
14/07/2018     Laura Cross          My family ...         Italy            True

I would like to get the following information from these datasets:

  • unique/distinct names, i.e. names that are not included in one or more datasets (but in this case I would like to know if it would be possible to know in which dataset they are included/missing)
  • common names, i.e. names that are in all the three datasets or in two of them (same as above).

Some people are in all three datasets; some others are not. Of course, if people are the same, I have same information regarding the Country, but different texts. I should probably join the three datasets to see which unique names are in all the three datasets and where they differ. And, yes, probably it would be necessary to create a new column to get information about the dataset that include or it does not the name.

I know how to join the three datasets without any condition. However, I would need to get the information above based on a condition, for example when Read = True.

halfer
  • 19,824
  • 17
  • 99
  • 186
  • Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – Umar.H May 24 '20 at 23:42
  • @Datanovice, I already had a look at it and it is extremely useful. But I would like to know how to join the datasets based on a condition. –  May 24 '20 at 23:51

1 Answers1

0

IIUC, you can use concat with keys to list your source dataframe, then just simple named aggregation groupby .agg

df = (
    pd.concat([df1, df2, df3], keys=["df1", "df2", "df3"])
    .reset_index(0)
    .rename(columns={"level_0": "key"})
)


df.groupby(['Full Name']).agg(name_occurance=('key','count'),
                             source=('key',','.join))

                  name_occurance       source
Full Name                                    
Chris Martyn                   1          df2
Jack Smith                     3  df1,df2,df3
Laura Cross                    1          df3
Luke Martyn                    1          df1
Martin O'Connell               1          df3
Mary Jane                      2      df1,df2
Umar.H
  • 22,559
  • 7
  • 39
  • 74
  • thank you so much @Datanovice. About the condition (Read= True), is it already included in your code or I should set it? –  May 25 '20 at 00:34
  • 1
    @Math do you mean `df[df['Read'] == 'True'].groupby(['Full Name']).agg(name_occurance=('key','count'), source=('key',','.join))` – Umar.H May 25 '20 at 00:42