Joining two Pandas DataFrames with condition but keep both

Question

I am new here, and this is my first question. Sorry if I am doing this wrong.

I have some DataFrames, and I need to join them based on the existence of the same ID but keeping both lines. For example:

Table A:

ID	pID	Code	Condition
1	10	'bl'	'Normal'
2	11	'bl'	'MCI'
3	12	'bl'	'Normal'
4	15	'bl'	'EMCI'

Table B:

ID	pID	Code	Condition
1	10	'm06'	'Alzheimer'
2	11	'm06'	'MCI'
4	15	'm06'	'Alzheimer'
5	16	'm06'	'Normal'

Table C:

ID	pID	Code	Condition
1	10	'm12'	'Alzheimer'
2	11	'm12'	'MCI'
4	18	'm12'	'Normal'
5	19	'm12'	'MCI'

The result I need with this joins the following:

Result:

ID	pID	Code	Condition
1	10	'bl'	'Normal'
2	10	'm06'	'Alzheimer'
3	10	'm12'	'Alzheimer'
4	11	'bl'	'MCI'
5	11	'm06'	'MCI'
6	11	'm12'	'MCI'

This is just an example, tables do not have the same size, and the field ID is the ID provided by pd.DataFrame. Is this possible to make? Can you help me?

Thank you for the time you took to read my question! Happy programming!

Edit 1:

Thanks for all your answers. @Arkadiusz answer works fine for two tables. The problem is I have multiple tables, which I need to intersect the pID.

@Arkadiusz answer: (Little modification)

Concatenate two dataframes:

out = pd.concat([df_a, df_b]).sort_values('pID')

Leave only values which are duplicated by 'pID' column:

out = out[out['pID'].duplicated(keep=False)].reset_index(drop=True)

To answer some comments, when I say tables have different size, the current dataset I am working with and the one mentioned in this problem, I am referring to the number of rows.

The field ID is purely a field created from pd.DataFrame. It can be discarded.

Edit 2:

The image 1 shows how many tables I am using. VISCODE is what I mean by code, it actually describes when was the baseline visit and visits after that. The point here is to have a dataset where a person (pID) was present on all visits. To see the evolution of the patient.

Number of rows per code

Sorry for so many edits. Again, thank you for your time reading my question!

Edit 3: Just added a third table to make my question more clear. My goal is to maximize the number of patients on a longitudinal study. For this to work, the patient must have been present to all the exams (Code).

When you write `same size`, do you mean different amount of columns or rows? — CutePoison, Jun 23 '21 at 11:28
Thank you for your comments. I just needed to be sorted by that pID. By ´not have the same size´ I mean different number rows and columns — Trovão, Jun 23 '21 at 14:27

Arkadiusz · Accepted Answer · 2021-06-24T09:51:13.643

1

EDIT:

Steps 1 and 2 have been edited after changing the concepts - it will work for multiple dataframes.

Step 1. Concatenate all dataframes:

dfs = [df_a, df_b, df_c]
out = pd.concat(dfs).sort_values('pID')

Step 2. Leave values which appear in all dataframes:

out = out[out.groupby('pID')['ID'].transform('count').eq(len(dfs))].reset_index(drop=True)

Step 3. Change 'ID' series:

out['ID'] = pd.Series(range(len(out['ID']))).add(1)

Output:

   ID   pID   Code    Condition
0   1    10   'bl'     'Normal'
1   2    10  'm06'  'Alzheimer'
2   3    10  'm12'  'Alzheimer'
3   4    11   'bl'        'MCI'
4   5    11  'm06'        'MCI'
5   6    11  'm12'        'MCI'

edited Jun 24 '21 at 09:51

answered Jun 23 '21 at 11:41

Arkadiusz

1,835
1
8
15

Thank you! This solved the problem really well. You are awesome! – Trovão Jun 23 '21 at 14:24
Is this scalable? If I need to add a third and a fourth table maintaining the intersection, is it doable? – Trovão Jun 23 '21 at 15:43
You can concatenate multiple dataframes - https://stackoverflow.com/questions/53877687/how-can-i-concat-multiple-dataframes-in-python If it is not what you mean, you can add third and fourth dataframe to your question and expected output after modifications. – Arkadiusz Jun 24 '21 at 06:30
Thank you for your answer. I added a third table as an example. I do not want to just concatenate, I need to intersect the pIDs on all DataFrames and keep all lines this is happening. Your solution is awesome for two, but when I need to make 3 and 4 it will not work because there are duplicates instead of triplicates and quadruplicates. Sorry for so much trouble! – Trovão Jun 24 '21 at 08:27
I provided a new solution. Let me know if it works properly. I made changes for step 1 and step 2. – Arkadiusz Jun 24 '21 at 09:07
Almost good, instead of mode, is there anything to make it final, for example, if I am using 3 tables, save only the pID that appear 3 times? – Trovão Jun 24 '21 at 09:37
1

Yes, we can create a list of dataframes and take only those elements in final dataframe which are equal to the length of the list. I edited step 1 and step 2. Please check it now. – Arkadiusz Jun 24 '21 at 09:46

score 0 · Answer 2 · answered Jun 23 '21 at 11:32

0

Use concat

# Names of dataframes
frames = [df1, df2, df3] 
# Concat dataframes
result = pd.concat(frames)

answered Jun 23 '21 at 11:32

ksmehta

61
3

I don't know if OP wants it sorted by `pID` aswell i.e `result = result.sort_values(by=["pID"])` ? – CutePoison Jun 23 '21 at 11:34
This article is helpful for joining data-frame in pandas with various conditions. https://www.datacamp.com/community/tutorials/joining-dataframes-pandas – ksmehta Jun 23 '21 at 11:47
Concat is a good solution, problem is that gives me a full join. I only want the fields where pID is common to all the tables. – Trovão Jun 23 '21 at 15:54

score 0 · Answer 3 · answered Jun 23 '21 at 11:36

I am assuming you refer to joining because you only want to keep patients which are present in both dataframes. In you example pID 10 and pID 16 are dropped as well.

Otherwise pd.concat as @ksmehta propsed is far easier.

That being said. You can join them with

In [10]: df = pd.merge(a.drop('ID', axis=1),b.drop('ID', axis=1), on=['pID'])
Out[10]: 
   pID Code_x Condition_x  Code_y  Condition_y
0   10  'bl'     'Normal'  'm06'   'Alzheimer'
1   11  'bl'        'MCI'  'm06'         'MCI'
2   15  'bl'       'EMCI'  'm06'   'Alzheimer'

and then use pd.wide_to_long to combine the Code and Condition fields.

In [20]: pd.wide_to_long(df, stubnames=['Code', 'Condition'], i='pID', j='xy', suffix='_.')
Out[20]: 
          Code    Condition
pID xy                     
10  _x   'bl'      'Normal'
11  _x   'bl'         'MCI'
15  _x   'bl'        'EMCI'
10  _y  'm06'   'Alzheimer'
11  _y  'm06'         'MCI'
15  _y  'm06'   'Alzheimer'

To keep the association unique, a new column xy was introduced that you can drop in the end

In [24]: pd.wide_to_long(df, stubnames=['Code', 'Condition'], i='pID', j='xy', suffix='_.').reset_index('xy', drop=True).sort_index()
Out[24]: 
       Code    Condition
pID                     
10    'bl'      'Normal'
10   'm06'   'Alzheimer'
11    'bl'         'MCI'
11   'm06'         'MCI'
15    'bl'        'EMCI'
15   'm06'   'Alzheimer'

Thank you for your answer. Yes, this is for a study and I need to keep patients to see if their state has changed, I used this tables as an example the real tables have 96 columns :D — Trovão, Jun 23 '21 at 14:26

Joining two Pandas DataFrames with condition but keep both

3 Answers3