2

I have a df1 as:

enter image description here

There are a lot of duplicating values for SUBJECT_ID as shown in the picture. I have a df2 to merge from, but I want to merge it on unique SUBJECT_ID. For now I only know how to merge to entire SUBJECT_ID through this code:

df1 = pd.merge(df1,df2[['SUBJECT_ID', 'VALUE']], on='SUBJECT_ID', how='left' )

But this will merge on every SUBJECT_ID. I just need unique SUBJECT_ID. Please help me with this.

Cedric Zoppolo
  • 4,271
  • 6
  • 29
  • 59
Brian Wu
  • 57
  • 1
  • 6
  • 1
    Add code, errors, and data as text, not screenshots because [Stack Overflow Discourages Screenshots](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). It is likely the question will be down-voted. You are discouraging assistance because no one wants to retype your data or code, and screenshots are often illegible. [edit] the question and add text. – Trenton McKinney Jul 16 '20 at 03:15
  • 2
    Please [create a reproducible copy of the DataFrame with `df.head(10).to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246/how-to-provide-a-copy-of-your-dataframe-with-to-clipboard), [edit] the question, and paste the clipboard into a code block or include synthetic data: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Trenton McKinney Jul 16 '20 at 03:16
  • If there are so many duplicates, `.drop_duplicates` or `groupby`. – Trenton McKinney Jul 16 '20 at 03:18
  • I'm told to not drop_duplicates tho – Brian Wu Jul 16 '20 at 03:21
  • In copy and pasteable format what is df1 and df1 (include that in oyur question). Then, what is your desired output?: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – David Erickson Jul 16 '20 at 04:12
  • Like everyone is saying, create a short dummy dataframe using code so we can just copy/paste it instead of having to retype it. Example: `SUBJECT_ID = [31, 31, 31]` ... `GENDER = ['M', 'M', 'M']` ... etc. ... `df1 = pd.DataFrame(list(zip(SUBJECT_ID, GENDER)), columns =['SUBJECT_ID', 'GENDER'])` – a11 Jul 16 '20 at 04:23

2 Answers2

2

I think you will find your answer with the merge documentation.

It's not fully clear what you want, but here are some examples that may contain the answer you are looking for:

import pandas as pd
df1 = pd.read_csv('temp.csv')
display(df1)

SUBJECT_ID = [31, 32, 33]
something_interesting = ['cat', 'dog', 'fish']
df2 = pd.DataFrame(list(zip(SUBJECT_ID, something_interesting)), 
                   columns =['SUBJECT_ID', 'something_interesting']) 
display(df2)

enter image description here

df_keep_all = df1.merge(df2, on='SUBJECT_ID', how='outer')
display(df_keep_all)

enter image description here

df_keep_df1 = df1.merge(df2, on='SUBJECT_ID', how='inner')
display(df_keep_df1)

enter image description here

df_thinned = pd.merge(df1.drop_duplicates(), df2, on='SUBJECT_ID', how='inner')
display(df_thinned)

enter image description here

a11
  • 3,122
  • 4
  • 27
  • 66
1

You can use pandas drop function for it using this function you can remove all duplicate values for column or columns.

df2 = df.drop_duplicates(subset=['SUBJECT_ID'])`
tlentali
  • 3,407
  • 2
  • 14
  • 21