0

I would like to pair corresponding matching values to a specific columns' values using Pandas without getting duplicates.

Data

df1

IN  name1   size    ttee    date        days    new date    year    month
AA1 dd      1       FALSE   1/14/2023   10      1/14/2023   2023    January
AA1 ff      2       FALSE   1/9/2023    10      1/9/2023    2023    January
AA1 jj      3       FALSE   1/8/2023    10      1/8/2023    2023    January
AA1 mm      4       FALSE   1/9/2023    10      1/9/2023    2023    January
AA1 nn      5       FALSE   1/29/2023   10      1/29/2023   2023    January

df2

name    stat    year    month   
AA1     KZZ:1   2022    January 
AA1     KZZ:1   2023    April   
AA1     KZZ:1   2023    March   
AA1     KZZ:1   2022    February    
AA1     KZZ:1   2022    April   
AA1     KZZ:1   2022    September
AA1     KZZ:1   2022    February    
AA1     KZZ:1   2022    March   
AA1     KZZ:1   2023    June    
AA1     KZZ:1   2022    December
AA1     KZZ:1   2022    January 
AA1     KZZ:1   2022    January 

Desired

IN  name1   size    ttee    date        days    new date    year    month   stat
AA1 dd      1       FALSE   1/14/2023   10      1/14/2023   2023    January KZZ:1
AA1 ff      2       FALSE   1/9/2023    10      1/9/2023    2023    January KZZ:1
AA1 jj      3       FALSE   1/8/2023    10      1/8/2023    2023    January KZZ:1
AA1 mm      4       FALSE   1/9/2023    10      1/9/2023    2023    January KZZ:1
AA1 nn      5       FALSE   1/29/2023   10      1/29/2023   2023    January KZZ:1

I've tried doing this:

out = pd.merge(df1,df2.drop_duplicates(), left_on=['IN'], right_on= ['name'], how="left")

However, the above script is giving an exploded output with combinations and does not retain the original left dataframe row count.

jared
  • 4,165
  • 1
  • 8
  • 31
Lynn
  • 4,292
  • 5
  • 21
  • 44
  • What are you trying to join them based on? It seems like the desired result is just `df1` with the `stat` column of `df2`. – jared Jul 14 '23 at 03:27
  • Hi @jared moreso, if values in the ‘name’ column of df2 dataset matches values in the ‘IN ‘column from df1 dataset, append matching values in the stat column of df2 dataset using pandas – Lynn Jul 14 '23 at 03:48
  • But they all match. They're all the same. – jared Jul 14 '23 at 04:17
  • Right, say there was a case where there are multiple different values in both specific columns. I would like to match to their corresponding value. – Lynn Jul 14 '23 at 04:30
  • You should share that example, not a special case. – jared Jul 14 '23 at 04:33
  • Do you have more columns than those shown in `df2`? Make sure to only consider relevant columns: `drop_duplicates(subset=['name', 'stat', 'year', 'month'])`. Also you should merge on IN/year/month vs name/year/month, not just IN/name. It's the same issue as in your previous question. Make sure to use a pen an paper to compute the merge manually and understand why it failed. As you specified a merge on IN/name, every time you find a matching name you copy the row, here all rows of `df2` match if you only merge on one key... – mozway Jul 14 '23 at 05:11
  • 1
    You should not **try**, you should **understand and know** which alternative you need. Don't program by trial-and-error changing things randomly until you get the desired result, understand the concepts. – mozway Jul 14 '23 at 06:57
  • Oh yes I always seek to understand the concepts fully, as this is how we can truly learn! Thank you for your intelligence @mozway – Lynn Jul 14 '23 at 21:04

1 Answers1

0

You can map:

mapping_dict = dict(zip(df2['IN'], df2['stat']))

#mapping to the 'IN' column in df1
df1['stat'] = df1['IN'].str.split(' ').str[0].map(mapping_dict)
Lynn
  • 4,292
  • 5
  • 21
  • 44
  • 2
    What you do here is to just ignore the year/month in `df2`, if this is a correct logic then you should have used `df2.drop_duplicates(subset=['IN'])` in your `merge`. This would however be incorrect if "start" also depends on "year"/"month" (see my other comment). – mozway Jul 14 '23 at 05:18