
If I have two dataframes, as below.

df_1:

id  id_type
100 atype
101 atype
102 atype
603 another
604 another
605 another

and

df_2:

id_1  id_2 id_3
100   600  200
101   601  200
102   602  200
103   603  300
104   604  400
105   605  400

And I want to unify these to use the second ID type, the one starting with 600. OBVIOUSLY, the real ID differences are not just "add 500 to id_1 to get id_2"; that's just for easy demonstration.

So I want to create a new unified_id column, where I only use the 600-type IDs. So the output would be:

df_1:

id  id_type  unified_id
100 atype    600
101 atype    601
102 atype    602
603 another  603
604 another  604
605 another  605

So I have a conditional new-column creation, where the condition is "if the type is 'another', keep the ID from the current df; if not, read the ID from df_2". This is different from questions like this one, for example, because the column isn't filled from a condition plus a set of fixed values, but from a condition plus values that have to be read from either the current df or another df.

How can I achieve this?
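For anyone who wants to reproduce this, the example frames can be built like so (a minimal sketch of the tables above; plain integer/string dtypes are assumed):

```python
import pandas as pd

# Reconstruction of the example data from the question.
df_1 = pd.DataFrame({
    "id": [100, 101, 102, 603, 604, 605],
    "id_type": ["atype", "atype", "atype", "another", "another", "another"],
})
df_2 = pd.DataFrame({
    "id_1": [100, 101, 102, 103, 104, 105],
    "id_2": [600, 601, 602, 603, 604, 605],
    "id_3": [200, 200, 200, 300, 400, 400],
})
```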

[EDIT]

No, this is NOT covered in merging 101. This is a calculated-column question, not a merging one. Merging doesn't combine half a column from one df with half a column from another. I don't know how this is not clear. If I merge on left `id` and right `id_1`, I get a new column that's half correct and half NaN. That's not what I want. I need to conditionally read some values from another df and the rest from the current one.

lte__

1 Answer


You can melt and merge. For the demo I added a column `col` to df_2:

   id_1  id_2  col
0   100   600    0
1   101   601    1
2   102   602    2
3   103   603    3
4   104   604    4
5   105   605    5

command:

df_1.merge(df_2.melt(id_vars=['col'], value_name='id'), on='id', how='left')

output:

    id  id_type  col variable
0  100    atype    0     id_1
1  101    atype    1     id_1
2  102    atype    2     id_1
3  603  another    3     id_2
4  604  another    4     id_2
5  605  another    5     id_2
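To get from this melted merge to the requested `unified_id` column, one possible follow-up (my addition, not part of the answer above) is to map `col` back to `id_2`:

```python
import pandas as pd

df_1 = pd.DataFrame({"id": [100, 101, 102, 603, 604, 605],
                     "id_type": ["atype"] * 3 + ["another"] * 3})
df_2 = pd.DataFrame({"id_1": [100, 101, 102, 103, 104, 105],
                     "id_2": [600, 601, 602, 603, 604, 605],
                     "col": range(6)})

# Stack id_1 and id_2 into a single 'id' column, keeping 'col' as the row key,
# then merge so each df_1 id finds the df_2 row it belongs to.
merged = df_1.merge(df_2.melt(id_vars=["col"], value_name="id"),
                    on="id", how="left")

# Look up the 600-type ID for each matched row via 'col' (my addition).
df_1["unified_id"] = merged["col"].map(df_2.set_index("col")["id_2"]).to_numpy()
```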
Alternative

Perform two consecutive merges with combine_first:

df_2 = df_2.convert_dtypes() # optional, to avoid conversion to float

out = (df_1
 .merge(df_2, left_on='id', right_on='id_1', how='left')
 .combine_first(df_1.merge(df_2, left_on='id', right_on='id_2', how='left'))
)

output:

    id  id_type  id_1  id_2  col
0  100    atype   100   600    0
1  101    atype   101   601    1
2  102    atype   102   602    2
3  603  another   103   603    3
4  604  another   104   604    4
5  605  another   105   605    5
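Regarding the extra columns mentioned in the comments: a possible cleanup step after the two merges (the `result` name and the rename to `unified_id` are my assumptions, not part of the answer) would be:

```python
import pandas as pd

df_1 = pd.DataFrame({"id": [100, 101, 102, 603, 604, 605],
                     "id_type": ["atype"] * 3 + ["another"] * 3})
df_2 = pd.DataFrame({"id_1": [100, 101, 102, 103, 104, 105],
                     "id_2": [600, 601, 602, 603, 604, 605]}).convert_dtypes()

# Two merges: one matches the 'atype' rows via id_1, the other matches the
# 'another' rows via id_2; combine_first fills the NaNs of the first with the second.
out = (df_1
       .merge(df_2, left_on="id", right_on="id_1", how="left")
       .combine_first(df_1.merge(df_2, left_on="id", right_on="id_2", how="left")))

# id_2 holds the 600-type ID in every row; keep only the requested columns.
result = out.rename(columns={"id_2": "unified_id"})[["id", "id_type", "unified_id"]]
```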
mozway
  • I'll look into what melt does but your answer output doesn't match my provided example output. – lte__ Apr 03 '22 at 13:45
  • @lte__ check the alternative – mozway Apr 03 '22 at 13:47
  • Alternative is almost fine, I could do without the other columns I don't need. Also, it's a workaround to the issue. Is there no way where you can do a conditional new column creation with Pandas? This does a bunch of unnecessary calculations that could slow down the process with a large dataset. Why re-merge with df_2 when half the data is in there already in df_1? – lte__ Apr 03 '22 at 13:49
  • There are many options, you can also do `df['unified_id'] = df_1['id'].map(df_2.set_index('id_1')['id_2']).fillna(df_1['id'])` – mozway Apr 03 '22 at 13:50
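The one-liner from the last comment, as a runnable sketch (the `astype(int)` cast is my addition, since `fillna` leaves the column as float after the unmatched rows produce NaN):

```python
import pandas as pd

df_1 = pd.DataFrame({"id": [100, 101, 102, 603, 604, 605],
                     "id_type": ["atype"] * 3 + ["another"] * 3})
df_2 = pd.DataFrame({"id_1": [100, 101, 102, 103, 104, 105],
                     "id_2": [600, 601, 602, 603, 604, 605]})

# Look up each id in df_2's id_1 -> id_2 mapping; ids with no match
# (the 'another' rows) fall back to their own value via fillna.
df_1["unified_id"] = (df_1["id"]
                      .map(df_2.set_index("id_1")["id_2"])
                      .fillna(df_1["id"])
                      .astype(int))
```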