-1

I have 2 dataframes I want to merge together. In df1 it has an 'id' of A, B, C,... and in df2 it has an 'id' of A1, A2, A3, ..., B1, B2, B3... I want to merge them on condition of df1.id in df2.id. I've looked at tutorials, and searched StackOverflow, but not seeing anything relevant.

The keys of df1 is a subset of the keys in df2. So a simple pd.merge(df1, df2, on='id', how...) results in a dataframe that's not correct.

SETUP:

dummy_data1 = {
        'id': ['A', 'B', 'C', 'D', 'E'],
        'Feature1': ['1', '2', '3', '4', '5'],
        'Feature2': ['6', '7', '8', '9', '10']}

dummy_data2 = {
        'id': ['A1', 'A2', 'A3', 'B1', 'B2'],
        'Feature1': ['a', 'b', 'c', 'd', 'e'],
        'Feature2': ['f', 'g', 'h', 'i', 'j']}

df1 = pd.DataFrame(dummy_data1, columns = dummy_data1.keys())
df2 = pd.DataFrame(dummy_data2, columns = dummy_data2.keys())

Desired output:

    id   Feature1_x   Feature2_x   Feature1_y   Feature2_y
0   A1   1            6            a            f
1   A2   1            6            b            g
2   A3   1            6            c            h
3   B1   2            7            d            i
4   B2   2            7            e            j

ATTEMPT 1:

newdf = pd.merge(df1['id'], df2['id'], on='id', how='inner')

RESULT 1:

Empty DataFrame
Columns: [id]
Index: []

ATTEMPT 2:

newdf = pd.merge(df1['id'], df2['id'], on='id', how='outer')

RESULT 2:

   id
0   A
1   B
2   C
3   D
4   E
5  A1
6  A2
7  A3
8  B1
9  B2

ATTEMPT 3:

newdf = pd.merge(df1['id'], df2['id'].str[:1], on='id', how='inner')

RESULT 3:

  id
0  A
1  A
2  A
3  B
4  B
cupofcalculus
  • 51
  • 1
  • 8
  • Possible duplicate of [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) Have a look at this, it will answer your question, you need to decide whether or not you want to do an inner, outer or left join – Umar.H Oct 20 '19 at 00:31
  • also this is as simple as `pd.merge(df1['id'], df2['id'], on='id',how...)` good luck – Umar.H Oct 20 '19 at 00:32
  • None of those answer my question. The `on='id', how...` doesn't suffice, because the two id's are not equal to each other, but one is a subset of the other. That is why I need the `string1 in string2` condition. – cupofcalculus Oct 20 '19 at 01:30
  • Can you take a slice of df2 id to do the merge? `pd.merge(df1['id'], df2['id'].str[:6], on='id',how...)` – manwithfewneeds Oct 20 '19 at 01:34
  • I tried your suggestion, and got the substring portion to return, but I need the superstring. I need `A1, A2, A3...` to be in the resulting df, and not just `A, A, A,...`. – cupofcalculus Oct 20 '19 at 01:43
  • Added a "Desired Output" section. – cupofcalculus Oct 20 '19 at 02:30

1 Answers1

2

You can create a new column that only has Letters from 'id'. You can change regex as per your needs. Then you can merge on that new column and finally have desired columns in the result.

df2['new_id'] = df2['id'].apply(lambda x: re.search(r'[A-Z]',x).group())

df1.merge(df2, left_on='id',right_on='new_id')[['id_y','Feature1_x','Feature2_x','Feature1_y','Feature2_y']]

    id_y    Feature1_x  Feature2_x  Feature1_y  Feature2_y
0   A1          1          6           a         f
1   A2          1          6           b         g
2   A3          1          6           c         h
3   B1          2          7           d         i
4   B2          2          7           e         j

Assumption: you have id in df2 of type like 'A1', 'A2', 'B1' , etc.

vb_rises
  • 1,847
  • 1
  • 9
  • 14