1

I have two data frames which look like this

df1

              name  ID      abb
     0         foo  251803  I
     1         bar  376811  R
     2         baz  174254  Q
     3      foofoo  337144  IRQ
     4      barbar  306521  IQ

df2

          abb    comment
     0      I       fine
     1      R     repeat
     2      Q      other

I am trying to use pandas merge to join the two data frames and simply assign the comment column in the second data frame to the first based on the abb column in the following way:

df1.merge(df2, how='inner', on='abb')

resulting in:

              name  ID      abb  comment
     0         foo  251803  I       fine
     1         bar  376811  R     repeat
     2         baz  174254  Q      other

This works well for the unique one letter identifiers in abb. However, it obviously fails for more than one character.

I tried to use list on the abb column in first data frame but this results in a KeyError.

What I would like to do is the following.

1) Seperate the rows containing more than one character in this column into several rows

2) Merge the data frames

3) Optionally: Combine the rows again

Fourier
  • 2,795
  • 3
  • 25
  • 39

2 Answers2

2

Use join:

print (df1)
     name      ID  abb
0     foo  251803    I
1     bar  376811    R
2     baz  174254    Q
3  foofoo  337144  IRQ
4  barbar  306521   IQ

#each character to df, which is stacked to Series
s = df1.abb.apply(lambda x: pd.Series(list(x)))
           .stack()
           .reset_index(drop=True, level=1)
           .rename('abb')
print (s)
0    I
1    R
2    Q
3    I
3    R
3    Q
4    I
4    Q
Name: abb, dtype: object

df1 = df1.drop('abb', axis=1).join(s)
print (df1)
     name      ID abb
0     foo  251803   I
1     bar  376811   R
2     baz  174254   Q
3  foofoo  337144   I
3  foofoo  337144   R
3  foofoo  337144   Q
4  barbar  306521   I
4  barbar  306521   Q
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • the columns come as I have listed above. So I do not need to join them. However, I cannot use split on 'IRQ' to seperate the string into 'I', 'R' and 'Q' to use stack to obtain three columns. – Fourier Jul 21 '16 at 08:16
  • Very insightful answer. Thank you very much. I had the list already figured but not using apply which is the hat trick here. – Fourier Jul 21 '16 at 08:21
1

See this answer for various ways to explode on a column

rows = []
for i, row in df1.iterrows():
    for a in row.abb:
        rows.append([row['ID'], a, row['name']])

df11 = pd.DataFrame(rows, columns=df1.columns)

df11.merge(df2)

enter image description here

Community
  • 1
  • 1
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • looking at your original answer, this might be slightly faster than the above solution @piRSquared – Fourier Jul 21 '16 at 08:55