4

I'm trying to combine two different dataframes I've imported in python with pandas. They are the results of some eye-tracking I've done. One of them however contains both the class and method the user has gazed upon, meaning that for every row dataframe1 has dataframe2 has an extra one. Now this doesn't happen in every row so I can't just duplicate the rows, but what I was thinking was to add another row every time the index of dataframe2 had two of the same indices. Kinda like this:

dataframe1 = pd.DataFrame({'index':[1,2,3],'a':['asd','fgh','qwe'],'b':['dsa','hgf','ewq'],'c':['sad','gfh','wqe']})
dataframe1=dataframe1[['index','a','b','c']]
dataframe1
   index    a    b    c
0      1  asd  dsa  sad
1      2  fgh  hgf  gfh
2      3  qwe  ewq  wqe

dataframe2 = pd.DataFrame({'index':[1,1,2,3,3],'d':['zxc','cxz','xzc','zxc','xcz']})
dataframe2=dataframe2[['index','d']]
dataframe2
   index    d
0      1  zxc
1      1  cxz
2      2  xzc
3      3  zxc
4      3  xcz

Expected Result:

index, a, b, c, d
1, asd, dsa, sad, zxc
1, nan, nan, nan, cxz
2, fgh, hgf, gfh, xzc
3, qwe, ewq, wqe, zxc
3, nan, nan, nan, xcz

Any built in functions to use? The values can also just be the values of the previous line with the same index.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • Take a look here if you want to learn more about merging dataframes in pandas: https://pandas.pydata.org/pandas-docs/stable/merging.html – gosuto Dec 26 '18 at 12:22
  • @jorijnsmit Or, my favourite link to recommend is https://stackoverflow.com/questions/53645882/pandas-merging-101 ;-) – cs95 Dec 26 '18 at 12:26

2 Answers2

4

Use pd.merge with an additional cumcounted column:

u = df2.assign(cnt=df2.groupby('index').cumcount())
v = df.assign(cnt=df.groupby('index').cumcount())

u.merge(v, on=['index', 'cnt'], how='left').drop('cnt', 1)

   index    d    a    b    c
0      1  zxc  asd  dsa  sad
1      1  cxz  NaN  NaN  NaN
2      2  xzc  fgh  hgf  gfh
3      3  zxc  qwe  ewq  wqe
4      3  xcz  NaN  NaN  NaN

Details

We introduce cumulative counts for the duplicate values in "index".

u = df2.assign(cnt=df2.groupby('index').cumcount())
u
   index    d  cnt
0      1  zxc    0
1      1  cxz    1
2      2  xzc    0
3      3  zxc    0
4      3  xcz    1

v = df.assign(cnt=df.groupby('index').cumcount())
v
   index    a    b    c  cnt
0      1  asd  dsa  sad    0
1      2  fgh  hgf  gfh    0
2      3  qwe  ewq  wqe    0

We then force a LEFT JOIN wrt u on "index" and "cnt". This way, NaNs are introduced int the result:

u.merge(v, on=['index', 'cnt'], how='left')

   index    d  cnt    a    b    c
0      1  zxc    0  asd  dsa  sad
1      1  cxz    1  NaN  NaN  NaN
2      2  xzc    0  fgh  hgf  gfh
3      3  zxc    0  qwe  ewq  wqe
4      3  xcz    1  NaN  NaN  NaN

The last step is to delete the temporary "cnt" column.

cs95
  • 379,657
  • 97
  • 704
  • 746
2

Using merge with mask and duplicated:

df = df2.merge(df1)
cols = ['index','a','b','c']
df[['a','b','c']] = df[cols].mask(df[cols].duplicated())[['a','b','c']]

print(df)
   index    d    a    b    c
0      1  zxc  asd  dsa  sad
1      1  cxz  NaN  NaN  NaN
2      2  xzc  fgh  hgf  gfh
3      3  zxc  qwe  ewq  wqe
4      3  xcz  NaN  NaN  NaN
Space Impact
  • 13,085
  • 23
  • 48