Replace dataframe rows with identical rows from another dataframe on a column value

Question

I have a dataframe `data` from which I took a subset, `g2_data`, to perform some operations on. How can I replace values in the original dataframe with the values from the subset, using one of the columns as the reference?

The column structure from data is retained in the subset g2_data shown below.

data:
         idx  group   x1   y1
0         27      1  0.0  0.0
1         28      1  0.0  0.0
2         29      1  0.0  0.0
3         73      1  0.0  0.0
4         74      1  0.0  0.0
     ...    ...  ...  ...
14612  14674      8  0.0  0.0
14613  14697      8  0.0  0.0
14614  14698      8  0.0  0.0
14615  14721      8  0.0  0.0
14616  14722      8  0.0  0.0

[14617 rows x 4 columns]

g2_data:
       idx  group     x1     y1
1125  1227      2  115.0    0.0
1126  1228      2    0.0  220.0
1127  1260      2    0.0    0.0
1128  1294      2    0.0    0.0
1129  1295      2    0.0    0.0
   ...    ...    ...    ...
3269  3277      2    0.0    0.0
3270  3308      2    0.0    0.0
3271  3309      2    0.0    0.0
3272  3342      2    0.0    0.0
3273  3343      2    0.0    0.0

[2149 rows x 4 columns]

The question "Replace rows in Dataframe using index from another Dataframe" has an answer that does this using the index values of the rows, but I would like to use the values in the idx column instead, in case I need to reset the index of the subset later on (i.e. start from 0 rather than keep the index values from the original dataframe). Note that the values in the idx column are all unique, as they pertain to information about each observation.

why not just use that answer, but temporarily set the index to your idx column? `data.set_index('idx').loc[g2_data.set_index('idx').index, :] = g2_data[:]` — StevenS, Oct 15 '21 at 20:56
@StevenS it seems this retains the original info from `data`. — tcolbert, Oct 15 '21 at 21:20
Is it possible an index `idx` exists in `g2_data` and not in `data`? — Corralien, Oct 15 '21 at 22:01
@Corralien I wouldn't think so, as `g2_data` was taken directly from `data` with the condition that `group == 2`. — tcolbert, Oct 15 '21 at 23:49
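
A likely reason the one-liner in the first comment leaves `data` unchanged: `data.set_index('idx')` returns a new dataframe, so the assignment lands on a temporary object that is then discarded. The right-hand side also still carries its original row labels rather than `idx`. Below is a minimal sketch of the same idea that keeps the result. The small frames are illustrative stand-ins for the real `data` and `g2_data`, and the names `indexed` and `subset` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the question's frames (values are illustrative only).
data = pd.DataFrame({
    "idx": [27, 1227, 1228, 14674],
    "group": [1, 2, 2, 8],
    "x1": [0.0, 0.0, 0.0, 0.0],
    "y1": [0.0, 0.0, 0.0, 0.0],
})
g2_data = data[data["group"] == 2].copy()
g2_data.loc[g2_data["idx"] == 1227, "x1"] = 115.0
g2_data.loc[g2_data["idx"] == 1228, "y1"] = 220.0
g2_data = g2_data.reset_index(drop=True)  # row labels no longer match data

# Key both frames by idx and keep a reference to the re-indexed copy.
indexed = data.set_index("idx")
subset = g2_data.set_index("idx")

# .loc assignment aligns the right-hand frame on row labels and column names.
indexed.loc[subset.index, subset.columns] = subset

# Restore idx as an ordinary column; the original row order is unchanged.
data = indexed.reset_index()
```

This assumes every `idx` in `g2_data` also exists in `data`, as stated in the comments above.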

score 1 · Accepted Answer · answered Oct 15 '21 at 22:07

This probably isn't optimal, but you can convert g2_data to a dictionary and then map the other columns based on idx, filtering the update to just those ids in the g2_data subset.

g2_data_dict = g2_data.set_index('idx').to_dict()
g2_data_ids = g2_data['idx'].to_list()

for k in g2_data_dict.keys():
    data.loc[data['idx'].isin(g2_data_ids), k] = data['idx'].map(g2_data_dict[k])
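
The same mapping idea can be sketched without the intermediate dictionary, since `Series.map` also accepts a Series and looks values up by its index. This is one possible variant, not necessarily faster. The toy frames and the names `lookup` and `mask` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the question's frames (values are illustrative only).
data = pd.DataFrame({
    "idx": [27, 1227, 1228, 14674],
    "group": [1, 2, 2, 8],
    "x1": [0.0, 0.0, 0.0, 0.0],
    "y1": [0.0, 0.0, 0.0, 0.0],
})
g2_data = pd.DataFrame({
    "idx": [1227, 1228],
    "group": [2, 2],
    "x1": [115.0, 0.0],
    "y1": [0.0, 220.0],
})

# Key the subset by idx so each column can serve as a lookup Series.
lookup = g2_data.set_index("idx")

# Rows of data whose idx appears in the subset.
mask = data["idx"].isin(lookup.index)

# For each column, map idx -> updated value and write only the masked rows.
for col in lookup.columns:
    data.loc[mask, col] = data.loc[mask, "idx"].map(lookup[col])
```

Restricting both sides to `mask` keeps rows outside the subset untouched.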

score 1 · Answer 2 · answered Oct 15 '21 at 22:07

Use combine_first:

out = g2_data.set_index('idx').combine_first(data.set_index('idx')).reset_index()

Answered by Corralien.

This does fill in the updated values, but it reorders the dataframe. Does setting the index to `'idx'` inherently reorder to fit the new index? Is there a way to maintain the original order while also setting a new index? – tcolbert Oct 16 '21 at 00:12
No, setting an index doesn't reorder a dataframe. What's happening here is that g2_data is the starting point, so the extra IDs in data are added to the bottom. If you want to keep your data sorted by idx then you can just add sort_index() before reset_index(). – StevenS Oct 16 '21 at 03:37
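
If the original row order of `data` should be kept rather than sorting by idx, one option is to reindex the combined result by the `idx` column of `data` before resetting the index. This is a sketch on illustrative toy frames, where `data` is deliberately not sorted by idx. Note that `combine_first` fills any missing (NaN) cells in `g2_data` from `data`, so it only suits cases where NaN in the subset does not mean "overwrite with NaN".

```python
import pandas as pd

# Toy stand-ins for the question's frames (values are illustrative only).
data = pd.DataFrame({
    "idx": [30, 10, 20, 40],
    "group": [1, 2, 2, 1],
    "x1": [0.0, 0.0, 0.0, 0.0],
    "y1": [0.0, 0.0, 0.0, 0.0],
})
g2_data = pd.DataFrame({
    "idx": [10, 20],
    "group": [2, 2],
    "x1": [115.0, 0.0],
    "y1": [0.0, 220.0],
})

out = (
    g2_data.set_index("idx")
    .combine_first(data.set_index("idx"))  # subset values take precedence
    .reindex(data["idx"])                  # restore the row order of data
    .reset_index()
)
```

Here `out` has its rows in the same idx order as `data` (30, 10, 20, 40), with the two group 2 rows carrying the updated values.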
