11

I searched a lot for an answer, the closest question was Compare 2 columns of 2 different pandas dataframes, if the same insert 1 into the other in Python, but the answer to this person's particular problem was a simple merge, which doesn't answer the question in a general way.

I have two large dataframes, df1 (typically about 10 million rows), and df2 (about 130 million rows). I need to update values in three columns of df1 with values from three columns of df2, based on two df1 columns matching two df2 columns. It is imperative that the order of df1 remains unchanged, and that only rows with matching values get updated.

This is how the dataframes look like:

df1

chr    snp  x    pos a1 a2
1  1-10020  0  10020  G  A    
1  1-10056  0  10056  C  G    
1  1-10108  0  10108  C  G
1  1-10109  0  10109  C  G    
1  1-10139  0  10139  C  T

Note that it's not always the case that the values of "snp" is chr-pos, it can take many other values with no link to any of the columns (like rs1234, indel-6032 etc)

df2

ID           CHR   STOP  OCHR  OSTOP
rs376643643    1  10040     1  10020
rs373328635    1  10066     1  10056    
rs62651026     1  10208     1  10108    
rs376007522    1  10209     1  10109   
rs368469931    3  30247     1  10139

I need to update ['snp', 'chr', 'pos'] in df1 with df2[['ID', 'OCHR', 'OSTOP']] only when df1[['chr', 'pos']] matches df2[['OCHR', 'OSTOP']]

so in this case, after update, df1 would look like:

chr       snp  x     pos a1 a2    
1  rs376643643  0  10040  G  A    
1  rs373328635  0  10066  C  G    
1  rs62651026   0  10208  C  G    
1  rs376007522  0  10209  C  G    
3  rs368469931  0  30247  C  T

I have used merge as a workaround:

df1 = pd.merge(df1, df2, how='left', left_on=["chr", "pos"], right_on=["OCHR", "OSTOP"],
                                     left_index=False, right_index=False, sort=False)

and then

df1.loc[~df1.OCHR.isnull(), ["snp", "chr", "pos"]] = df1.loc[~df1.OCHR.isnull(), ["ID", "CHR", "STOP"]].values

and then remove the extra columns.

Yes, it works, but what would be a way to do that directly by comparing the values from both dataframes, I just don't know how to formulate it, and I couldn't find an answer anywhere; I guess it could be useful to get a general answer on this.

I tried that but it doesn't work:

df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["snp", "chr", "pos"]] = df2.loc[df2[['OCHR', 'OSTOP']] == df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["chr", "pos"]],['ID', ''CHR', 'STOP']].values

Thanks,

Stephane

Community
  • 1
  • 1
Stephane
  • 111
  • 1
  • 4
  • 1
    "I need to update ['snp', 'chr', 'pos'] in df1 with df2[['ID', 'OCHR', 'OSTOP']] only when df1[['chr', 'pos']] matches df2[['OCHR', 'OSTOP']]" Isn't this just saying you need to update `df1.snp` with `df2.ID` only when `char` and `pos` in df1 match `OCHR` and `OSTOP` in df2? – Alexander Jul 31 '15 at 19:03
  • If you merge chr,ochr and pos,ostop, then there is no need to update.. maybe you mean to update chr->CHR and post-> STOP ? – dermen Jul 31 '15 at 19:44
  • this is how it is in your example, after the merge, you want to update df1.chr -> df2.CHR and df1.pos -> df2.STOP, maybe correct that if its a typo – dermen Jul 31 '15 at 19:45
  • This can be done in one line of an SQL Update join query. If your dfs derive from a database, consider the power of an SQL relational engine. – Parfait Jul 31 '15 at 20:27
  • @Alexander, not exactly, all three columns (snp, chr and pos) need to be updated when chr and pos in df1 match OCHR and OSTOP in df2. – Stephane Aug 02 '15 at 21:37
  • @Parfait, no, it's not from a SQL database unfortunately. SQL is indeed very powerful for this kind of things. Thanks. – Stephane Aug 02 '15 at 21:42

2 Answers2

11

You can use the update function (requires setting the matching criteria to index). I've modified your sample data to allow some mismatch.

# your data
# =====================
# df1 pos is modified from 10020 to 10010
print(df1)

   chr      snp  x    pos a1 a2
0    1  1-10020  0  10010  G  A
1    1  1-10056  0  10056  C  G
2    1  1-10108  0  10108  C  G
3    1  1-10109  0  10109  C  G
4    1  1-10139  0  10139  C  T

print(df2)

            ID  CHR   STOP  OCHR  OSTOP
0  rs376643643    1  10040     1  10020
1  rs373328635    1  10066     1  10056
2   rs62651026    1  10208     1  10108
3  rs376007522    1  10209     1  10109
4  rs368469931    3  30247     1  10139

# processing
# ==========================
# set matching columns to multi-level index
x1 = df1.set_index(['chr', 'pos'])['snp']
x2 = df2.set_index(['OCHR', 'OSTOP'])['ID']
# call update function, this is inplace
x1.update(x2)
# replace the values in original df1
df1['snp'] = x1.values
print(df1)

   chr          snp  x    pos a1 a2
0    1      1-10020  0  10010  G  A
1    1  rs373328635  0  10056  C  G
2    1   rs62651026  0  10108  C  G
3    1  rs376007522  0  10109  C  G
4    1  rs368469931  0  10139  C  T
Jianxun Li
  • 24,004
  • 10
  • 58
  • 76
  • Hi Jianxun, that looks very interesting, I've never used multiple indexes. In you example it only updates the 'snp' column, if I were to use x1 = df1.set_index(['chr', 'pos'])['snp', 'chr', 'pos'] x2 = df2.set_index(['OCHR', 'OSTOP'])['ID', 'CHR', 'STOP'] would that update all three columns when calling x1.update(x2)? (I don't see why not) Thanks a lot. – Stephane Aug 02 '15 at 21:14
  • absolutely brilliant – umbe1987 Jul 08 '21 at 07:56
2

Start by renaiming the columns you want to merge in df2

df2.rename(columns={'OCHR':'chr','OSTOP':'pos'},inplace=True)

Now merge on these columns

df_merged = pd.merge(df1, df2, how='inner', on=['chr', 'pos']) # you might have to preserve the df1 index at this stage, not sure

Next, you want to

updater = df_merged[['D','CHR','STOP']] #this will be your update frame
updater.rename( columns={'D':'snp','CHR':'chr','STOP':'pos'},inplace=True) # rename columns to update original

Finally update (see bottom of this link):

df1.update( df1_updater) #updates in place
#  chr          snp  x    pos a1 a2
#0   1  rs376643643  0  10040  G  A
#1   1  rs373328635  0  10066  C  G
#2   1   rs62651026  0  10208  C  G
#3   1  rs376007522  0  10209  C  G
#4   3  rs368469931  0  30247  C  T

update works by matching index/column so you might have to string along the index of df1 for the entire process, then do df1_updater.re_index(... before df1.update(df1_updater)

dermen
  • 5,252
  • 4
  • 23
  • 34