3

I have two dataframe as follows:

df1

SYMBOL  seqnames    start      end    SampleID
SPATA21   1       16736303  16736303    eAPD114
E2F2      1       23836607  23836607    eAPD114
FCN3      1       27701288  27701288    eAPD120
MARCKSL   1       32800671  32800671    KAPD144
MARCKSL   1       32800671  32800671    eAPD184
LRRC40    1       70644607  70644607    eAPD184
KREMEN1  22       29536275  29536275    eAPD005
KIF14     1       200569584 200569584   eAPD081
RGS7BP    5       63802465  63802465    YAPD025
PCDHB6    5       140531231 140531231   YAPD025
SERPINB4 18       61305310  61305310    eAPD081

df2

SYMBOL  seqnames    start      end
SPATA21   1       16736303  16736303
E2F2      1       23836607  23836607
FCN3      1       27701288  27701288
MARCKSL   1       32800671  32800671
LRRC40    1       70644607  70644607
KREMEN1  22       29536275  29536275
SERPINB4 18       61305310  61305310
SERPINB4 21       61305310  61305310

I want to map df1 to df2 and represent each SampleID as separate column with filling 0 and 1 if there is a match between df1 and df2:

Expected output:

SYMBOL  seqnames    start      end      eAPD114 eAPD120 KAPD144 eAPD184 eAPD005 eAPD081 YAPD025
SPATA21   1       16736303  16736303       1       0       0       0       0       0       0
E2F2      1       23836607  23836607       1       0       0       0       0       0       0
FCN3      1       27701288  27701288       0       1       0       0       0       0       0
MARCKSL   1       32800671  32800671       0       0       1       1       0       0       0
LRRC40    1       70644607  70644607       0       0       0       1       0       0       0
KREMEN1  22       29536275  29536275       0       0       0       0       1       0       0
SERPINB4 18       61305310  61305310       0       0       0       0       0       1       0
SERPINB4 21       61305310  61305310       0       0       0       0       0       0       0

I tried using the pivot method mentioned here . But did not work efficiently

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252

1 Answers1

1

Use:

cols = ['SYMBOL','seqnames','start','end']
#left join between both DataFrames
df = df2.merge(df1, on=cols, how='left')
#convert column SampleID to indicators,get max and last add missing df1['SampleID']
df = (df.join(pd.get_dummies(df.pop('SampleID')))
        .groupby(cols).max()
        .reindex(df1['SampleID'].unique(), axis=1, fill_value=0)
        .reset_index())
print (df)
     SYMBOL  seqnames     start       end  eAPD114  eAPD120  KAPD144  eAPD184  \
0      E2F2         1  23836607  23836607        1        0        0        0   
1      FCN3         1  27701288  27701288        0        1        0        0   
2   KREMEN1        22  29536275  29536275        0        0        0        0   
3    LRRC40         1  70644607  70644607        0        0        0        1   
4   MARCKSL         1  32800671  32800671        0        0        1        1   
5  SERPINB4        18  61305310  61305310        0        0        0        0   
6  SERPINB4        21  61305310  61305310        0        0        0        0   
7   SPATA21         1  16736303  16736303        1        0        0        0   

   eAPD005  eAPD081  YAPD025  
0        0        0        0  
1        0        0        0  
2        1        0        0  
3        0        0        0  
4        0        0        0  
5        0        1        0  
6        0        0        0  
7        0        0        0  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • #jezrael. this code do not work correctly when there is a mismatch in the `start` and `end` column of `df1` and `df2` –  Jul 10 '20 at 06:12
  • @user288925 - So need first filter only rows if strats is less like end in both DataFrames? Like `df1 = df1[df1['start'] < df1['end']]` and similar for `df2` ? – jezrael Jul 10 '20 at 06:14
  • 1
    the output should be like when all the 4 columns are mapped to both dataframe correctly `SYMBOL` , `seqnames`, `start ` , `end`. The given code is like when `SYMBOL` , `seqnames` are same in both dataframe it get mapped as assigns the values. But in real time all 4 column can have different values as well. In that case the code fails –  Jul 10 '20 at 06:18
  • @user288925 - oops, then it is simplier, give me some time. – jezrael Jul 10 '20 at 06:23