1

I have two DataFrames. One looks like this:

df1.head()
#CHR    Start   End Name
chr1    141474  173862  SAP
chr1    745489  753092  ARB
chr1    762988  794826  SAS
chr1    1634175 1669127 ETH
chr1    2281853 2284259 BRB

And the second DataFrame looks as follows:

df2.head()
#chr    start   end
chr1    141477  173860
chr1    745500  753000
chr16   56228385    56229180
chr11   101785507   101786117
chr7    101961796   101962267

I am looking to map the first three columns from two DataFrames and create a new DataFrame, df3. For example, if #chr from both df1 and df2 are equal, then look for df2.start >= df1.start and df2.end <= df1.end.

If this is the case, print it out as the following:

df3.head()

#chr    start   end Name
chr1    141477  173860  SAP
chr1    745500  753000  ARB

So far I have tried to create a function for doing this:

def start_smaller_than_end(df1,df2):
    if df1.CHR == df2.CHR:
        df2.start >= df1.Start
        df2.End <= df2.End

    return df3

However, when I run it I get the following error:

df3(df1, df2)
name 'df3' is not defined

Any suggestions and help are greatly appreciated.

ARJ
  • 2,021
  • 4
  • 27
  • 52
  • check this question : http://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others – milos.ai Sep 08 '16 at 12:33

1 Answers1

2

I think you can use merge with boolean indexing:

df = pd.merge(df1, df2, how='outer', left_on='#CHR', right_on='#chr')

df = df[(df.start >= df.Start) & (df.end <= df.End)]
df = df[['#chr','start','end','Name']]
print (df)
   #chr   start     end Name
0  chr1  141477  173860  SAP
3  chr1  745500  753000  ARB

EDIT by comment:

Function start_smaller_than_end:

def start_smaller_than_end(df1,df2):
    df = pd.merge(df1, df2, how='outer', left_on='#CHR', right_on='#chr')
    df = df[(df.start >= df.Start) & (df.end <= df.End)]
    df = df[['#chr','start','end','Name']]
    return df

df3 = start_smaller_than_end(df1,df2)
print (df3)    
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you but both data frame is of different length, and so its throwing this error, ValueError: Series lengths must match to compare – ARJ Sep 08 '16 at 12:45
  • Yes, but you need first merge it and compare in `merged` dataframe `df`. Is still problem after merge? – jezrael Sep 08 '16 at 12:52
  • Sorry, I didn't try on merged df , I see, it is working now. Thank you so much ..!!! – ARJ Sep 08 '16 at 12:55
  • Is there a possible way to make it as a function as I defined above with your solution ? – ARJ Sep 08 '16 at 12:56
  • Sure, give me a sec. – jezrael Sep 08 '16 at 12:57
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/122907/discussion-between-user1017373-and-jezrael). – ARJ Sep 08 '16 at 13:07