1

My question is an extension of this question:

Check if value in a dataframe is between two values in another dataframe

df1

    df1_Col     df1_start
0   A1          1200        
1   B2          4000        
2   B2          2500       

df2

    df2_Col     df2_start   df2_end       data
0   A1          1000        2000          DATA_A1
1   A1          900         1500          DATA_A1_A1
**2   A1          2000        3000          DATA_A1_A1_A1**
2   B1          2000        3000          DATA_B1
3   B2          2000        3000          DATA_B2

output:

    df1_Col     df1_start     data
0   A1          1200          DATA_A1;DATA_A1_A1
1   B2          4000          
2   B2          2500          DATA_B2

I am comparing the value of df1_Col to match with df2_Col and df1_start to be within the range of df2_start and df2_end, then add values of data column in df1. If there multiple matches, then data can combine with any delimiter like ';'.

The code is as follows:

for v,ch in zip(df1.df1_start, df1.df1_Col):
        df3 = df2[(df2['df2_start'] < v) & (df2['df2_end'] > v) & (df2['df2_Col'] ==ch)]
        data = df3['data']
        df1['data'] = data

Loops are used because file is huge.

EDIT: enter image description here

Looking forward for your assistance.

soosa
  • 125
  • 11

1 Answers1

2

IIUC:

try via merge()+groupby()+agg():

Left merge on df1 then check if 'df1_start' falls between 'df2_start' and 'df2_end' and creating column 'data' and setting it's value equal to None.Then we are grouping on ['df1_Col','df1_start'] and joining the values of 'date' seperated by ';' by dropping None:

out=df1.merge(df2,left_on='df1_Col',right_on='df2_Col',how='left',sort=True)
out.loc[~out['df1_start'].between(out['df2_start'], out['df2_end']), 'data'] = None
out=out.groupby(['df1_Col','df1_start'],as_index=False,sort=False)['data'].agg(lambda x:';'.join(x.dropna()))

output of out:

   df1_Col  df1_start       data
0   A1          1200        DATA_A1;DATA_A1_A1
1   B2          4000    
2   B2          2500        DATA_B2
Anurag Dabas
  • 23,866
  • 9
  • 21
  • 41
  • Excellent answer, thanks for it. But the only issue I am facing with the first script currently is that it adds the asterisk row (edited in the question) also in the data although it does not match the criteria of being in between the`df2_start` and `df2_end`. The second script adds ";" for the rows it is matched as "False". For example, the data table looks like this "DATA_A1;;;;;;;;DATA_A1_A1" considering I have more than 60,000 rows, the data column is kinda ambiguous. – soosa Aug 11 '21 at 06:26
  • I have tried the modified code:`out=out.groupby(['chr','start','end'],as_index=False,sort=False)['data'].agg(lambda x: ';'.join(x.dropna()))` but didn't owrk. – soosa Aug 11 '21 at 06:52
  • @aso so both script is working fine for you except that data column looks like this `"DATA_A1;;;;;;;;DATA_A1_A1"`? – Anurag Dabas Aug 11 '21 at 07:08
  • First script gives a false positive `data` value mentioned with asterisk in the question `DATA_A1;DATA_A1_A1;DATA_A1_A1_A1` although the third entry doesn't match the criteria in between the range `df2_start` and `df2_end` (Added a picture in the question for your understanding). Second script gives me the ambiguous `data` column. – soosa Aug 11 '21 at 07:17
  • @aso thanks sir for the efforts and patience.....updated answer...kindly have a look **:)** – Anurag Dabas Aug 11 '21 at 07:33
  • 1
    Perfectly worked... Really appreciate your help. – soosa Aug 11 '21 at 09:34