
I'm working on an NLP problem and need to locate tokens (drugs) in sentences. I have two dataframes: one with the start_0 and end_0 positions of tokens, and a second with the start_1 and end_1 positions of sentences. I already have a straightforward implementation that loops over rows and checks the conditions on both columns, but I am interested in a more 'Pandas-like' solution to the following problem. For example:

Position of tokens:

 df_0 =

 start_0   end_0    token
0     20      27    aspirin
1     50      59    trazodone
2     81      88    placebo
3    121     127    haldol

Position of sentences:

 df_1=

    start_1 end_1
0         0    17
1        17    29
2        29    46
3        46    64
4        64    76
5        76    81
6        81    97
7        97   227

I need to create a new column in df_1 and put the corresponding token in each row, namely:

df_1=

    start_1 end_1     token
0         0    17       NaN
1        17    29   aspirin
2        29    46       NaN
3        46    64 trazodone
4        64    76       NaN
5        76    81       NaN
6        81    97   placebo
7        97   227    haldol

Simply match the two data frames whenever the position of a token falls within a sentence. There must be a simple solution using Pandas functionality, rather than looping over rows and checking both boundaries.


2 Answers


It appears what you are really looking to do is map the words in df_0 to sentence boundaries defined in df_1. You can use pd.cut for this, taking care to include all boundaries in df_1:

import numpy as np

# bin edges: every sentence start, plus the end of the last sentence
boundaries = np.hstack((df_1['start_1'], df_1['end_1'].iloc[-1]))
boundary_labels = df_1.index

# right=False makes each bin left-closed, so a token that starts exactly
# on a boundary falls into the sentence that begins there
df_0['sentence'] = pd.cut(df_0['start_0'], boundaries, labels=boundary_labels, right=False)

print(df_0)

   start_0  end_0      token sentence
0       20     27    aspirin        1
1       50     59  trazodone        3
2       81     88    placebo        6
3      121    127     haldol        7

You can trivially map the df_0['sentence'] series back to df_1 as the values align perfectly with df_1['token'] in your desired output.
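As a sketch of that mapping step (reusing the data from the question and assuming modern pandas), one way is to turn the sentence labels into a plain dict and look it up against df_1's index; rows with no token come out as NaN:

```python
import numpy as np
import pandas as pd

df_0 = pd.DataFrame([[20, 27, "aspirin"], [50, 59, "trazodone"],
                     [81, 88, "placebo"], [121, 127, "haldol"]],
                    columns=["start_0", "end_0", "token"])
df_1 = pd.DataFrame([[0, 17], [17, 29], [29, 46], [46, 64],
                     [64, 76], [76, 81], [81, 97], [97, 227]],
                    columns=["start_1", "end_1"])

# bin each token start into the sentence boundaries, as above
boundaries = np.hstack((df_1['start_1'], df_1['end_1'].iloc[-1]))
df_0['sentence'] = pd.cut(df_0['start_0'], boundaries,
                          labels=df_1.index, right=False)

# sentence label -> token, then look up every row of df_1;
# labels with no token map to NaN
mapping = dict(zip(df_0['sentence'].astype(int), df_0['token']))
df_1['token'] = df_1.index.map(mapping)

print(df_1)
```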

    what a neat solution! On the same note, I just found an interesting indexer: `IntervalIndex` : https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range – Arnold Klein Sep 09 '18 at 00:11

You can try using a for loop to accomplish this. I couldn't think of a vectorized version of it though.

import pandas as pd

df_0 = pd.DataFrame([[20, 27, "aspirin"], [50, 59, "trazodone"], [81, 88, "placebo"], [121, 127, "haldol"]], columns=["start_0", "end_0", "token"])
df_1 = pd.DataFrame([[0, 17], [17, 29], [29, 46], [46, 64], [64, 76], [76, 81], [81, 97], [97, 227]], columns=["start_1", "end_1"])

for i, row in df_0.iterrows():
    # assign the token to the sentence whose span fully contains it
    df_1.loc[(df_1.start_1 <= row.start_0) & (df_1.end_1 >= row.end_0), 'token'] = row.token

print(df_1)

Prints out the following:

   start_1  end_1      token
0        0     17        NaN
1       17     29    aspirin
2       29     46        NaN
3       46     64  trazodone
4       64     76        NaN
5       76     81        NaN
6       81     97    placebo
7       97    227     haldol
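As a possible vectorized alternative to the loop, the `IntervalIndex` mentioned in the comment on the other answer can do the containment test in one shot. This is only a sketch, assuming left-closed sentence intervals and matching on each token's start position:

```python
import pandas as pd

df_0 = pd.DataFrame([[20, 27, "aspirin"], [50, 59, "trazodone"],
                     [81, 88, "placebo"], [121, 127, "haldol"]],
                    columns=["start_0", "end_0", "token"])
df_1 = pd.DataFrame([[0, 17], [17, 29], [29, 46], [46, 64],
                     [64, 76], [76, 81], [81, 97], [97, 227]],
                    columns=["start_1", "end_1"])

# build left-closed intervals [start_1, end_1) from the sentence boundaries
intervals = pd.IntervalIndex.from_arrays(df_1['start_1'], df_1['end_1'],
                                         closed='left')

# for each token start, get the position of the interval containing it
pos = intervals.get_indexer(df_0['start_0'])
df_1.loc[pos, 'token'] = df_0['token'].values

print(df_1)
```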