R unnest with Sentence start and end positions

Question

I am new to R. I am using tidytext::unnest_tokens to break a long text into individual sentences with the code below:

tidy_drugs <- drugstext.raw %>% unnest_tokens(sentence, Section, token="sentences")

So I get a data.frame with all the sentences converted into rows.

I would like to get the start and end positions for each sentence that is unnested from the long text.

Here is a sample of the long text file. It is from a drug label.

<< *6.1 Clinical Trial Experience
  Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice.
 The data below reflect exposure to ARDECRETRIS as monotherapy in 327 patients with classical Hodgkin lymphoma (HL) and systemic anaplastic large cell lymphoma (sALCL), including 160 patients in two uncontrolled single-arm trials (Studies 1 and 2) and 167 patients in one placebo-controlled randomized trial (Study 3).
 In Studies 1 and 2, the most common adverse reactions were neutropenia, fatigue, nausea, anemia, cough, and vomiting.*

The desired result is a data frame with three columns: the sentence, its start position, and its end position.


When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Feb 23 '18 at 17:00
Added some more content to the problem statement. Thanks for helping. — Krishna, Feb 23 '18 at 17:43

score 1 · Answer 1 · answered Feb 24 '18 at 00:39

You can do this with str_locate from stringr. The main difficulty is that newlines and special characters in the text can break the regular expressions you search with. Here we first remove newlines from the input text with str_replace_all, then unnest the tokens, keeping the original text (drop = FALSE) and preventing the case change (to_lower = FALSE). Then we make a new regex column, replacing the special characters (here (, ), and .) with properly escaped versions, and use str_locate to add the start and end of each sentence.

I don't get the same numbers as you, but I copied the text from your code which doesn't always keep all characters, and your final end number is smaller than the start in any case.

library(tidyverse)
library(tidytext)

raw_text <- tibble(section = "6.1 Clinical Trial Experience
  Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice.
                   The data below reflect exposure to ARDECRETRIS as monotherapy in 327 patients with classical Hodgkin lymphoma (HL) and systemic anaplastic large cell lymphoma (sALCL), including 160 patients in two uncontrolled single-arm trials (Studies 1 and 2) and 167 patients in one placebo-controlled randomized trial (Study 3).
                   In Studies 1 and 2, the most common adverse reactions were neutropenia, fatigue, nausea, anemia, cough, and vomiting."
)

tidy_text <- raw_text %>%
  # Remove newlines so they cannot interfere with matching
  mutate(section = str_replace_all(section, "\\n", "")) %>%
  # Split into sentences; keep the full text and the original case
  unnest_tokens(
    output = sentence,
    input = section,
    token = "sentences",
    drop = FALSE,
    to_lower = FALSE
    ) %>%
  # Escape the regex metacharacters that occur in this text: ( ) .
  mutate(
    regex = str_replace_all(sentence, "\\(", "\\\\("),
    regex = str_replace_all(regex, "\\)", "\\\\)"),
    regex = str_replace_all(regex, "\\.", "\\\\.")
  ) %>%
  # Locate each escaped sentence in the full text
  mutate(
    start = str_locate(section, regex)[, 1],
    end = str_locate(section, regex)[, 2]
  ) %>%
  select(sentence, start, end) %>%
  print()
#> # A tibble: 3 x 3
#>   sentence                                                     start   end
#>   <chr>                                                        <int> <int>
#> 1 6.1 Clinical Trial Experience  Because clinical trials are ~     1   290
#> 2 The data below reflect exposure to ARDECRETRIS as monothera~   310   626
#> 3 In Studies 1 and 2, the most common adverse reactions were ~   646   762

Created on 2018-02-23 by the reprex package (v0.2.0).
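
A possible variation, not part of the original answer: stringr's fixed() modifier makes str_locate compare the pattern literally, so the escaping step may not be needed at all. The sketch below uses a short made-up text, and the names section, sentences and result are only illustrative.

```r
library(stringr)

# Hypothetical sample text and its sentences (illustrative only).
section <- "Rates (see Study 1) may vary. Exposure was 327 patients (HL and sALCL). Cough was common."
sentences <- c(
  "Rates (see Study 1) may vary.",
  "Exposure was 327 patients (HL and sALCL).",
  "Cough was common."
)

# fixed() matches the pattern as literal text, so "(", ")" and "."
# are not treated as regex syntax and need no escaping.
pos <- str_locate(section, fixed(sentences))

result <- data.frame(
  sentence = sentences,
  start = pos[, "start"],
  end = pos[, "end"]
)
print(result)
```

In the pipeline above, this would presumably amount to dropping the regex column and calling str_locate(section, fixed(sentence)). Note that str_locate only reports the first match, so a sentence that occurs twice in the text would get the same position both times.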

Calum, amazing help! I appreciate it a lot! Would you know why I get the following error when I run your code as-is? Error in stri_locate_first_regex(string, pattern, opts_regex = opts(pattern)) : Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN) — Krishna, Feb 24 '18 at 05:50
I have no idea, I made this a reproducible example. Did you copy my code exactly? Because you didn't provide your data in an easily reproducible manner (including asterisks, arbitrary newlines, opening <<) it is hard to troubleshoot. Most likely something in the regex is not matching and creating a mismatched parenthesis. — Calum You, Feb 26 '18 at 22:51
Calum, Yes I copied the code exactly, including the raw_text. I am trying out a few things and will let you know. Thanks for responding. — Krishna, Feb 27 '18 at 05:12
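
One way to narrow down an error like this, sketched under the assumption that a single pattern fails to compile: try each pattern on its own and collect the ones that raise an error. The patterns vector below is made up for illustration; in the answer's pipeline it would be the regex column.

```r
library(stringr)

# Hypothetical patterns: the first is properly escaped, the second
# contains an unescaped, unbalanced ")".
patterns <- c("trials \\(Study 3\\)\\.", "lymphoma (HL) and sALCL)")

# Use each pattern as a regex against a dummy string and record
# whether that succeeds (TRUE) or raises an error (FALSE).
compiles <- vapply(
  patterns,
  function(p) {
    tryCatch(
      {
        str_detect("x", p)
        TRUE
      },
      error = function(e) FALSE
    )
  },
  logical(1),
  USE.NAMES = FALSE
)

# Patterns that could not be used as a regular expression
patterns[!compiles]
```

Any pattern returned here still contains a character that needs escaping, or should be matched literally instead.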
