Split text by headings as delimiters and save as dataframe columns in R

Question

I have data frame of drugs (df) and their associated information in a text column with a number of headings (two of which are provided as examples). I need to split the text and have the according text in separate columns (as provided in the required data frame)

heads <- c("Indications", "Administration")
df <- data.frame(drugs = c("acetaminophen", "prednisolone"), text = c("Indications1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\nAdministration\nUsually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.", "Indications \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\nAdministration\nGeneralDosage depends on the condition of indications and the patient response."))

required <- data.frame(drugs = c("acetaminophen", "prednisolone"), Indications = c(c("Pain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.", "Treatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.")), Administration = c("Usually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.", "GeneralDosage depends on the condition of indications and the patient response."))

What I've tried

Using `strsplit`

This gives me a list but I don't have the headings and because of the fact that not all drug have all of the headings this doesn't work. Also I don't know how to incorporate it into the existing df

library(rebus)

head.rx <- sapply(heads, function(x) as.regex(x) %R% free_spacing(any_char(0,3)) %R% newline(1,2)) %R% optional(space(0,3))
split <- strsplit(df$text[1], or1(head.rx), perl = T))

Getting start and end for each heading

To extract the text in between (sorry if it's very preliminary ... I'm not so good at custom functions)

extract_heading <- function(text){
  
  #-1 is because I thought It would throw an error for the last heading
  extract.list <- vector(mode = "list", length = length(heads)-1)
  names(extract.list) <- heads[1:length(heads)-1]
 
  for (i in 1:length(heads)-1) {
    
    #the start and end regexes (based on the text to capture only the headings)
    start <- as.regex(heads[i]) %R% free_spacing(any_char(0,3)) %R% newline(1,2)
    end <- as.regex(heads[i+1]) %R% free_spacing(any_char(0,3)) %R% newline(1,2)
    
    #the strings that need to be extracted (from one heading to the next)
    rx <- start %R% free_spacing(any_char(3,5000)) %R% lookahead(end)
    
    #extract
    extract.list[i] <- stri_extract_first_regex(text, rx)
  }
  extract.list
}
  
##tried to see if it works (it gives me all NAs)
extract_heading(df$text[1])

Use the `map` function

But can't figure out how to do it.

head.extract <- sapply(heads, function(x) x %R% free_spacing(any_char(3,9000)) %R% heads[which(heads ==x) +1])
purrr:: map2(df$text[1], head.extract, stri_extract_first_regex(df$text[1], head.extract))

I appreciate your help in advance.

What have you tried? including code like that will greatly enchance hte quality of your question. Try looking at the `strsplit` function. Or alternatively `str_split` from the `stringr` package. — Oliver, Feb 22 '22 at 11:53
thanks. I edited the question and added the codes I've tried @Oliver — ayeh, Feb 22 '22 at 13:48

score 2 · Answer 1 · answered Feb 28 '22 at 10:49

So let’s start with the main function and the regular expression. I would use stringi’s stri_extract_all_regex for this but stringr::str_extract_all() would also work if you find that easier. Or you can use regmatches with regexpr or gregexpr if you are determined to stay in base R (see here, for example).

My suggestion for the regular expression is shown below:

library(stringi)

stri_extract_all_regex(
  df$text,
  "(?<=Indications)[\\s\\S]+(?=Administration)"
)

## [[1]]
## [1] "1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\n"
## 
## [[2]]
## [1] " \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\n"

The individual parts:

(?<=Indications) is a lookbehind meaning it matches the position following ‘Indications’
[\\s\\S] matches any character (. matches any character except \n, which is essential here)
+? indicates we want at least 1 character matching [\\s\\S], we also use lazy matching to get shorter strings.
(?=Administration) is a lookbehind meaning it matches the position followed by ‘Administration’

This means we extract the string between and not including ‘Indications’ and ‘Administration’.

Next, we want to wrap this in a function to make it more flexible:

extract_between <- function(str, string_1, string_2) {
  unlist(stri_extract_all_regex(
    df$text,
    paste0("(?<=", string_1, ")[\\s\\S]+?(?=", string_2, ")")
  ))
}

The function extracts all characters between but not including string_1 and string_2. Try it out if you like.

Finally, we want to create a new column for each headline. I use a simple for loop for this. You could use lapply to maybe make it more efficient, but I did not test if that would improve anything and it makes the code less readable.

# for the final match, we need the $ which represents the end of the string
heads_new <- c(heads, "$")

for (i in seq_len(length(heads_new) - 1)) {
  df[[heads_new[i]]] <- extract_between(
    df$text,
    string_1 = heads_new[i],
    string_2 = heads_new[i + 1]
  )
}

# for nicer printing
tibble::as_tibble(df)

## # A tibble: 2 × 4
##   drugs         text              Indications           Administration          
##   <chr>         <chr>             <chr>                 <chr>                   
## 1 acetaminophen "Indications1\nP… "1\nPain\nSymptomati… "\nUsually administered…
## 2 prednisolone  "Indications \nT… " \nTreatment of a w… "\nGeneralDosage depend…

This assumes the headlines are in the correct order and you know that order. You can change the behaviour by using all headlines as string2 at the same time so matching stops as soon as R encounters another headline (this is the reason I use lazy mode, i.e. ?, above). I would say that will generally produce more issues as your headlines might occur elsewhere in the text, so I would prefer the first approach if possible:

extract_between(
  df$text,
  heads_new[1],
  paste0("(", paste0(heads_new, collapse = "|"), ")")
)

## [1] "1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\n"                                                                                                              
## [2] " \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\n"

Thank you for the help. your solution works, but the thing is that I have numerous headings (there may be a heading missing for one drug but they are all in the same order). Can you please change ```extract_between``` to extract text between multiple headings? — ayeh, Mar 02 '22 at 04:35
Do you mean to move the loop into the function? It should already work on many headlines if you use the for loop. — JBGruber, Mar 02 '22 at 08:15
Yes.. Sorry. you're right. I will accept your answer anyway because it would probably work for similar texts, but actually I got an error and I think it's because of the regex pattern. ```paste0("(?<=", string_1, ")[\\s\\S]+?(?=", string_2, ")")``` because it works on certain columns. I would be grateful if you edited the regex to match the the associate ```head.rx``` (mentioned in the question) for each heading (I'm not very good at regexes and it would take me ages to figure this out). — ayeh, Mar 02 '22 at 12:59
I don't know anything about the rebus package you are using. If you want a regex helper, maybe look into rex? I can't help you with that either though. If you share your error I can probably help you with that. I could imagine that some of your headlines use regex characters (e.g., `.`). If that is the case, you can use `rex::rex(heads)` to escape all problematic characters. — JBGruber, Mar 03 '22 at 10:08

score 2 · Answer 2 · answered Feb 28 '22 at 18:47

A base R option using strsplit

with(
  df,
  cbind(df,
  setNames(
  as.data.frame(
    do.call(rbind,strsplit(
    text,
    split = sprintf("(%s).*?\\n",paste0(heads,collapse = "|")),
    perl = TRUE
  ))[,-1]),
  heads))
)

gives

          drugs
1 acetaminophen
2  prednisolone

                                                                                                                                                                         text
1                                         Indications1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\nAdministration\nUsually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.
2 Indications \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on 
blood and lymphatic systems in the palliative treatment of various diseases.\nAdministration\nGeneralDosage depends on the condition of indications and the patient response.

                                                     Indications
1                                                                                                               Pain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\n
2 Treatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\n
                                                                                                                                         Administration
1 Usually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.
2                                                                       GeneralDosage depends on the condition of indications and the patient response.

Thank you for the help. Your solution worked but the problem was that it yielded a data frame with about three times as much columns as I have (especially for the last heading called **references** under which there are as many as 150 references). I will try to change the regex patterns to see if it works. — ayeh, Mar 02 '22 at 04:39

score 1 · Answer 3 · answered Mar 06 '22 at 18:11

A for loop option here (guess not so friendly).

but first:

Need to assume the heading names, its order, and that they do not repeat on the content.

if so:

n<-c("Indications","Administration")


df1<-df["drugs"]
df1[,n]<-NA



for (i in length(n):1){

#For the first heading
  if (i == 1){ 
    df1[,n[1]]<-df$text[grepl(n[1],df$text)]
    df1[,n[1]]<- gsub("\n"," ",df1[,n[1]])
    df1[,n[1]]<-sub(paste0(".*",n[1]," (.+)",n[2]," .*"),"\\1",df1[,n[1]])
    df1[,n[1]]<- gsub(n[1]," ",df1[,n[1]])
    
    }else{
      
      #For the last heading
      if (i == length(n)){ 
        df1[,n[length(n)]]<-df$text[grepl(n[length(n)],df$text)]
        df1[,n[length(n)]]<- gsub("\n"," ",df1[,n[length(n)]])
        df1[,n[length(n)]]<-sub(paste0(".*",n[length(n)]," (.+)"),"\\1",df1[,n[length(n)]])
        df1[,n[length(n)]]<- gsub(n[length(n)]," ",df1[,n[length(n)]])
        }else{
          
          #Remaining headings
          df1[,n[i]]<-df$text[grepl(n[i],df$text)]
          df1[,n[i]]<- gsub("\n"," ",df1[,n[i]])
          df1[,n[i]]<-sub(paste0(".*",n[i]," (.+)",n[i+1]," .*"),"\\1",df1[,n[i]])
          df1[,n[i]]<- gsub(n[i]," ",df1[,n[i]])
          }
    }
  }

Maybe not the most straightforward, but solving problems in different ways always opens the mind... Thank you for your help. — ayeh, Mar 07 '22 at 10:39

Split text by headings as delimiters and save as dataframe columns in R

What I've tried

Using strsplit

Getting start and end for each heading

Use the map function

3 Answers3

Using `strsplit`

Use the `map` function