I have a data frame of drugs (df) and their associated information in a text column with a number of headings (two of which are provided as examples). I need to split the text and put the text that follows each heading into its own column (as shown in the required data frame).
heads <- c("Indications", "Administration")
df <- data.frame(drugs = c("acetaminophen", "prednisolone"), text = c("Indications1\nPain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.\nAdministration\nUsually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.", "Indications \nTreatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.\nAdministration\nGeneralDosage depends on the condition of indications and the patient response."))
required <- data.frame(drugs = c("acetaminophen", "prednisolone"), Indications = c(c("Pain\nSymptomatic relief of mild to moderate pain.Fever\nReduction of fever.Self-medication to reduce fever in infants, children, and adults.", "Treatment of a wide variety of diseases and conditions; used principally for glucocorticoid effects as an anti-inflammatory and immunosuppressant agent and for its effects on blood and lymphatic systems in the palliative treatment of various diseases.")), Administration = c("Usually administered orally; may be administered rectally as suppositories in patients who cannot tolerate oral therapy. Also may be administered IV.", "GeneralDosage depends on the condition of indications and the patient response."))
What I've tried
Using strsplit
This gives me a list, but the headings are lost, and because not all drugs have all of the headings, this doesn't work.
I also don't know how to incorporate the result into the existing df.
library(rebus)
head.rx <- sapply(heads, function(x) as.regex(x) %R% free_spacing(any_char(0,3)) %R% newline(1,2)) %R% optional(space(0,3))
split <- strsplit(df$text[1], or1(head.rx), perl = TRUE)
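For comparison, here is a minimal base R sketch of the same splitting idea that keeps track of which heading each piece belongs to, returns NA for headings a drug does not have, and binds the result back onto the drugs column. It is only a sketch under assumptions: it uses base regexpr rather than rebus, it assumes every heading starts a line and is followed by at most 3 stray characters and a newline, the helper name split_by_heads is hypothetical, and the data is a shortened toy version of df with a made-up third drug (drugX) that lacks a heading.

```r
heads <- c("Indications", "Administration")

# Shortened toy data (hypothetical); drugX has no Administration heading
df <- data.frame(
  drugs = c("acetaminophen", "prednisolone", "drugX"),
  text = c(
    "Indications1\nPain\nRelief of mild pain.\nAdministration\nUsually oral.",
    "Indications \nInflammation.\nAdministration\nDosage depends on response.",
    "Indications\nExample use."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one text into a named vector, one element per heading
split_by_heads <- function(text, heads) {
  out <- setNames(rep(NA_character_, length(heads)), heads)
  # Heading at the start of a line, up to 3 stray characters, then a newline
  pats <- paste0("(^|\n)", heads, "[^\n]{0,3}\n")
  m <- lapply(pats, regexpr, text = text, perl = TRUE)
  start <- vapply(m, function(x) as.integer(x), integer(1))
  len <- vapply(m, function(x) attr(x, "match.length"), integer(1))
  # Headings that were found, in the order they appear in the text
  found <- which(start > 0)
  found <- found[order(start[found])]
  for (k in seq_along(found)) {
    i <- found[k]
    from <- start[i] + len[i]
    # Stop just before the next heading, or at the end of the text
    to <- if (k < length(found)) start[found[k + 1]] - 1L else nchar(text)
    out[heads[i]] <- trimws(substr(text, from, to))
  }
  out
}

# One row per drug, one column per heading
parts <- do.call(rbind, lapply(as.character(df$text), split_by_heads, heads = heads))
result <- data.frame(drugs = df$drugs, parts, stringsAsFactors = FALSE)
result
```

Because the pieces are located by position rather than by splitting, a heading that is absent simply leaves its column as NA instead of shifting the other pieces.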
Getting start and end for each heading
To extract the text in between (sorry if it's very preliminary ... I'm not so good at custom functions)
library(stringi)  # provides stri_extract_first_regex()
extract_heading <- function(text){
  # -1 because I thought it would throw an error for the last heading
  n <- length(heads) - 1
  extract.list <- vector(mode = "list", length = n)
  names(extract.list) <- heads[seq_len(n)]
  # seq_len(n) instead of 1:length(heads)-1, which parses as (1:length(heads)) - 1 and starts at 0
  for (i in seq_len(n)) {
    # the start and end regexes (based on the text to capture only the headings)
    start <- as.regex(heads[i]) %R% free_spacing(any_char(0,3)) %R% newline(1,2)
    end <- as.regex(heads[i+1]) %R% free_spacing(any_char(0,3)) %R% newline(1,2)
    # the strings that need to be extracted (from one heading to the next)
    rx <- start %R% free_spacing(any_char(3,5000)) %R% lookahead(end)
    # extract
    extract.list[[i]] <- stri_extract_first_regex(text, rx)
  }
  extract.list
}
##tried to see if it works (it gives me all NAs)
extract_heading(df$text[1])
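A possible reason for the NAs, offered as a guess rather than a diagnosis: in most regex engines "." does not match a newline unless dot-all mode (?s) is switched on, so a pattern that has to span several lines never reaches the next heading. Separately, 1:length(heads)-1 parses as (1:length(heads)) - 1, so a loop written that way starts at 0. The sketch below illustrates both points with base R only (regexec and regmatches instead of rebus and stringi) on a shortened toy string; the pattern shape is an assumption about how the headings look.

```r
heads <- c("Indications", "Administration")

# Shortened toy text, not the full monograph
text <- "Indications1\nPain\nRelief of pain.\nAdministration\nUsually oral."

# Operator precedence: ':' binds tighter than '-'
n <- length(heads)
1:n - 1          # 0 1
seq_len(n - 1)   # 1

# (?s) lets '.' match newlines; the lazy (.*?) stops at the first following heading line
rx <- paste0("(?s)", heads[1], "[^\n]{0,3}\n(.*?)\n", heads[2], "[^\n]{0,3}\n")

# regexec() returns the full match followed by the capture groups
m <- regmatches(text, regexec(rx, text, perl = TRUE))[[1]]
between <- m[2]   # NA if the pattern did not match
between
```

Using a capture group rather than a lookahead keeps the heading lines out of the extracted text without any further trimming.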
Using the map function
But I can't figure out how to do it.
head.extract <- sapply(heads, function(x) x %R% free_spacing(any_char(3,9000)) %R% heads[which(heads ==x) +1])
# map2() takes the function itself as its third argument, not a call to it
purrr::map2(df$text[1], head.extract, stri_extract_first_regex)
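The mapping idea can also be sketched without purrr: build one pattern per heading, then apply each pattern to every text so that each heading yields one column. The sketch below swaps in base R lapply for map, uses shortened toy texts, and assumes that a section ends at the next heading line or at the end of the string; extract_one and any_head are hypothetical names introduced only for illustration.

```r
heads <- c("Indications", "Administration")

# Shortened toy texts (hypothetical); the second has no Administration heading
texts <- c(
  "Indications1\nPain\nRelief of pain.\nAdministration\nUsually oral.",
  "Indications \nInflammation."
)

# Any heading line, used as the stopping point for the lazy match
any_head <- paste0("(?:", paste(heads, collapse = "|"), ")[^\n]{0,3}\n")

# One pattern per heading: heading line, then text up to the next heading or the end
pats <- paste0("(?s)(?:^|\n)", heads, "[^\n]{0,3}\n(.*?)(?=\n", any_head, "|$)")

# Apply one pattern to all texts; NA where the heading is absent
extract_one <- function(pat, x) {
  vapply(regmatches(x, regexec(pat, x, perl = TRUE)),
         function(m) if (length(m) >= 2) trimws(m[2]) else NA_character_,
         character(1))
}

# One column per heading, one row per text
cols <- do.call(cbind, lapply(setNames(pats, heads), extract_one, x = texts))
cols
```

The resulting matrix can be attached to the drug names with data.frame() or cbind(), and the last heading needs no special case because the end of the string counts as a stopping point.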
I appreciate your help in advance.