Background
I have a dataframe, shown below (synthetic data, for those who are interested). It holds semi-structured text separated by headers. The header titles are always the same and always occur in the same order, but some headers are missing from some reports.
The Data
structure(list(OGDReportWhole = c("Hospital: Random NHS Foundation Trust\nHospital Number: J6044658\nPatient Name: Jargon, Victoria\nGeneral Practitioner: Dr. Martin, Marche\nDate of procedure: 2009-11-11\nEndoscopist: Dr. Sullivan, Shelby\nSecond endoscopist: Dr. al-Basha, Mahfoodha\nMedications: Fentanyl 12.5mcg\nMidazolam 6mg\nInstrument: FG5\nExtent of Exam: GOJ\nIndications: Follow-up ULCER HEALING\nProcedure Performed: Gastroscopy (OGD)\nFindings: No evidence of Barrett's oesophagus, short 2 cn hiatus hernia.,Oesophageal biopsies taken from three levels as requested.,OGD today to assess for ulceration/ongoing bleeding.,Diaphragmatic pinch:40cm .,She has a small hiatus hernia .,We will re-book for 2 weeks, rebanding.,Tiny erosions at the antrum.,Biopsies taken from top of stricture-metal marking clips in situ.,The varices flattened well with air insufflation.,He is on Barrett's Screeling List in October 2017 at St Thomas'.\nHALO 90 done with good effect\nEndoscopic Diagnosis: Post chemo-radiotherapy stricture ",
"Hospital: Random NHS Foundation Trust\nHospital Number: Y6417773\nPatient Name: Powell, Destiny\nGeneral Practitioner: Dr. al-Safi, Lutfiyya\nDate of procedure: 2008-06-15\nEndoscopist: Dr. Kekich, Annabelle\nSecond endoscopist: Dr. Needham, April\nMedications: Fentanyl 125mcg\nMidazolam 7mg\nInstrument: FG6\nExtent of Exam: Pylorus\nIndications: Weight Loss\nProcedure Performed: Gastroscopy (OGD)\nFindings: Duodenum: Duodenitis with a small erosion .,STOMACH: diffuse gastritis with angiodysplasia and punctate bleeding site on greater curve mid body - no obvious ulcer- antrum scar ?,No immediate complications.,Z-line at: 38cm - Bravo placed at 32cm- good positionat check endoscopy.\n\nEndoscopic Diagnosis: Esophageal candidiasis "
)), row.names = 1:2, class = "data.frame")
My current solution
I created a function that extracts the text based on a list of character delimiters (the names of the headers).
At the moment it takes the dataframe that holds the text (x), the text column (y), the start header and the end header, and the name of the column to create (which is the start header).
This works OK, I think:
#' @param x the dataframe
#' @param y the column to extract from
#' @param stra the start of the boundary to extract
#' @param strb the end of the boundary to extract
#' @param t the column name to create
Extractor2 <- function(x, y, stra, strb, t) {
  x <- data.frame(x)
  # Build the new column name: keep only letters, digits and commas
  t <- gsub("[^[:alnum:],]", " ", t)
  t <- gsub(" ", "", t, fixed = TRUE)
  # Grab everything from the start header to the end header;
  # dotall = TRUE lets "." match newlines as well
  x[, t] <- stringr::str_extract(x[, y], stringr::regex(paste(stra,
    "(.*)", strb, sep = ""), dotall = TRUE))
  # Drop anything from the first literal backslash onwards
  x[, t] <- gsub("\\\\.*", "", x[, t])
  names(x[, t]) <- gsub(".", "", names(x[, t]), fixed = TRUE)
  x[, t] <- gsub(" ", "", x[, t])
  # Remove the start header and, if one was given, the end header
  x[, t] <- gsub(stra, "", x[, t], fixed = TRUE)
  if (strb != "") {
    x[, t] <- gsub(strb, "", x[, t], fixed = TRUE)
  }
  x[, t] <- gsub(" ", "", x[, t])
  # ColumnCleanUp() is a helper that is not shown in this post
  x[, t] <- ColumnCleanUp(x[, t])
  return(x)
}
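The central step in the function is the extraction of everything between a start header and an end header. As a minimal sketch of that step on a single string, the snippet below uses base R instead of stringr so that it runs without extra packages. The report text is made up for illustration, and trimming whitespace with trimws() is a simplification, not what Extractor2 does.

```r
# Made-up report fragment, for illustration only
txt  <- "Instrument: FG5\nExtent of Exam: GOJ\nIndications: Follow-up"
stra <- "Extent of Exam:"
strb <- "Indications:"

# (?s) lets "." match newlines, the base-R counterpart of dotall = TRUE
pattern <- paste0("(?s)", stra, "(.*)", strb)
m <- regmatches(txt, regexpr(pattern, txt, perl = TRUE))

# Strip both headers, then trim the surrounding whitespace
value <- trimws(sub(strb, "", sub(stra, "", m, fixed = TRUE), fixed = TRUE))
print(value)
```

Because (.*) is greedy, the match runs to the last occurrence of the end header, which is fine as long as each header appears only once per report.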
And I run this iteratively:
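# Header strings must match the report text exactly (including case),
# otherwise str_extract() returns NA for that column
EndoscTree <- list('Hospital Number:', 'Patient Name:', 'General Practitioner:',
                   'Date of procedure:', 'Endoscopist:', 'Second endoscopist:', 'Medications:',
                   'Instrument:', 'Extent of Exam:', 'Indications:', 'Procedure Performed:',
                   'Findings:', 'Endoscopic Diagnosis:')
# Each header is paired with the next one, so the last header only acts as an end boundary
for (i in 1:(length(EndoscTree) - 1)) {
  Mydata <- Extractor2(Mydata, 'OGDReportWhole', as.character(EndoscTree[i]),
                       as.character(EndoscTree[i + 1]), as.character(EndoscTree[i]))
}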
The problem
I would like instead for the function to just take a character string (instead of a dataframe and then the column name) and then add it to an empty dataframe (including the original character string).
I'm not sure how to convert the function from taking a dataframe and adding to that, to adding an inputString to an empty dataframe. I would like it to create the same output as the current function.
I'm happy to take criticism of the function in general, and to hear whether there is a better way to achieve what I am trying to do.
The answer
OK thanks to @M-M; I was being a bit slow.
The answer is easy. Just use the delimiter list to create an empty dataframe and go from there:
# One column per header, one row per report, all NA to start with
df <- data.frame(matrix(ncol = length(EndoscTree), nrow = nrow(Mydata)))
colnames(df) <- unlist(EndoscTree)
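To go from there, one possible sketch is a function that takes a single character string plus the vector of headers and returns a one-row dataframe that also keeps the original string. The function name extract_sections, the column name Original and the sample report are all hypothetical. It uses fixed string matching rather than the regex in Extractor2, and it does not reproduce ColumnCleanUp, so the output will not be identical to the current function's.

```r
# Hypothetical helper: split one report string into a one-row data frame.
# `headers` is a character vector of header titles in document order.
extract_sections <- function(input_string, headers) {
  # Position of each header in the text; -1 means the header is absent
  pos <- vapply(headers,
                function(h) as.integer(regexpr(h, input_string, fixed = TRUE)),
                integer(1))
  present <- which(pos > 0)

  # Start with NA for every header so absent headers still get a column
  out <- as.list(rep(NA_character_, length(headers)))
  names(out) <- headers

  for (k in seq_along(present)) {
    i <- present[k]
    start <- pos[i] + nchar(headers[i])
    # The section runs up to the next header that is present, or to the end
    if (k < length(present)) {
      end <- pos[present[k + 1]] - 1L
    } else {
      end <- nchar(input_string)
    }
    out[[i]] <- trimws(substr(input_string, start, end))
  }

  # optional = TRUE keeps the header names as column names unchanged
  as.data.frame(c(list(Original = input_string), out),
                optional = TRUE, stringsAsFactors = FALSE)
}

# Made-up report in which the "Instrument:" header is missing
report  <- "Patient Name: Doe, Jane\nEndoscopist: Dr. Smith\nFindings: Normal.\nSecond line."
headers <- c("Patient Name:", "Endoscopist:", "Instrument:", "Findings:")
res <- extract_sections(report, headers)
print(res[["Patient Name:"]])
```

For a whole column of reports, something like do.call(rbind, lapply(Mydata$OGDReportWhole, extract_sections, headers = unlist(EndoscTree))) should stack the one-row results. The header strings have to match the text exactly, including case, and a header that is a substring of an earlier header would be located at the wrong position.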