I have a set of HICF forms (healthcare) and want to pull certain fields from them automatically. At the moment I keep the PDFs in one directory; the code runs OCR on each file and splits the resulting text into lines at every \n.
It then combines the results from all files into one CSV file. The problem is that the output is still messy, and the fields I need end up on different lines.
I would prefer to be able to say: output the text that sits between "this word" and "that word". I will need to do this for about 9 outputs. I assumed I could use the rm_between function, but I am not sure how to incorporate it.
I would like the code to find the text between the selected words and export it to the CSV file.
How would you suggest upgrading this code?
# Install the packages once; this does not need to run on every execution.
install.packages(c("pdftools", "tesseract", "plyr", "qpcR"))
library(pdftools)
library(tesseract)
library(plyr)
library(qpcR)
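# Only pick up PDFs, so compiled.csv and other files are never passed to ocr()
file_list <- list.files(pattern = "\\.pdf$", ignore.case = TRUE)
compile <- NULL
for (file in file_list) {
  text <- ocr(file)                 # OCR text, one string per page
  test2 <- strsplit(text, "\n")     # split each page into lines
  df <- ldply(test2, data.frame)    # one row per line
  # Start with the first file, then add each later file as a new column,
  # padding shorter columns with NA
  compile <- if (is.null(compile)) df else qpcR:::cbind.na(compile, df)
}
write.csv(compile, "compiled.csv")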
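
To illustrate the kind of output I am after, here is a rough sketch in base R. It is only an assumption of how this could look: extract_between is a hypothetical helper of my own, not rm_between, and the field labels and sample text are made up rather than taken from a real form. It returns the text between the first occurrence of a start marker and the next end marker, or NA when either marker is missing.

```r
# Hypothetical helper: text between the first `start` and the following `end`.
# Markers are matched literally (fixed = TRUE), not as regular expressions.
extract_between <- function(text, start, end) {
  s <- regexpr(start, text, fixed = TRUE)
  if (s[1] == -1) return(NA_character_)          # start marker not found
  rest <- substring(text, s[1] + attr(s, "match.length"))
  e <- regexpr(end, rest, fixed = TRUE)
  if (e[1] == -1) return(NA_character_)          # end marker not found
  trimws(substring(rest, 1, e[1] - 1))           # drop surrounding whitespace
}

# Made-up sample text standing in for the OCR output of one form
page <- "Patient Name: Jane Doe\nDate of Birth: 01/02/1980\nInsured ID: ABC123\nGroup: 42"

# One entry per wanted output: c(start marker, end marker); labels are placeholders
fields <- list(
  patient_name = c("Patient Name:", "Date of Birth:"),
  birth_date   = c("Date of Birth:", "Insured ID:"),
  insured_id   = c("Insured ID:", "Group:")
)

# Extract every field, then shape the result as a one-row data frame
row <- sapply(fields, function(m) extract_between(page, m[1], m[2]))
result <- as.data.frame(t(row), stringsAsFactors = FALSE)
```

With one such row per PDF, the rows could presumably be stacked with rbind and written out with write.csv, giving one column per field instead of one column per file.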