
I have a bunch of HICF forms (healthcare claim forms), and I want to automatically pull certain fields from them. Currently, I have a set of PDFs in a directory; the code below reads them and splits each file's text into separate lines wherever there is an \n.

It then combines all of the data sets into one file. The issue is that the data is still a bit messy and ends up on different lines.

I would prefer to be able to say, "output the text that is between "this word" and "that word"". I will need to add code like this for about 9 outputs. I assumed I could use the rm_between function, but I am not sure how to incorporate it.
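Something like this is roughly what I had in mind (I'm not sure this is the right way to use rm_between, and "Patient Name" / "Insured Name" are only placeholders for the real field labels on the forms):

library(qdapRegex)
# extract = TRUE should return whatever sits between the two marker words
rm_between(text, left = "Patient Name", right = "Insured Name", extract = TRUE)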

I would like the output to find the text between selected words and export this data to the CSV file.

How would you suggest upgrading this code?

install.packages("pdftools")
install.packages("tesseract")
install.packages("plyr")
install.packages("qpcR")

library(pdftools)
library(tesseract)
library (plyr)
library(qpcR)
text <- ocr("POC File 12.20 (3).pdf")
test2<-strsplit(text,"\n")
df <- ldply (test2, data.frame)
compile<-df



file_list <- list.files()
for (file in file_list){
 text <- ocr(file)
 test2<-strsplit(text,"\n")
 df <- ldply (test2, data.frame)
 compile<-qpcR:::cbind.na(compile,df)
}
write.csv(compile,"compiled.csv")
NelsonGon
  • Welcome. Please see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for how to properly ask questions. It's hard to tell what you want to be able to do without a reproducible example. Please provide a `dput` of some sample code and the desired output to help us get started. – hmhensen Dec 20 '18 at 23:42

1 Answer


I like the stringr package for extracting parts of a text, which I think is what you're looking for. I've also included some example data; does this do what you want?

library(stringr)

# Example input: each element is one line of OCR'd text
mytextlines <- c("somedata_This word WantedData That word",
                 "NothingToExtractHere",
                 "somedata_other word WantedOtherData other close")

# One regex per field: capture whatever sits between the two marker words
LookFor <- c(Tag1 = "This word *(.*?) *That word",
             Tag2 = "Other word *(.*?) *Other close")

found <- sapply(LookFor, function(look) {
  # str_extract pulls out the whole match (or NA); gsub then keeps only the captured group
  gsub(look, '\\1', str_extract(mytextlines, pattern = regex(look, ignore_case = TRUE)), ignore.case = TRUE)
})

It will output a matrix, with a row for each line of text and a column for each tag you are looking for, and NA wherever nothing was found in that line.
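With the example data above, found should look roughly like this:

     Tag1         Tag2
[1,] "WantedData" NA
[2,] NA           NA
[3,] NA           "WantedOtherData"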

The regular expressions are looking for something:

  • starting with "This word",
  • followed by any spaces,
  • followed by anything (but if it ends with space(s), then leave them for the next part),
  • followed by any spaces,
  • followed by "That word"

And gsub replaces those five elements with only the third one (the part between the parentheses).
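To tie this back to your loop and the CSV export: a minimal sketch, assuming the OCR step from your question and placeholder marker words that you would swap for the real field labels on your forms, could look like this:

library(pdftools)
library(tesseract)
library(stringr)

# One pattern per field; replace the marker words with the labels on your forms
LookFor <- c(Field1 = "This word *(.*?) *That word",
             Field2 = "Other word *(.*?) *Other close")

file_list <- list.files(pattern = "\\.pdf$")
compile <- do.call(rbind, lapply(file_list, function(file) {
  lines <- unlist(strsplit(ocr(file), "\n"))
  # For each pattern, keep the first line where something was extracted
  found <- sapply(LookFor, function(look) {
    hits <- gsub(look, "\\1",
                 str_extract(lines, regex(look, ignore_case = TRUE)),
                 ignore.case = TRUE)
    hits[which(!is.na(hits))[1]]
  })
  data.frame(file = file, t(found), stringsAsFactors = FALSE)
}))
write.csv(compile, "compiled.csv", row.names = FALSE)

This gives one row per PDF and one column per field, which should be easier to work with than binding the raw OCR lines side by side.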

Emil Bode