I have a bunch of PDFs that I would like to search through in order to quickly locate tables and graphs relevant to my research.
#I load the following libraries
library(pdfsearch)
library(tm)
library(pdftools)
#I assign the directory of my PDF files to the path where they are located
directory <- '/References'
#and then I search the directory for the keywords "table", "graph", and "chart"
txt <- keyword_directory(directory,
                         keyword = c('table', 'graph', 'chart'),
                         split_pdf = TRUE,
                         remove_hyphen = TRUE,
                         full_names = TRUE)
#Up to this point everything works fine. I get a nice data.frame called "txt"
#with 1356 rows and 7 columns. However, when I try to search the data.frame
#I start running into trouble.
#I start with "hunter", a term that I know appears in the token_text column
txt[which(txt$token_text == 'hunter'), ]
#executing this code produces the following message
#> [1] ID pdf_name keyword page_num line_num line_text token_text
#> <0 rows> (or 0-length row.names)
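To make the symptom easier to reproduce without my PDFs, here is a minimal sketch on made-up data. The `toy` data.frame, its contents, and the `tokens` column are hypothetical stand-ins, and I am only assuming that each cell of my real token_text column holds more than the single word "hunter" (a whole line, or a list of tokens). Under that assumption, `==` can only match cells that are exactly "hunter", whereas `grepl()` or `%in%` look inside each cell.

```r
# Hypothetical stand-in for the real `txt`; the cell contents are assumed.
toy <- data.frame(
  ID = 1:3,
  token_text = c("the hunter returned at dawn",
                 "table 2 shows harvest totals",
                 "hunter"),
  stringsAsFactors = FALSE
)

# `==` compares each whole cell with the pattern, so only a cell that is
# exactly "hunter" matches.
exact <- toy[toy$token_text == "hunter", ]

# grepl() tests whether the pattern occurs anywhere inside each cell.
partial <- toy[grepl("hunter", toy$token_text, fixed = TRUE), ]

# If the column were a list of token vectors instead (assumed here by
# splitting on spaces), each element has to be checked separately.
toy$tokens <- strsplit(toy$token_text, " ", fixed = TRUE)
has_token <- vapply(toy$tokens, function(tok) "hunter" %in% tok, logical(1))
by_token <- toy[has_token, ]

nrow(exact)
nrow(partial)
nrow(by_token)
```

Running `str(txt$token_text)` on the real data would show which of these two cases, if either, applies.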
Am I using the right approach to search my data.frame? Is there an easier way to cross-reference this data? Is there a package designed to help search through a large collection of PDFs? Thanks for your time.