
I have a bunch of PDFs that I would like to search through in order to quickly locate tables and graphs relevant to my research.

#I load the following libraries
library(pdfsearch)
library(tm)
library(pdftools)

#I assign the directory of my PDF files to the path where they are located
directory <- '/References'

#and then I search the directory for the keywords "table", "graph", and "chart"
txt <- keyword_directory(directory,
 keyword = c('table', 'graph', 'chart'),
 split_pdf = TRUE,
 remove_hyphen = TRUE,
 full_names = TRUE)

#Up to this point everything works fine. I get a nice data.frame called "txt" 
#with 1356 objects in 7 columns. However, when I try to search the data.frame 
#I start running into trouble.

#I start with "hunter", a term that I know appears in the token_text column
txt[which(txt$token_text == 'hunter'), ]

#executing this code produces the following message
[1] ID pdf_name keyword page_num line_num line_text token_text
<0 rows> (or 0-length row.names)

Am I using the right tool to search through my data.frame? Is there an easier way to cross reference this data? Is there a package out there somewhere that is designed to help one crawl through a mountain of PDFs? Thanks for your time

Ian Kemp
  • `which` is for exact matching. I suspect you want `grepl` or other regular-expression based tools. e.g.: `grepl("hunter", c("hunter2","other"))` vs. `which(c("hunter2","other") == "hunter")` – thelatemail May 16 '19 at 01:54
  • Or https://stackoverflow.com/questions/44759180/filter-by-multiple-patterns-with-filter-and-str-detect – Tung May 16 '19 at 02:44
  • @thelatemail I tried ```grepl("hunter", c("hunter2","other"))``` and just got ```[1] TRUE FALSE``` I'm trying to locate the precise row a given keyword appears. Also, why is it giving me a false response? – Travis Hamon May 16 '19 at 18:09
  • @Tung I tried ```txt %>% filter( str_detect(txt$token_text, "hunt|hunter|gather|forage|forager|gatherer") ) ``` and it simply printed out the entire data.frame minus the column token_text. – Travis Hamon May 16 '19 at 18:22
  • So, ```grepl("hunter", txt$token_text)``` will give me a TRUE or FALSE on whether the search term "hunter" appears on a given row, but it's difficult to tell what the corresponding row is for the result. Rather than a cloud of TRUE and FALSE it would be nice to have a row number to associate with each result. – Travis Hamon May 16 '19 at 18:59
  • you got TRUE for matching hunter to hunter2 but FALSE for matching hunter to other. You can use this sequence of TRUE/FALSE to subset a data.frame. – thelatemail May 16 '19 at 19:04
  • If you want the row number instead of TRUE/FALSE, then use `grep` instead of `grepl` – Maharero May 17 '19 at 00:45

1 Answer


A comparison such as txt$token_text == 'hunter' returns TRUE or FALSE for every value it is given (e.g. every value in a data frame's column), and the which function converts that logical vector into the positions of the TRUE values. You can subset a data frame by passing either the TRUE/FALSE values or those positions as the row index for the rows you want to keep.

Combining this you get:
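txt[which(txt$token_text == 'hunter'), ]
This is what you ran, and it returned no rows. As was pointed out in the comments, this is exact matching (strictly, it is the == comparison that matches exactly; which only reports where it succeeded), so you may simply have no rows whose token_text is exactly 'hunter'.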
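To make the difference concrete, here is a minimal sketch on a made-up character vector (the values in `tokens` are invented for illustration and are not taken from your PDFs). It shows that `==` only matches whole values and that `which` turns the resulting logical vector into positions:

```r
# Hypothetical tokens, invented for this illustration.
tokens <- c("hunter", "hunter2", "Hunter", "other")

# `==` compares each whole value; partial or differently-cased values fail.
is_exact <- tokens == "hunter"
print(is_exact)

# `which` converts the logical vector into the positions of the TRUE values.
exact_rows <- which(is_exact)
print(exact_rows)
```

Only the first element is an exact match here, so anything like "hunter2" or "Hunter" is skipped by this approach.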

To get TRUE/FALSE based on partial matches or a regular expression, use the grepl function instead:
txt[grepl("hunter", txt$token_text, ignore.case=TRUE), ]
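As a rough sketch of how this plays out, and of how to get row numbers rather than a run of TRUE/FALSE, the example below uses a small made-up data frame (`toy` and its values are hypothetical; only the column names mirror yours). `grepl` gives the logical vector, `grep` gives the matching row numbers, and subsetting keeps the original row names:

```r
# Hypothetical stand-in for the search results; values are invented.
toy <- data.frame(
  page_num = c(2, 5, 9, 12),
  token_text = c("hunter", "table of hunters", "graph", "Hunter-gatherer"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the pattern occurs anywhere in the value.
hits <- grepl("hunter", toy$token_text, ignore.case = TRUE)

# Integer vector: the row numbers of the matches.
rows <- grep("hunter", toy$token_text, ignore.case = TRUE)
print(rows)

# Subsetting with either vector returns the same rows, and the row names
# of the result still show where each match sat in the original data.
matches <- toy[hits, ]
print(matches)
```

If you only need the positions, `which(hits)` and `rows` are interchangeable.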

For readability, I prefer doing this with the dplyr package:
library(dplyr)
txt %>% filter(grepl("hunter", token_text, ignore.case=TRUE))
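One more thing worth checking with `str(txt)`: if token_text turns out to be a list column (one character vector of tokens per row), an exact comparison against the whole cell would not behave like a comparison against a plain character column. I have not verified this against your data, so treat it as an assumption. Under that assumption, a sketch like the following tests each row's tokens individually (`toy` and its contents are hypothetical):

```r
# Assumption: token_text is a list column, one character vector per row.
# The data below is invented for illustration.
toy <- data.frame(line_num = c(3, 8, 15))
toy$token_text <- list(
  c("the", "hunter", "returned"),
  c("table", "of", "results"),
  c("Hunters", "and", "gatherers")
)

# TRUE for rows where at least one token matches the pattern.
has_hunter <- vapply(
  toy$token_text,
  function(tokens) any(grepl("hunter", tokens, ignore.case = TRUE)),
  logical(1)
)

# Row numbers of the matching rows, then the rows themselves.
print(which(has_hunter))
print(toy[has_hunter, ])
```

If `str(txt)` shows token_text as a plain character column instead, the simpler grepl call above is all you need.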

Maharero