I'm trying to read a folder of PDF files into a data frame in R. I can read individual PDF files using the pdftools library and pdf_text(filepath).
Ideally, I would grab the author and title of each PDF in the folder and push them into a data frame with a column for each, so that I can then use basic tidytext functions on the text.
For a single file right now, I can just use:
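library(pdftools)
library(tidytext)
library(dplyr)
# pdf_text() returns a character vector with one element per page
txt <- pdf_text("filepath")
# tibble() replaces the deprecated data_frame(); the column is named txt
txt <- tibble(txt = txt)
# split each page into one word per row
txt %>%
  unnest_tokens(word, txt)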
This gives me a data frame with one word per row. I'd like to get to a data frame where the articles are unpacked into words alongside a title column and an author column.
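To make the target shape concrete, here is a rough sketch of one possible approach, not a confirmed solution. It assumes each PDF carries Title and Author entries in its metadata, which pdf_info() exposes under keys; many PDFs leave these unset, in which case the columns come back as NA. The helper names read_pdf_row and read_pdf_folder and the folder path are hypothetical.

```r
library(pdftools)
library(tidytext)
library(dplyr)

# Hypothetical helper: turn one PDF into a one-row tibble.
read_pdf_row <- function(path) {
  keys <- pdf_info(path)$keys
  # Title/Author exist only if the PDF's metadata sets them; otherwise use NA.
  title  <- keys$Title
  author <- keys$Author
  tibble(
    file   = basename(path),
    title  = if (is.null(title)) NA_character_ else as.character(title),
    author = if (is.null(author)) NA_character_ else as.character(author),
    # pdf_text() returns one string per page; collapse them into one document.
    text   = paste(pdf_text(path), collapse = "\n")
  )
}

# Hypothetical helper: read every PDF in a folder, one row per file.
read_pdf_folder <- function(dir) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  bind_rows(lapply(files, read_pdf_row))
}

# Usage (the folder path is a placeholder):
# articles <- read_pdf_folder("path/to/pdfs")
# words <- articles %>% unnest_tokens(word, text)
```

After unnest_tokens(), the file, title and author columns are repeated on every word row, so the usual tidytext grouping and counting can be done per article.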