tidytext read files from folder

Question

I'm trying to read a folder of pdf files into a dataframe in R. I'm able to read individual pdf files in using the pdftools library and pdf_text(filepath).

Ideally, I could grab the author and title of each PDF in a series and push them into a dataframe with author and title columns, so that I can then use basic tidytext functions on the text.

For a single file right now, I can just use:

library(pdftools)
library(tidytext)
library(dplyr)
txt <- pdf_text("filepath")
txt <- tibble(txt)
txt %>%
     unnest_tokens(word, txt)

Here I have a dataframe with single words. I'd like to get to a dataframe where I have articles unpacked including a title and author column.

You can get a list of files with `files <- list.files(pattern = "\\.pdf$")` and then read them into a list with `txtList <- sapply(files, pdf_text)`. Hopefully you can extract title/author from one of these too, although it is impossible to tell unless you share an example of your data. – Andrew Gustar, May 30 '17 at 09:30
Andrew, thanks so much. I was actually able to get the texts in as a list with your suggestion, but had a hard time getting this to be a dataframe where I could tidy up the text. — jfkoehler, May 31 '17 at 00:20

Julia Silge · Accepted Answer · 2017-05-31T19:21:18.137

7

To find all the PDFs in the working directory, you can use list.files with the pattern argument:

all_pdfs <- list.files(pattern = "\\.pdf$")

The all_pdfs object will then be a character vector that contains all your filenames.
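
The pattern argument is a regular expression, so an unescaped dot matches any character. As a small illustration (the filenames here are made up, and grepl is only standing in for the matching that list.files does internally), compare the escaped and unescaped forms:

```r
# Hypothetical filenames, used only to illustrate the pattern.
candidates <- c("report.pdf", "notes.PDF", "summary_pdf", "data.pdf.bak")

# Escaped dot: only names that end in a literal ".pdf" match.
strict <- grepl("\\.pdf$", candidates)

# Same pattern, ignoring case, so ".PDF" is also picked up.
any_case <- grepl("\\.pdf$", candidates, ignore.case = TRUE)

# Unescaped dot: "." matches any character, so "summary_pdf" slips through.
loose <- grepl(".pdf$", candidates)
```

list.files() also accepts ignore.case = TRUE if some of your files use an upper-case extension.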

Then, you can set up a pipe to read in all the PDFs and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each line with the filename, if you'd like.

library(pdftools)
library(tidyverse)
library(tidytext)

map_df(all_pdfs, ~ tibble(txt = pdf_text(.x)) %>%
    mutate(filename = .x) %>%
    unnest_tokens(word, txt))

You'll need to do some fancier work to get a title and author column, depending on where you have that information. Maybe with a regex on txt or filename before unnesting?
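
As one possible sketch of the filename route, suppose (purely as an assumption, since the question does not show the data) that the files are named like "Author - Title.pdf". The helper parse_pdf_name below is hypothetical, not part of any package, and uses only base R:

```r
# Hypothetical helper: assumes files are named "Author - Title.pdf".
# Names that do not follow that convention get NA for author and title.
parse_pdf_name <- function(paths) {
  files <- basename(paths)
  stem <- tools::file_path_sans_ext(files)
  # Capture everything before the first " - " as author, the rest as title.
  parts <- regmatches(stem, regexec("^(.*?) - (.*)$", stem, perl = TRUE))
  pick <- function(i) {
    vapply(parts,
           function(p) if (length(p) == 3) p[i] else NA_character_,
           character(1))
  }
  data.frame(
    filename = files,
    author = pick(2),
    title = pick(3),
    stringsAsFactors = FALSE
  )
}

# Made-up filenames for illustration.
meta <- parse_pdf_name(c("Silge - Text Mining with R.pdf", "notes.pdf"))
```

The resulting table has one row per file, so it could be attached to the unnested words with something like left_join(words, meta, by = "filename"). If the author and title only appear inside the PDF text, a regex on txt before unnesting would be needed instead, and that depends entirely on how your documents are laid out.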

edited May 31 '17 at 19:21

answered May 30 '17 at 19:34


Excellent, works great to read all in as text. How would I include the filename as a column? – jfkoehler May 31 '17 at 04:02
Great! I elaborated on this answer adding summary functions and stop words in this answer: https://stackoverflow.com/a/60321956/1839959 – Stan Feb 20 '20 at 15:58

maddocent · Answer 2 · 2018-04-14T22:25:25.270

I suggest adding basename(.x). This removes the directory part of each path, which you will otherwise get if you use the full.names = TRUE option in list.files(), as I did.

df <- map_df(all_pdfs[3:5], ~ tibble(txt = pdf_text(.x)) %>%
    mutate(filename = basename(.x)) %>%
    unnest_tokens(word, txt))

Also, if you hit PDF parsing errors such as "Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure.", you could create a safe version of pdf_text() with safe_pdf_text <- purrr::safely(pdf_text). The safe version returns a list with result and error elements instead of stopping, so one bad file does not abort the whole run. For more on using the {purrr} package for this, see this blog post by Bruno Rodrigues: http://www.brodrigues.co/blog/2017-03-24-lesser_known_purrr/
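
A minimal sketch of how the safe version might be used across many files is below. The function read_pdfs_safely and its reader argument are my own naming, added so the idea can be tried without real PDFs; by default it wraps pdftools::pdf_text.

```r
library(purrr)

# Hypothetical wrapper: reads every path with a safely()-wrapped reader and
# separates the files that parsed from the ones that raised an error.
# `reader` defaults to pdftools::pdf_text; it is a parameter only so the
# sketch can be exercised with a stand-in function.
read_pdfs_safely <- function(paths, reader = pdftools::pdf_text) {
  safe_reader <- safely(reader)
  results <- set_names(map(paths, safe_reader), basename(paths))
  list(
    # Text of the files that parsed (failed reads yield NULL and are dropped;
    # compact() also drops zero-length results).
    ok = compact(map(results, "result")),
    # Names of the files whose read raised an error.
    failed = names(keep(results, ~ !is.null(.x$error)))
  )
}
```

Calling read_pdfs_safely(all_pdfs) would then give a named list of page texts in ok, which can be turned into tibbles and unnested as above, plus a character vector failed listing the files worth inspecting by hand.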
