
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?

In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.

Any suggestions?

DrewConway
  • Although the question is only vaguely related, the answer points out some interesting problems with text extraction from PDF files: http://stackoverflow.com/questions/2732178/extracting-text-from-pdf-with-poppler-c – nico Oct 04 '10 at 05:11
  • Thanks Nico. Fortunately, the particular PDFs I am working with are very simple text files, so hopefully this will be less of an issue. – DrewConway Oct 04 '10 at 14:38

7 Answers


This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
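
A minimal sketch of its use, assuming pdftools is installed and "document.pdf" is a hypothetical file in the working directory:

library(pdftools)
pages <- pdf_text("document.pdf")        # character vector, one element per page
lines <- strsplit(pages[1], "\n")[[1]]   # split the first page into individual lines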

Remko Duursma
  • This package is really the easiest way to get text from PDFs using R at the moment. – Ben Sep 07 '16 at 06:44
  • Yes, this thread was before pdftools. Same for me: a pretty useful tool that makes it relatively easy to extract even tables out of PDF files. – davidski Dec 01 '16 at 10:03

Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.

That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
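
A minimal sketch of calling it from R, assuming pdftotext is on the PATH (it ships with poppler-utils) and "foo.pdf" is a hypothetical file:

system("pdftotext foo.pdf")    # writes foo.txt alongside foo.pdf
txt <- readLines("foo.txt")    # read the extracted text back into R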

Dirk Eddelbuettel

A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install it, upload the PDF, and select the table in the PDF whose data you want to extract. Not a direct solution in R, but certainly better than manual labor.

NiuBiBang
  • If you are on Windows, you may use this free utility especially designed for extracting tables from PDFs: https://bytescout.com/products/pdfmultitool/index.html (Disclosure: I worked on this utility) – Eugene Jan 05 '16 at 21:00

A purely R solution could be:

library(tm)

file <- 'namefile.pdf'
# readPDF() returns a reader function; "-layout" tells the underlying
# pdftotext engine to preserve the original page layout
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
                  readerControl = list(reader = Rpdf))
# Extract the text of the first document as a character vector of lines
corpus.array <- content(content(corpus)[[1]])

You'll then have the PDF's lines in a character vector.

willallgs
install.packages("pdftools")
library(pdftools)


download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf", 
              "56901.DEN.Gamebook", mode = "wb")

txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
DataProphets

The tabula PDF table extractor app is built around tabula-extractor, a command-line Java application packaged as a JAR.

The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get the data in its tables extracted.

Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.

Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
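
A minimal sketch, assuming tabulizer is installed and "report.pdf" is a hypothetical file; extract_tables() returns a list with one matrix per detected table:

library(tabulizer)
tables <- extract_tables("report.pdf")   # let Tabula guess where the tables are
first <- as.data.frame(tables[[1]])      # first detected table as a data frame
# A target area can be given per page as c(top, left, bottom, right) in points:
# extract_tables("report.pdf", pages = 1, area = list(c(100, 50, 400, 550)), guess = FALSE)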

For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.

psychemedia

I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.

Set the path to pdftotext.exe and convert each PDF to text:

library(stringr)  # for str_sub()

# Assumes pdfFracList (a character vector of PDF file names) and
# reportDir (the directory containing them) are defined elsewhere
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

for (i in seq_along(pdfFracList)) {
    # Strip the ".pdf" extension to get the base file name
    fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
    pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
    txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
    print(paste0("File number ", i, ", processing file ", pdfSource))
    # xpdf's pdftotext with -table tries to preserve table layout
    system(paste(exeFile, "-table", pdfSource, txtDestination), wait = TRUE)
}