How to scrape a downloaded PDF file with R

Question

I’ve recently gotten into scraping (and programming in general) for my internship, and I came across PDF scraping. Every time I try to read a scanned pdf with R, I can never get it to work. I’ve tried using the file.choose() function to no avail. Do I need to change my directory, or how can I get the pdf from my files into R? The code looks something like this:

    > library(pdftools)
    > text=pdf_text("C:/Users/myname/Documents/renewalscan.pdf")
    > text
    [1] ""

Also, using pdftables leads me here:

    > library(pdftables)
    > convert_pdf("C:/Users/myname/Documents/renewalscan.pdf","my.csv")
    Error in get_content(input_file, format, api_key) : 
    Bad Request (HTTP 400).

What are you trying to scrape it with? Is there non-image text in it to scrape, or is it in an image? This question isn't answerable without [a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — alistaire, Jun 07 '18 at 20:36
Apologies, I’m using pdftools and tm, and was trying to follow along with what’s said in http://medium.com/@CharlesBordet/how-to-extract-and-clean-data-from-pdf-files-in-r-da11964e252e. Normally, a file is downloaded from the web, but I have the file already on my computer. Also, it is a table in pdf form. — Thomas Campbell, Jun 07 '18 at 20:49
Possible duplicate of [Recognize PDF table using R](https://stackoverflow.com/questions/44141160/recognize-pdf-table-using-r) — jay.sf, Jun 07 '18 at 21:25
Similar thread here: https://stackoverflow.com/questions/51312453/pdftables-r-package-throwing-http-400-error/51312901#51312901 — mphil4, Feb 28 '19 at 11:19

score 4 · Answer 1 · answered Jun 07 '18 at 20:52

You should use the pdftools and pdftables packages.

If you are trying to read the text inside the PDF, use the pdf_text() function. Its argument is the path to the PDF, either on your computer or on the web. For example:

    library(pdftools)
    tt <- pdf_text("C:/Users/Smith/Documents/my_file.pdf")

It would help if you were more specific and also gave us a reproducible example.
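
Since the question mentions a scanned PDF and pdf_text() returning "", one plausible explanation is that the file holds only page images and no text layer. The sketch below checks for that case and falls back to OCR. It assumes pdftools is installed, and that the tesseract package is available for pdf_ocr_text(); `has_text_layer()` and `read_pdf_pages()` are hypothetical helper names, not part of pdftools.

```r
# Sketch: read a local PDF and fall back to OCR when no text layer is found.
# has_text_layer() and read_pdf_pages() are hypothetical helpers.

has_text_layer <- function(pages) {
  # TRUE if at least one page contains a non-whitespace character
  any(nzchar(trimws(pages)))
}

read_pdf_pages <- function(path) {
  # Fail early with a clear message if the path is wrong
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  pages <- pdftools::pdf_text(path)
  if (!has_text_layer(pages)) {
    # Scanned PDFs are images; pdf_ocr_text() needs the tesseract package
    pages <- pdftools::pdf_ocr_text(path)
  }
  pages
}

# Usage (path taken from the question):
# pages <- read_pdf_pages("C:/Users/myname/Documents/renewalscan.pdf")
# cat(pages[1])
```

The file.exists() check also answers the working-directory part of the question: with a full path like the one above, there is no need to change directory.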

Giovana Stein

I apologize for the lack of clarity. This is my first post here, so I’m trying to get the hang of it all. I’m editing the post now to show my code. – Thomas Campbell Jun 07 '18 at 20:59

score 0 · Answer 2 · answered Mar 29 '19 at 07:33

To use the PDFTables R package, you need to run the following command:

    library(pdftables)
    convert_pdf('test/index.pdf', output_file = NULL, format = "xlsx-single", message = TRUE, api_key = "insert_API_key")
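
The call in the question passes no api_key, so a missing or invalid key is one possible cause of the HTTP 400 error (an assumption, not something the thread confirms). A minimal sketch that keeps the key out of the script by reading it from an environment variable; the variable name `PDFTABLES_API_KEY` and the helpers `get_api_key()` and `convert_scan()` are hypothetical.

```r
# Sketch: pass the PDFTables API key explicitly, read from an environment variable.
# PDFTABLES_API_KEY, get_api_key() and convert_scan() are hypothetical names.

get_api_key <- function(var = "PDFTABLES_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Set the ", var, " environment variable to your PDFTables API key")
  }
  key
}

convert_scan <- function(input_file, output_file) {
  # Check the local path before making any web request
  if (!file.exists(input_file)) {
    stop("File not found: ", input_file)
  }
  pdftables::convert_pdf(
    input_file,
    output_file = output_file,
    format      = "csv",
    api_key     = get_api_key()
  )
}

# Usage (paths taken from the question):
# Sys.setenv(PDFTABLES_API_KEY = "insert_API_key")
# convert_scan("C:/Users/myname/Documents/renewalscan.pdf", "my.csv")
```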

mphil4

score 0 · Answer 3 · answered Sep 24 '19 at 15:51

If you are looking to get tabular data, you might try tabulizer. Here is a full code tutorial: https://www.business-science.io/code-tools/2019/09/23/tabulizer-pdf-scraping.html

Basically, you can use this code from the tutorial:

    library(tabulizer)
    extract_tables(
        file   = "2019-09-23-tabulizer/endangered_species.pdf",
        method = "decide",
        output = "data.frame")
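
With output = "data.frame", extract_tables() returns a list with one data frame per detected table, so a follow-up step is usually needed. A rough sketch of one way to drop empty results and save the rest as CSV files; `keep_non_empty()` and `write_tables()` are hypothetical helpers, and the output file naming is an assumption.

```r
# Sketch: post-process the list of data frames returned by extract_tables().
# keep_non_empty() and write_tables() are hypothetical helpers.

keep_non_empty <- function(tables) {
  # Keep only data frames that have at least one row
  Filter(function(tbl) is.data.frame(tbl) && nrow(tbl) > 0, tables)
}

write_tables <- function(tables, out_dir) {
  tables <- keep_non_empty(tables)
  # One CSV per table: table_01.csv, table_02.csv, ...
  paths <- file.path(out_dir, sprintf("table_%02d.csv", seq_along(tables)))
  for (i in seq_along(tables)) {
    write.csv(tables[[i]], paths[i], row.names = FALSE)
  }
  paths
}

# Usage (file path taken from the tutorial code above):
# tables <- tabulizer::extract_tables(
#   file   = "2019-09-23-tabulizer/endangered_species.pdf",
#   method = "decide",
#   output = "data.frame")
# write_tables(tables, out_dir = ".")
```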
