2

I’ve recently gotten into scraping (and programming in general) for my internship, and I came across PDF scraping. Whenever I try to read a scanned PDF with R, I can't get it to work. I’ve tried using the file.choose() function, to no avail. Do I need to change my working directory, or is there another way to get the PDF from my files into R? The code looks something like this:

    > library(pdftools)
    > text=pdf_text("C:/Users/myname/Documents/renewalscan.pdf")
    > text
    [1] ""

Using pdftables instead gives me this error:

    > library(pdftables)
    > convert_pdf("C:/Users/myname/Documents/renewalscan.pdf","my.csv")
    Error in get_content(input_file, format, api_key) : 
    Bad Request (HTTP 400).
Nazim Kerimbekov
  • What are you trying to scrape it with? Is there non-image text in it to scrape, or is it in an image? This question isn't answerable without [a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – alistaire Jun 07 '18 at 20:36
  • Apologies, I’m using pdftools and tm, and was trying to follow along with http://medium.com/@CharlesBordet/how-to-extract-and-clean-data-from-pdf-files-in-r-da11964e252e. Normally the file is downloaded from the web, but I already have the file on my computer. Also, it is a table in PDF form. – Thomas Campbell Jun 07 '18 at 20:49
  • Possible duplicate of [Recognize PDF table using R](https://stackoverflow.com/questions/44141160/recognize-pdf-table-using-r) – jay.sf Jun 07 '18 at 21:25
  • Similar thread here: https://stackoverflow.com/questions/51312453/pdftables-r-package-throwing-http-400-error/51312901#51312901 – mphil4 Feb 28 '19 at 11:19

3 Answers

4

You should use the packages pdftools and pdftables.

If you are trying to read the text inside the PDF, use the pdf_text() function. Its argument is the path to the PDF, either on your computer or on the web. For example:

    library(pdftools)
    tt = pdf_text("C:/Users/Smith/Documents/my_file.pdf")
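
Since the question mentions a scanned PDF, an empty string from pdf_text() may simply mean that the pages are images with no text layer. Here is a minimal sketch of an OCR fallback, assuming a recent version of pdftools with the tesseract package installed so that pdf_ocr_text() is available. The helpers is_blank_text() and read_pdf_text() are hypothetical names for illustration, not part of pdftools:

```r
# Hypothetical helper: TRUE when every page came back empty or whitespace-only.
is_blank_text <- function(pages) {
  all(!nzchar(trimws(pages)))
}

# Hypothetical helper: try the embedded text layer first, then fall back to OCR.
read_pdf_text <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  pages <- pdftools::pdf_text(path)
  if (is_blank_text(pages)) {
    # Assumption: the tesseract package is installed; OCR can be slow.
    pages <- pdftools::pdf_ocr_text(path)
  }
  pages
}
```

Calling read_pdf_text("C:/Users/Smith/Documents/my_file.pdf") would then return one string per page. OCR output usually needs cleaning, so treat the result as a starting point.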

It would help if you were more specific and gave us a reproducible example.

Giovana Stein
  • I apologize for the lack of clarity. This is my first post here, so I’m trying to get the hang of it all. I’m editing the post now to show my code. – Thomas Campbell Jun 07 '18 at 20:59
0

To use the PDFTables R package, you need to run the following command:

    library(pdftables)
    # Replace "insert_API_key" with your own PDFTables API key
    convert_pdf('test/index.pdf', output_file = NULL, format = "xlsx-single", message = TRUE, api_key = "insert_API_key")
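
If you would rather not paste the key into the script, one option is to read it from an environment variable and pass it through api_key. This is only a sketch: the variable name PDFTABLES_API_KEY and the helper get_api_key() are made up for illustration, and the output file name is an assumption.

```r
# Hypothetical helper: read the key from an environment variable of our own choosing.
get_api_key <- function(var = "PDFTABLES_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Set the ", var, " environment variable to your PDFTables API key")
  }
  key
}

input <- "C:/Users/myname/Documents/renewalscan.pdf"

# Only attempt the conversion when the package and the file are both present.
if (requireNamespace("pdftables", quietly = TRUE) && file.exists(input)) {
  pdftables::convert_pdf(
    input,
    output_file = "renewalscan.xlsx",  # assumed output name
    format      = "xlsx-single",
    api_key     = get_api_key()
  )
}
```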
mphil4
0

If you are looking to get tabular data, you might try tabulizer. Here is a full code tutorial: https://www.business-science.io/code-tools/2019/09/23/tabulizer-pdf-scraping.html

Basically, you can use this code from the tutorial:

    library(tabulizer)
    extract_tables(
        file   = "2019-09-23-tabulizer/endangered_species.pdf",
        method = "decide",
        output = "data.frame")
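
With output = "data.frame", extract_tables() returns a list with one data frame per detected table, so a natural next step is to write each one to its own CSV. A rough sketch, where save_tables() is a hypothetical helper built only on base R and the output folder name is an assumption:

```r
# Hypothetical helper: write each data frame in a list to its own CSV file
# and return the paths that were written.
save_tables <- function(tables, out_dir = tempdir(), prefix = "table") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- file.path(out_dir, sprintf("%s_%02d.csv", prefix, seq_along(tables)))
  for (i in seq_along(tables)) {
    write.csv(tables[[i]], paths[i], row.names = FALSE)
  }
  paths
}

pdf <- "2019-09-23-tabulizer/endangered_species.pdf"

# Only run the extraction when tabulizer and the PDF are actually available.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(pdf)) {
  tables <- tabulizer::extract_tables(file = pdf, method = "decide", output = "data.frame")
  save_tables(tables, out_dir = "extracted_tables")  # assumed folder name
}
```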
Matt Dancho