
I would like to download a PDF file from the internet and save it to my local hard drive. After the download, the output PDF file has lots of empty pages. What can I do to fix it?

Example:

require(XML)
url <- ('http://cran.r-project.org/doc/manuals/R-intro.pdf')
download.file(url, 'introductionToR.pdf')

Thanks in advance.

sealz
Diogo
  • I copied and pasted your code and got the 109-page document, as it should be. Maybe a problem with your PDF viewer? – vaettchen Feb 14 '12 at 16:23
  • Works fine for me (R 2.14.1, Linux). Could you post the results of `sessionInfo()`? It does seem likely to be a viewer or some other OS issue, as this is pretty basic functionality. By the way, you don't need the `XML` package for this: `download.file` is part of base R. – Ben Bolker Feb 14 '12 at 16:31
  • PS. I'm guessing you're on Windows: `?download.file` says: "Code written to download binary files must use ‘mode = "wb"’, but the problems incurred by a text transfer will only be seen on Windows." – Ben Bolker Feb 14 '12 at 16:33
  • I had the same problem as the OP: the downloaded PDF was corrupted. The damn 'wb' parameter solved the problem. – userJT Mar 12 '15 at 09:30

2 Answers

49

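Try binary mode (`mode = "wb"`), like this:

download.file(url, 'introductionToR.pdf', mode = "wb")

That works for me. As `?download.file` notes, binary files must be downloaded with `mode = "wb"`; a text-mode transfer corrupts them on Windows.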
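As a minimal sketch of that fix, the download can be wrapped in a small helper that also checks whether the saved file starts with the PDF signature `%PDF`. The function name `download_pdf` and the signature check are illustrative assumptions, not part of the original answer; only `download.file`, `readBin` and `charToRaw` from base R are used.

```r
# Hypothetical helper (not from the original answer): download a file in
# binary mode and check that it starts with the PDF signature "%PDF".
download_pdf <- function(url, destfile) {
  # mode = "wb" writes the bytes unchanged; a text-mode transfer can
  # corrupt binary files on Windows.
  status <- download.file(url, destfile, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed with status ", status)

  # Read the first four bytes and compare them with "%PDF".
  header <- readBin(destfile, what = "raw", n = 4)
  if (!identical(header, charToRaw("%PDF"))) {
    warning("file does not start with '%PDF'; it may be corrupted")
  }
  invisible(destfile)
}

# Usage with the URL from the question (needs network access):
# download_pdf("http://cran.r-project.org/doc/manuals/R-intro.pdf",
#              "introductionToR.pdf")
```

The header check only catches a file that is not a PDF at all (for example an HTML error page); it would not detect every kind of corruption.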

Sophia
-1

You can download PDFs and extract their tables as data frames using the tabulizer package:

https://ropensci.org/tutorials/tabulizer_tutorial.html

# install_github() is called from devtools here so that it matches the package installed on the first line
install.packages("devtools")
# on 64-bit Windows
devtools::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
devtools::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"))

library(tabulizer)

f2 <- "https://github.com/leeper/tabulizer/raw/master/inst/examples/data.pdf"
extract_tables(f2, pages = 1, method = "data.frame")

Selcuk Akbas