
I have a PDF file with multiple pages, but I am interested in only a subset of them. For example, my original PDF has 30 pages and I want only pages 10 to 16.

I tried the function split_pdf from the tabulizer package, which only splits the PDF page by page (resulting in 200 files, one for each page), followed by merge_pdfs (which merges PDF files). It works, but it takes ages (and I have around 2000 PDF files to split).

This is the code I am using:

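library(tabulizer)

# split_pdf() writes one file per page and returns the paths of those files
split = split_pdf('file_path')

start = 10
end = 16

# merge only the pages of interest back into a single PDF
merge_pdfs(split[start:end], 'saving_path')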

I couldn't find a better way to do this. Any help would be appreciated.

zx8754
Giovana Stein
  • Maybe check out `pdftools` package if you haven't already. Haven't used it myself, but it is a common recommendation. Second, if this is not eating up too much memory, you might try running your split/merge combo through a parallel process. See packages `parallel` or `foreach`. You may be able to run through a number of these at the same time. – lmo Mar 16 '18 at 21:44
  • I am already using a for loop, the problem is that the split_pdf is taking too long, because my pdf files are big! I would like to have a function where I could input the start and end pages, in order to skip splitting page by page. – Giovana Stein Mar 16 '18 at 22:06

3 Answers


Unfortunately, it is a bit unclear to me what kind of data is in your PDF and what you are trying to extract from it, so I outline two approaches.

  1. If you have tables in the PDF, you should be able to extract the data from those pages using:

    tab <- tabulizer::extract_tables(file = "path/file.pdf", pages = 10:16)

  2. If you only want the text, you should use pdftools which is a lot faster:

    text <- pdftools::pdf_text("path/file.pdf")[10:16]
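Since the question mentions around 2000 files, the pdftools call above could be wrapped in a loop. The following is only a minimal sketch: the folder name `pdf_dir` is a placeholder and the helper `read_pages` is a name introduced here for illustration, not part of pdftools.

```r
library(pdftools)

# Hypothetical helper: return the text of the requested pages of one PDF.
# Page numbers beyond the end of the document are silently dropped.
read_pages <- function(path, pages) {
  txt <- pdf_text(path)               # one character string per page
  txt[pages[pages <= length(txt)]]
}

# "pdf_dir" is a placeholder for the folder that holds the PDF files
files <- list.files("pdf_dir", pattern = "\\.pdf$", full.names = TRUE)

# One list element per file, named after the file it came from
texts <- lapply(files, read_pages, pages = 10:16)
names(texts) <- basename(files)
```

Note that pdf_text still reads the whole document before the subset is taken, so this mainly saves the cost of writing one file per page.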

JBGruber

Install pdftk (if you don't already have it). Assuming it is on your path and myfile.pdf is in the current directory, run this from R:

system("pdftk myfile.pdf cat 10-16 output myfile_10to16.pdf")
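For the roughly 2000 files mentioned in the question, the same pdftk call could be generated per file. This is a sketch under assumptions: `pdf_dir` is a placeholder folder, `pdftk_args` and `extract_range` are hypothetical helper names, and pdftk is assumed to be on the path as above.

```r
# Hypothetical helper: build the pdftk arguments for extracting a page range.
# shQuote() protects file names that contain spaces.
pdftk_args <- function(input, output, first, last) {
  c(shQuote(input), "cat",
    sprintf("%d-%d", as.integer(first), as.integer(last)),
    "output", shQuote(output))
}

# Hypothetical helper: run pdftk on one file.
# system2() returns the exit status of the command (0 on success).
extract_range <- function(input, output, first = 10, last = 16) {
  system2("pdftk", pdftk_args(input, output, first, last))
}

# "pdf_dir" is a placeholder for the folder that holds the PDF files
files <- list.files("pdf_dir", pattern = "\\.pdf$", full.names = TRUE)
outputs <- sub("\\.pdf$", "_10to16.pdf", files)

# One exit status per input file
status <- mapply(extract_range, files, outputs)
```

Checking `status` afterwards shows which files, if any, pdftk failed on.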
G. Grothendieck

As a complement to G. Grothendieck's answer, one could also use the package staplr, which is an R wrapper around the program pdftk:

library('staplr')

staplr::select_pages(
    selpages = 10:16,
    input_filepath = 'file_path',
    output_filepath = 'saving_path')

In my experience, plain pdftk is faster. However, if you need to do something complex and are more familiar with R syntax than with bash syntax, the staplr package will save coding time.

jorvaor