I have a table in a PDF file with more than 100,000 rows spread over more than 1,900 pages, which I decided to write to a .csv file with the R package tabulizer.
When I try to extract all of the data from the PDF file at once with
pdf <- extract_tables("pdffile.pdf", method = "csv")
I get an error:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
  java.lang.OutOfMemoryError: GC overhead limit exceeded
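(My understanding is that this is a JVM memory error rather than an R one, and that the Java heap can be enlarged before tabulizer/rJava is loaded, along the lines of the snippet below; the 4g value is only an example and I have not checked whether any setting would be enough for a file this size.)

# must be set before the JVM is started, i.e. before library(tabulizer)
options(java.parameters = "-Xmx4g")  # example heap size, adjust to available RAM
library(tabulizer)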
Therefore I followed another approach: extract the pages of the PDF one by one and save each page's output as its own .csv file.
1) Get the number of pages of the PDF file:

# get_page_dims() returns one element per page, so its length is the page count
pdfPages <- length(get_page_dims("pdffile.pdf"))
2) Create a for loop that stores a .csv file for each page:

for (i in 1:pdfPages) {
  # extract only page i and write it to its own file ("1.csv", "2.csv", ...)
  page <- extract_tables("pdffile.pdf", pages = i, method = "data.frame")
  write.csv(page, file = paste(i, ".csv", sep = ""))
}
3) Then I created another loop to read each file in turn and rbind it to the ones already read:

library(dplyr)  # for bind_rows()

dataPdf <- data.frame()  # accumulates the rows of every .csv file
for (i in 1:pdfPages) {
  page <- read.csv(paste(i, ".csv", sep = ""))
  dataPdf <- bind_rows(dataPdf, page)
}
I had to use bind_rows() from the dplyr package since not all of the .csv files ended up with the same number of columns.
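To illustrate what I mean (made-up toy data, not from my file): bind_rows() pads columns that are missing from one of the inputs with NA, whereas plain rbind() would throw an error:

library(dplyr)
a <- data.frame(V1 = 1:2, V2 = 3:4, V3 = 5:6)  # a "3-column page"
b <- data.frame(V1 = 7, V2 = 8)                # a "2-column page"
bind_rows(a, b)  # the row from b gets NA in V3; rbind(a, b) would fail instead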
The result was more than satisfactory, though it took about 1.75 hours to complete, so I suspect there is a better way to do it. Any ideas?
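For instance, would it be noticeably faster to skip the intermediate .csv files altogether and just keep each page's table in a list, binding everything once at the end? A rough, untested sketch of what I have in mind (it assumes each page holds exactly one table, so [[1]] picks it out):

library(tabulizer)
library(dplyr)

pdfPages <- length(get_page_dims("pdffile.pdf"))

# extract each page separately as before, but keep the result in memory
# instead of writing and re-reading a .csv per page
pages <- lapply(1:pdfPages, function(i) {
  extract_tables("pdffile.pdf", pages = i, method = "data.frame")[[1]]
})

dataPdf <- bind_rows(pages)  # bind_rows() accepts a list of data frames

Or is the per-page call to extract_tables() itself the bottleneck, in which case this would not help much?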