I have a table in a PDF file with more than 100,000 rows spread over more than 1,900 pages, which I decided to write to a .csv file with the R package tabulizer.
When I try to extract all of the data from the PDF file at once with
pdf <- extract_tables("pdffile.pdf", method = "csv")
I get an error:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
  java.lang.OutOfMemoryError: GC overhead limit exceeded
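(My understanding is that this is a JVM memory error rather than an R one, and that the Java heap can be enlarged before tabulizer/rJava is loaded, along the lines of the snippet below; the 4g value is only an example and I have not checked whether any setting would be enough for a file this size.)

# must be set before the JVM is started, i.e. before library(tabulizer)
options(java.parameters = "-Xmx4g")  # example heap size, adjust to available RAM
library(tabulizer)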
Therefore I followed another approach: extract the pages of the PDF one by one and save each page's output as its own .csv file.
1) Get the number of pages of the PDF file:

# get_page_dims() returns one element per page, so its length is the page count
pdfPages <- length(get_page_dims("pdffile.pdf"))
2) Create a for loop that stores a .csv file for each page:

for (i in 1:pdfPages) {
  # extract only page i and write it to its own file ("1.csv", "2.csv", ...)
  page <- extract_tables("pdffile.pdf", pages = i, method = "data.frame")
  write.csv(page, file = paste(i, ".csv", sep = ""))
}
3) Then I created another loop to read each file in turn and rbind it to the ones already read:

library(dplyr)  # for bind_rows()

dataPdf <- data.frame()  # accumulates the rows of every .csv file
for (i in 1:pdfPages) {
  page <- read.csv(paste(i, ".csv", sep = ""))
  dataPdf <- bind_rows(dataPdf, page)
}
I had to use bind_rows() from the dplyr package since not all of the .csv files ended up with the same number of columns.
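To illustrate what I mean (made-up toy data, not from my file): bind_rows() pads columns that are missing from one of the inputs with NA, whereas plain rbind() would throw an error:

library(dplyr)
a <- data.frame(V1 = 1:2, V2 = 3:4, V3 = 5:6)  # a "3-column page"
b <- data.frame(V1 = 7, V2 = 8)                # a "2-column page"
bind_rows(a, b)  # the row from b gets NA in V3; rbind(a, b) would fail instead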
The result was more than satisfactory, though it took about 1.75 hours to complete, so I suspect there is a better way to do it. Any ideas?
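For instance, would it be noticeably faster to skip the intermediate .csv files altogether and just keep each page's table in a list, binding everything once at the end? A rough, untested sketch of what I have in mind (it assumes each page holds exactly one table, so [[1]] picks it out):

library(tabulizer)
library(dplyr)

pdfPages <- length(get_page_dims("pdffile.pdf"))

# extract each page separately as before, but keep the result in memory
# instead of writing and re-reading a .csv per page
pages <- lapply(1:pdfPages, function(i) {
  extract_tables("pdffile.pdf", pages = i, method = "data.frame")[[1]]
})

dataPdf <- bind_rows(pages)  # bind_rows() accepts a list of data frames

Or is the per-page call to extract_tables() itself the bottleneck, in which case this would not help much?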