I want to use R to efficiently extract tabular data from thousands of PDF documents. I would typically convert the PDF data to text strings and then extract information by position, but these specific tables are often missing data, as shown in the example below. The location of the missing data varies between documents. Can anyone suggest a method for doing this?

Example of the type of PDF

大陸北方網友

2 Answers

There are two packages which I use for this. Which is better depends on what exactly you need to do. Let's say your table is on pages 10-16 of a PDF:

  1. You should be able to extract the data from said pages using the tabulizer package:

    tab <- tabulizer::extract_tables(file = "path/file.pdf", pages = 10:16)

  2. If you only want the text, use the pdftools package, which is a lot faster:

    text <- pdftools::pdf_text("path/file.pdf")[10:16]
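
Because the question concerns tables whose blank cells shift the text, one possible approach is to place each word into a column by its horizontal position rather than by its order in the text. The sketch below assumes a data frame shaped like one page of `pdftools::pdf_data()` output (columns `x`, `y`, `text`). The helper `words_to_table()`, its arguments `col_breaks` and `row_tol`, and the sample coordinates are all hypothetical and would need tuning for real documents.

```r
# Hypothetical helper: rebuild a table from word positions.
# `words`      : data frame with columns x, y, text (one row per word)
# `col_breaks` : assumed left edge (x-coordinate) of each table column
# `row_tol`    : words whose sorted y-values differ by <= row_tol share a row
words_to_table <- function(words, col_breaks, row_tol = 2) {
  # Group words into rows by their vertical position
  words <- words[order(words$y), ]
  words$row <- cumsum(c(TRUE, diff(words$y) > row_tol))
  # Within each row, read words left to right
  words <- words[order(words$row, words$x), ]
  # Column index = interval of col_breaks that contains the word's x-coordinate
  words$col <- findInterval(words$x, col_breaks)

  out <- matrix(NA_character_, nrow = max(words$row), ncol = length(col_breaks))
  for (k in seq_len(nrow(words))) {
    i <- words$row[k]
    j <- words$col[k]
    if (j < 1) next  # word lies left of the first column edge; skip it
    # Several words can fall into the same cell; join them with a space
    out[i, j] <- if (is.na(out[i, j])) words$text[k] else paste(out[i, j], words$text[k])
  }
  out
}

# Made-up word positions standing in for one page of pdf_data() output,
# e.g. words <- pdftools::pdf_data("path/file.pdf")[[10]]
words <- data.frame(
  x    = c(10, 60, 110, 10, 110),
  y    = c(100, 100, 101, 120, 120),
  text = c("0-10", "520", "7.09", "10-60", "8.50"),
  stringsAsFactors = FALSE
)

# The second row has no word in the middle column, so that cell stays NA
tab <- words_to_table(words, col_breaks = c(0, 50, 100))
print(tab)
```

Cells with no word inside their x-interval stay `NA`, so a missing value does not pull later values into the wrong column. This only works if the column edges are stable across documents, which would have to be checked against the actual PDFs.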

JBGruber
  • `tabulizer` often works pretty well but there are no guarantees. [Sometimes it gets quite complicated](https://stackoverflow.com/a/55500376/5028841). `rJava` is an absolute pain to get running. There are tons of how-tos though depending on your operating system. Often simply restarting your computer does the trick. – JBGruber Sep 07 '20 at 12:57
  • Thanks for the response. To clarify: can tabulizer recognize tables where some cell values are missing? In the example above, columns 5 to 7 are missing data in row 2, but this isn't always the case in the PDFs. Will tabulizer let me specify that this table has 11 columns of data, some of which are blank at times? I stumbled on tabulizer earlier and tried to experiment with it, but it kept throwing an error along the lines of "JVM could not be found". I installed Java and rJava but was unable to correct the issue. – Nicholas George Sep 07 '20 at 13:02
  • I don't know. If you share the PDF, I can test it. Generally [that's a good approach to get answers tailored exactly to your question.](http://stackoverflow.com/questions/5963269) – JBGruber Sep 07 '20 at 13:55
  • I'm embarrassed to ask, but how does one share a PDF in Stack Overflow? – Nicholas George Sep 07 '20 at 14:03
  • Don't be embarrassed. Asking questions is fine. You simply upload it somewhere and share the link. pdfhost.io looks like a good option. – JBGruber Sep 07 '20 at 14:34
  • Thanks for your patience! The two documents at these links are examples. https://pdfhost.io/v/noYXw.8hu_WMaA18BRIM3pdf.pdf and https://pdfhost.io/v/Dfa1rVl~S_WEaA18JAMB4pdf.pdf . The table "Soil Tests" appears in both, and in one it is complete, and in the other, it has missing data. – Nicholas George Sep 07 '20 at 23:00

The following solution only works on Windows and requires Microsoft Word. I started from the image above. With the code below, I was able to extract the table:

    library(RDCOMClient)
    library(magick)

    ##############################################
    #### Step 1: Convert the image to a PDF   ####
    ##############################################

    path_PDF <- "C:\\temp.pdf"
    path_PNG <- "C:\\lP3hw.png"
    path_Word <- "C:\\temp.docx"

    pdf(path_PDF, width = 16, height = 6)
    im <- image_read(path_PNG)
    plot(im)
    # Draw horizontal rules so that Word detects the table rows
    abline(h = 50, col = "black")
    abline(h = 100, col = "black")
    abline(h = 130, col = "black")
    abline(h = 260, col = "black")
    dev.off()

    ##########################################################################
    #### Step 2: Use Word's OCR to convert the PDF to a Word document     ####
    ##########################################################################
    wordApp <- COMCreate("Word.Application")
    wordApp[["Visible"]] <- TRUE
    wordApp[["DisplayAlerts"]] <- FALSE

    doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                       ConfirmConversions = FALSE)

    doc$SaveAs2(path_Word)

    ############################################################
    #### Step 3: Extract the table from the Word document   ####
    ############################################################

    nb_Row <- doc$tables(1)$Rows()$Count()
    nb_Col <- doc$tables(1)$Columns()$Count()
    mat_Temp <- matrix(NA, nrow = nb_Row, ncol = nb_Col)

    # Read every cell; cells that do not exist (e.g. merged cells) become NA
    for (i in 1:nb_Row) {
      for (j in 1:nb_Col) {
        mat_Temp[i, j] <- tryCatch(doc$tables(1)$cell(i, j)$range()$text(),
                                   error = function(e) NA)
      }
    }

    mat_Temp

         [,1]              [,2]            [,3]                                                                            [,4]          [,5]       [,6]      
    [1,] "\r\a"            "/\r\a"         NA                                                                              NA            NA         NA        
    [2,] "\r\a"            "/\r\a"         "1 sand, 2\tmg/kg\tmglkg\tpH\tpH\tdS/m sandy loam, 3 loam, 4 loamy clay, 5 clay\r\a" NA            NA         NA        
    [3,] "30 May 2018\r\a" "0-10\r\a"      "\t520\t23.00\r\a"                                                                "Colwell\r\a" "7_09\r\a" "6_70\r\a"
    [4,] "30 May 2018\r\a" "\t10-60\t50\r\a" "9.0\r\a"                                                                       "\r\a"        "8_50\r\a" "7_80\r\a"
         [,7]      [,8]      
    [1,] NA        NA        
    [2,] NA        NA        
    [3,] "0.1\r\a" "0.93\r\a"
    [4,] "0.1\r\a" "3.3\r\a" 
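
Every cell in the output above ends with `"\r\a"`, and some cells contain tab characters. A minimal cleaning sketch, assuming these are the only markers that need removing, is shown below. The helper name `clean_word_cells()` is hypothetical, and the sample values are copied from the output above.

```r
# Hypothetical helper: tidy the raw cell text returned by Word
clean_word_cells <- function(m) {
  out <- gsub("[\r\a]+$", "", m)           # drop the trailing "\r\a" seen in each cell
  out <- trimws(gsub("\t+", " ", out))     # turn tabs into single spaces, then trim
  out[!is.na(out) & out == ""] <- NA       # treat empty cells as missing
  dim(out) <- dim(m)                       # keep the matrix shape
  out
}

# A few cells taken from the output above (filled column by column)
raw <- matrix(c("30 May 2018\r\a", "\r\a", "\t520\t23.00\r\a", NA), nrow = 2)
cleaned <- clean_word_cells(raw)
print(cleaned)
```

This does not repair OCR misreadings such as `7_09`, and it does not split cells where Word merged several columns into one. Both would need checking against the source document.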
Emmanuel Hamel