Extract Text from Two-Column PDF with R

Question

I have a lot of PDFs which are in two-column format. I am using the pdftools package in R. Is there a way to read each PDF according to the two-column format without cropping each PDF individually?

Each PDF consists of selectable text, and the pdf_text function has no problem reading it. The only issue is that it reads straight across the page: it takes the first line of the first column and then continues with the same line of the second column, instead of moving down the first column.

Thank you very much in advance for your help.

I'm not aware of a function that reads two column pdfs. I think you have to write your own procedure that reads each line, separates each column per line, rbind() each line per column per page, then rbind() each column per page, then rbind() each page to have a complete dataset that reads in the order it was written. — Ryan Morton, Mar 01 '17 at 21:20
That makes sense, the only issue is that R will read straight across a column and put only a space between words that are on either side. There is no way to differentiate that space from a normal space. — tsouchlarakis, Mar 02 '17 at 00:48
See this webpage for another approach that falls along similar lines: http://blog.agileactors.com/blog/2017/9/5/how-to-extract-and-clean-data-from-pdf-files-in-r — Rich Pauloo, Jan 29 '19 at 05:04

score 13 · Answer 1 · answered May 09 '19 at 01:11

There is a much easier way to do this: use the tabulizer::extract_text(file) function.

It works with PDF text contained in a single column and PDF text contained in 2+ columns. Yes, it's that simple!
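A minimal usage sketch, assuming tabulizer and its Java dependency are installed (per the comments on this answer, it may no longer be installable from CRAN). The wrapper name `read_pdf_text` and the file name `paper.pdf` are hypothetical and only there for illustration; the column handling itself is whatever `extract_text` does, as reported in this answer.

```r
# Hypothetical wrapper around tabulizer::extract_text(); not part of the package.
read_pdf_text <- function(path, pages = NULL) {
  # Fail early with a readable message if tabulizer is not available.
  if (!requireNamespace("tabulizer", quietly = TRUE)) {
    stop("The tabulizer package is not installed.")
  }
  # With pages = NULL the whole document is returned as one string;
  # otherwise one string is returned per requested page.
  tabulizer::extract_text(path, pages = pages)
}

# Example call ("paper.pdf" is a placeholder file name):
# txt <- read_pdf_text("paper.pdf")
# cat(txt)
```

Wrapping the call keeps the dependency check in one place, which helps when looping over many PDFs with `lapply()`.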

Cathryn Beeson-Lynch

Works like a charm, this is a great R package, thank you for your answer! – tsouchlarakis May 09 '19 at 17:52
It's always a great joy to see when somebody decides to handle challenges in the most simple fashion and just designs the necessary packages! – Anders Jørgensen May 23 '21 at 07:19
December 2021: tabulizer is no longer available in CRAN. – G5W Dec 13 '21 at 16:30
https://github.com/ropensci/tabulizer – Paul Jan 04 '22 at 17:04

Felipe Santiago · Accepted Answer · 2017-04-06T19:25:28.530

I had the same problem. For each page of my PDFs, I found the character positions at which a space occurs most often across the lines and stored them in a vector. Then I sliced each line at those positions.

library(pdftools)
src <- ""  # path to the PDF file
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

QTD_COLUMNS <- 2
read_text <- function(text) {
  result <- character(0)
  # Get the positions of every " " on each line of the page.
  lstops <- gregexpr(pattern = " ", text)
  # Put the most frequent space positions in a vector (the column boundaries).
  stops <- as.integer(names(sort(table(unlist(lstops)), decreasing = TRUE)[1:2]))
  # Slice each line into the specified number of columns (this can be improved).
  for (i in seq_len(QTD_COLUMNS)) {
    temp_result <- sapply(text, function(x) {
      start <- 1
      stop <- stops[i]
      if (i > 1)
        start <- stops[i - 1] + 1
      if (i == QTD_COLUMNS)  # last column: read until the end of the line
        stop <- nchar(x) + 1
      substr(x, start = start, stop = stop)
    }, USE.NAMES = FALSE)
    temp_result <- trim(temp_result)
    result <- append(result, temp_result)
  }
  result
}

txt <- pdf_text(src)
result <- character(0)
for (i in seq_along(txt)) {
  page <- txt[i]
  t1 <- unlist(strsplit(page, "\n"))
  # Pad every line to the same width so positions line up across lines.
  maxSize <- max(nchar(t1))
  t1 <- paste0(t1, strrep(" ", maxSize - nchar(t1)))
  result <- append(result, read_text(t1))
}
result
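To see the idea on a small input without needing a PDF, here is a minimal sketch of the same heuristic in base R. It is a simplified variant, not the code above: the helper name `split_two_columns` and the sample `page` are made up for illustration, it handles exactly two columns, and it assumes the gutter is the position where a space occurs on the most lines.

```r
# Hypothetical helper (not part of pdftools): reorder one page of two-column
# text so the left column is read in full before the right column.
split_two_columns <- function(lines) {
  # Pad every line to the same width so character positions are comparable.
  width <- max(nchar(lines))
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))

  # Collect the position of every space on every line (-1 means "no space").
  positions <- unlist(gregexpr(" ", padded, fixed = TRUE))
  positions <- positions[positions > 0]

  # The position holding a space on the most lines is taken as the gutter.
  # which.max() returns the leftmost position when several are tied.
  cut <- which.max(tabulate(positions, nbins = width))

  left  <- trimws(substr(padded, 1, cut))
  right <- trimws(substr(padded, cut + 1, width))
  c(left, right)
}

# Two lines of made-up text laid out the way pdf_text() returns a page.
page <- c(
  "first line    third line",
  "second one    fourth one"
)
split_two_columns(page)
# [1] "first line" "second one" "third line" "fourth one"
```

The heuristic can pick the wrong position when many lines are much shorter than the widest one, because the padding adds spaces on the right, so it is worth checking the result on a page or two before running it over a whole batch.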

Thank you for your comment. I am getting an error in the line `stops <- as.integer(names(sort(table(unlist(lstops)),decreasing=TRUE)[1:2]))`. The error reports that `lstops` is not found. It is not defined before that. — tsouchlarakis, Apr 06 '17 at 04:59
Sorry, was late in the night yesterday when I posted it. I tested and fixed it. Try it again. — Felipe Santiago, Apr 06 '17 at 12:47
This is great! I have not been able to find anything like this on the internet. I hope this will help people moving forward. Small change, the line `i <- 2` in the for loop needs to be taken out. Otherwise, it will only print the second page, `length(txt)` times. — tsouchlarakis, Apr 06 '17 at 18:37
