15

I have a lot of PDFs which are in two-column format. I am using the pdftools package in R. Is there a way to read each PDF according to the two-column format without cropping each PDF individually?

Each PDF consists of selectable text, and the pdf_text function has no problem reading it. The only issue is that it reads the first line of the first column and then continues with the first line of the second column, instead of moving down the first column.

Thank you very much in advance for your help.

A Newman
tsouchlarakis
  • I'm not aware of a function that reads two-column PDFs. I think you have to write your own procedure that reads each line, separates each column per line, rbind() each line per column per page, then rbind() each column per page, then rbind() each page to have a complete dataset that reads in the order it was written. – Ryan Morton Mar 01 '17 at 21:20
  • That makes sense, the only issue is that R will read straight across a column and put only a space between words that are on either side. There is no way to differentiate that space from a normal space. – tsouchlarakis Mar 02 '17 at 00:48
  • See this webpage for another approach that falls along similar lines: http://blog.agileactors.com/blog/2017/9/5/how-to-extract-and-clean-data-from-pdf-files-in-r – Rich Pauloo Jan 29 '19 at 05:04

2 Answers

13

There is a much easier way to do this using the tabulizer::extract_text(file) function.

It works with PDF text contained in a single column and PDF text contained in 2+ columns. Yes, it's that simple!
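As a minimal sketch of how that call might be wrapped: the helper name `read_pdf_text` and the path "paper.pdf" are placeholders I made up, and only `tabulizer::extract_text()` comes from the package. It assumes tabulizer and its Java dependency are installed.

```r
# Hypothetical wrapper around tabulizer::extract_text().
# Only extract_text() is part of the package; the rest is illustrative.
read_pdf_text <- function(file, pages = NULL) {
  if (!file.exists(file)) {
    stop("File not found: ", file)
  }
  if (!requireNamespace("tabulizer", quietly = TRUE)) {
    stop("The tabulizer package (which needs Java) is not installed.")
  }
  # pages = NULL asks for the whole document.
  tabulizer::extract_text(file, pages = pages)
}

# Usage with a placeholder path:
# txt <- read_pdf_text("paper.pdf")
# cat(txt)
```

Whether the columns come out in reading order is still worth checking on one of your own files before running it over a whole folder.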

11

I had the same problem. For each page of my PDFs, I found the character positions where a space occurs most often across the lines and stored them in a vector. Then I sliced every line at that position.

library(pdftools)
src <- ""  # path to the PDF file
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

QTD_COLUMNS <- 2
read_text <- function(text) {
  # Start empty so no blank element is prepended to the output.
  result <- character(0)
  # Get the positions of every " " in each line of the page.
  lstops <- gregexpr(pattern =" ",text)
  # Put the most frequent " " positions in a vector; the first one is used as the column boundary.
  stops <- as.integer(names(sort(table(unlist(lstops)),decreasing=TRUE)[1:2]))
  # Slice each line according to the specified number of columns (this can be improved).
  for(i in seq(1, QTD_COLUMNS, by=1))
  {
    temp_result <- sapply(text, function(x){
      start <- 1
      stop <- stops[i]
      if(i > 1)
        start <- stops[i-1] + 1
      if(i == QTD_COLUMNS) # last column, read until the end of the line
        stop <- nchar(x)+1
      substr(x, start=start, stop=stop)
    }, USE.NAMES=FALSE)
    temp_result <- trim(temp_result)
    result <- append(result, temp_result)
  }
  result
}

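txt <- pdf_text(src)
result <- character(0)
for (i in seq_along(txt)) {
  page <- txt[i]
  # Split the page into lines.
  t1 <- unlist(strsplit(page, "\n"))
  # Pad every line to the same width so character positions line up across lines.
  maxSize <- max(nchar(t1))
  t1 <- paste0(t1, strrep(" ", maxSize - nchar(t1)))
  result <- append(result, read_text(t1))
}
result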
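To see the idea without a PDF, here is a minimal sketch on a synthetic page. The names `split_two_columns`, `left_col`, and `right_col` are made up for illustration. It is a simplified variant of the code above, not a replacement: it assumes exactly two columns and cuts at the leftmost position that is blank on the largest share of lines.

```r
# Illustrative variant of the "most frequent space position" idea.
split_two_columns <- function(page) {
  lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
  width <- max(nchar(lines))
  # Pad every line to the same width so positions line up.
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))
  # One row per line, one column per character position.
  chars <- do.call(rbind, strsplit(padded, "", fixed = TRUE))
  # Share of lines that have a space at each position.
  blank_share <- colMeans(chars == " ")
  # Leftmost position with the highest share is taken as the cut.
  cut <- which.max(blank_share)
  left <- trimws(substr(padded, 1, cut))
  right <- trimws(substr(padded, cut + 1, width))
  # Whole left column first, then the whole right column.
  c(left, right)
}

# Synthetic page: left column padded to 14 characters, then the right column.
left_col <- c("alpha beta", "gamma pi", "rho sigma")
right_col <- c("delta one", "epsilon x", "zeta two")
page <- paste(sprintf("%-14s%s", left_col, right_col), collapse = "\n")

print(split_two_columns(page))
```

On real pages the cut can land in the left margin or on a common word break if the gutter is not blank on every line, so check the result on a sample page first.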
Felipe Santiago
  • Thank you for your comment. I am getting an error in the line `stops <- as.integer(names(sort(table(unlist(lstops)),decreasing=TRUE)[1:2]))`. The error reports that `lstops` is not found. It is not defined before that. – tsouchlarakis Apr 06 '17 at 04:59
  • Sorry, was late in the night yesterday when I posted it. I tested and fixed it. Try it again. – Felipe Santiago Apr 06 '17 at 12:47
  • This is great! I have not been able to find anything like this on the internet. I hope this will help people moving forward. Small change, the line `i <- 2` in the for loop needs to be taken out. Otherwise, it will only print the second page, `length(txt)` times. – tsouchlarakis Apr 06 '17 at 18:37