
I'm dealing with PDFs in my research and I wrote an R scraper for some text data. Everything works fine and I can read the data via:

library(pdftools)
library(tidyverse)

pdf_text("https://www.bankofengland.co.uk/-/media/boe/files/asset-purchase-facility/2009/2009-
q1.pdf") %>% 
read_lines()

In addition, I want to exclude tables and footnotes by filtering on font size:

pdf_data("https://www.bankofengland.co.uk/-/media/boe/files/asset-
purchase-facility/2009/2009-q1.pdf", font_info = T, opw = "", upw = 
"")[[2]] %>% 
  filter(font_size>=10) %>%
  group_by(y) %>% 
  summarise(text=paste(text,collapse =" ")) %>% 
  select(-y)  

This works well for the first two pages. However, the third page has two columns, so the text is combined incorrectly. Is there an easy fix for this?

I saw Extract Text from Two-Column PDF with R, but that function only post-processes the pdf_text output and can't be used with pdf_data, I think?

Martin

1 Answer


pdftools::pdf_ocr_data() is a wrapper around tesseract, a well-known OCR engine (see the package vignette). Using tesseract directly means we can set some/all of its many options (see tesseract_params()); among them is the page segmentation mode, which controls how the layout of an image, including multiple columns, is analysed.
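For example, you can browse the available engine options by name; tesseract_params() takes a filter string (a quick look, assuming the relevant options contain "pageseg" in their names):

library(tesseract)

# List tesseract options whose names mention page segmentation
tesseract_params("pageseg")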

Here's what that could look like:

library(pdftools)
library(tesseract)

# Convert the third page of the pdf to an image
p3 <- pdftools::pdf_convert("./2009-q1.pdf", pages = 3, format = "tiff", dpi = 600)

# Use automatic page segmentation with orientation detection (PSM 1),
# which handles multi-column layouts
eng <- tesseract(language = "eng", options = list(tessedit_pageseg_mode = 1))
text <- tesseract::ocr(p3, engine = eng)

# Print result
cat(text)

Alternatively, consider taking a look at tabulizer::extract_text(file) or defining the column dimensions directly.

Edit

tabulizer::extract_text() can detect columns and extract text automatically:

library(tabulizer)
library(rJava)

# Extract text from all pages; columns are detected automatically
t3 <- tabulizer::extract_text("./2009-q1.pdf")
cat(t3)

Edit 2

Use known column widths with pdftools like this, where each element of the output list is a page and each sub-element is a column of that page:

# Define the x-value threshold that separates the two columns
xcol <- 300

# Read the whole document once (font_info is needed for font_size)
pages <- pdf_data("./2009-q1.pdf", font_info = TRUE, opw = "", upw = "")

out <- list()
for (i in seq_along(pages)) {
  temp <- list()
  # Column 1: words left of the threshold
  temp[[1]] <- pages[[i]] %>% 
    filter(font_size >= 10, x < xcol) %>%
    group_by(y) %>% 
    summarise(text = paste(text, collapse = " ")) %>% 
    select(-y)
  # Column 2: words at or right of the threshold
  temp[[2]] <- pages[[i]] %>% 
    filter(font_size >= 10, x >= xcol) %>%
    group_by(y) %>% 
    summarise(text = paste(text, collapse = " ")) %>% 
    select(-y)
  out[[i]] <- temp
}
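
To stitch the result back into reading order (column 1 followed by column 2 on each page), something like this works on the out list above (a small sketch):

# Collapse each page into one string: column 1 first, then column 2
page_text <- sapply(out, function(p) {
  paste(c(p[[1]]$text, p[[2]]$text), collapse = "\n")
})
cat(page_text[3])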

Or, if the column width could vary, maybe try something like autothresholdr to estimate the break in x automatically.
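For reference, a rough base-R sketch of that idea: project every word box onto the x-axis and take the widest empty band as the column break (my own gap heuristic, not autothresholdr itself; it assumes no full-width line crosses the gap):

library(pdftools)

estimate_xcol <- function(page_df) {
  df <- page_df[order(page_df$x), ]
  ends <- cummax(df$x + df$width)        # running right edge of word boxes
  gaps <- df$x[-1] - ends[-nrow(df)]     # whitespace before each next word
  i <- which.max(gaps)
  # Midpoint of the widest empty band; a small gaps[i] suggests the page
  # is single-column and needs no split
  ends[i] + gaps[i] / 2
}

pages <- pdf_data("./2009-q1.pdf")
estimate_xcol(pages[[3]])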

Skaqqs
  • Unfortunately, this does not solve my problem. The approach would mean that I have to determine in advance which pages are two-column. This is impossible in my use case because I have too much data. – Martin Sep 22 '21 at 09:35
  • Ah, ok, makes sense. Were you able to try `tabulizer::extract_text()` ? It appears to detect the number of columns and extract text automatically on all three pages from the doc in your question. It might take more work to remove the tables and footnotes though. – Skaqqs Sep 22 '21 at 12:09
  • And this combination is exactly my problem :D – Martin Sep 23 '21 at 07:10