I'm working with PDFs in my research and wrote an R scraper to extract some text data. Everything works fine, and I can read the data via:
library(pdftools)
library(tidyverse)
pdf_text("https://www.bankofengland.co.uk/-/media/boe/files/asset-purchase-facility/2009/2009-q1.pdf") %>%
  read_lines()
In addition, I want to exclude tables and footnotes by filtering on font size:
pdf_data("https://www.bankofengland.co.uk/-/media/boe/files/asset-purchase-facility/2009/2009-q1.pdf",
         font_info = TRUE, opw = "", upw = "")[[2]] %>%
  filter(font_size >= 10) %>%
  group_by(y) %>%
  summarise(text = paste(text, collapse = " ")) %>%
  select(-y)
This works well for the first two pages. However, the third page has two columns, so grouping by y merges words from both columns into the same line and the text comes out in the wrong order. Is there an easy fix for this?
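To make the problem reproducible without the PDF, here is a minimal sketch with made-up word positions in the shape that pdf_data returns (one row per word with x, y and text). The coordinates, the words, and the split at x = 200 are all assumptions for illustration, not values taken from the actual document. The second part shows roughly the result I am after, using a hard-coded column boundary that I would not know how to pick reliably for the real page.

```r
library(dplyr)

# Hypothetical word positions for a two-column page (made-up values),
# mimicking the x / y / text columns of pdf_data() output.
words <- tibble(
  x    = c(50, 90, 300, 340, 50, 90, 300, 340),
  y    = c(100, 100, 100, 100, 112, 112, 112, 112),
  text = c("Left", "one", "Right", "one", "Left", "two", "Right", "two")
)

# Grouping by y alone glues both columns into a single line.
merged <- words %>%
  group_by(y) %>%
  summarise(text = paste(text, collapse = " "))
merged$text
# "Left one Right one" "Left two Right two"

# What I am after: assign each word to a column first (here via an assumed,
# hard-coded boundary at x = 200), then build lines within each column.
by_column <- words %>%
  mutate(column = if_else(x < 200, 1L, 2L)) %>%
  group_by(column, y) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  arrange(column, y)
by_column$text
# "Left one" "Left two" "Right one" "Right two"
```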
I saw "Extract Text from Two-Column PDF with R", but I think the function there only fixes the pdf_text output and can't be used with pdf_data.