
I want to import multiple PDF files into R, but each page has four columns, a header/footer line, and a table of contents.


For text mining, I want to remove the header/footer lines and the table of contents from my file or character vector.

Right now I am using two functions to read in the files. The first is pdf_text, which keeps the page structure but can't deal with the four columns. The second is extract_text, which on its own doesn't keep the pages but can deal with the column structure (and copes decently with the tables that occur).

But neither of them is able to remove the table of contents (as far as I have tried).
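To make the goal concrete, here is a minimal base-R sketch of one possible way to drop lines that repeat across pages, such as headers, footers, or a navigation-style table of contents. It is not tested on the reports themselves. The helper name drop_repeated_lines, the threshold min_share, and the toy pages are all hypothetical, and it assumes the text is available as one string per page with the repeated elements on lines of their own (which may not hold for pdf_text output, where the four columns share a line).

```r
# Hypothetical helper: drop lines that recur on a large share of pages.
# `pages` is a character vector with one string per page (an assumption).
drop_repeated_lines <- function(pages, min_share = 0.6) {
  # split every page into trimmed, non-empty lines
  page_lines <- lapply(strsplit(pages, "\r?\n"), function(x) {
    x <- trimws(x)
    x[nzchar(x)]
  })
  # replace digit runs so that lines differing only in the page number match
  normalise <- function(x) gsub("[0-9]+", "#", x)
  # count on how many pages each normalised line occurs (once per page)
  keys <- unlist(lapply(page_lines, function(x) unique(normalise(x))))
  share <- table(keys) / length(pages)
  repeated <- names(share)[share >= min_share]
  # keep only lines whose normalised form is not repeated, then re-join
  vapply(page_lines, function(x) {
    paste(x[!normalise(x) %in% repeated], collapse = " ")
  }, character(1))
}

# toy pages standing in for the real reports (made-up text)
pages <- c(
  "Report 2021 | 1\nContents  Strategy  Governance\nWe insure customers.",
  "Report 2021 | 2\nContents  Strategy  Governance\nClimate targets were met.",
  "Report 2021 | 3\nContents  Strategy  Governance\nThe board met twice."
)
cleaned <- drop_repeated_lines(pages)
print(cleaned)
```

Replacing digits before comparing lets a footer like "Report 2021 | 34" match across pages, but it could also remove body lines that differ only in their numbers, so the threshold would need checking against the real files.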

My data set is not exactly minimal, but with a smaller one I ran into problems with the data structures. Here is working code:

    ################ relevant code ##############
    library(pdftools)
    library(tidyverse)
    library(tidytext)   # provides unnest_tokens()
    library(tabulizer)

    file_url <- c("https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/sustainability/documents/Allianz_Group_Sustainability_Report_2021-web.pdf", "https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/investor-relations/en/results-reports/annual-report/ar-2021/en-Allianz-Group-Annual-Report-2021.pdf")
    # one local file name per URL; the second name is simply taken from its URL
    files_name <- c("Nachhaltigkeit 2021.pdf", basename(file_url[2]))

    # extract_text() reads local files here, so download any that are missing
    for (i in seq_along(file_url)) {
      if (!file.exists(files_name[i])) download.file(file_url[i], files_name[i], mode = "wb")
    }

    # pdf_text() returns one string per page; used here only to count pages
    reports_list <- lapply(files_name, pdf_text)

    createTibble <- function(){
      tibble_together <- NULL
      # for all files
      for(i in seq_along(files_name)){

        page_nr <- length(reports_list[[i]])

        # with explicit pages, extract_text() returns one string per page;
        # "\r?\n" matches both Windows and Unix line breaks
        tib <- tibble(report = rep(files_name[i], page_nr), page = 1:page_nr, text = gsub("\r?\n", " ",
                      extract_text(files_name[[i]], pages = 1:page_nr)))
        tibble_together <- rbind(tibble_together, tib)
      }
      return(tibble_together)
    }

    reports_df <- createTibble()

    ############ code for problem visualization ###############
    reports_df <- reports_df %>% unnest_tokens(output = word, input = text, token = "words")
    # e.g. this part contains the table of contents, which is not intended
    (reports_df %>% filter(page == 34, report == "Nachhaltigkeit 2021.pdf"))$word[832:885]

Thanks in advance for your help.

PS: It's my first question, so if you need anything else, let me know. I know that the function createTibble probably isn't optimal, but that's not my primary concern.

UeberQ
  • Can you include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and any code you've tried so far, even if it doesn't fully work? – jrcalabrese Jan 11 '23 at 17:26
