Extract pages and structured content from a PDF and save them to a data frame

Question

I have a PDF with several hundred pages. The PDF contains press releases of varying length (from one page to several pages).

Each press release, however, starts and ends with a similar structure:

Example of the head of one press release: OTS0071 5 AI 0339 MAA0001 Do, 14.Dez 2017

Example of the tail of the respective press release: 141028 Dez 17

Reading the pdf file into R is easy:

df <- readtext("ots.pdf", encoding = "UTF8")

Here is an example file:

structure(list(doc_id = "ots.pdf", text = "OTS0071 5 AI 0339 MAA0001                            Do, 14.Dez 2017\n\nText of press release 1\n\n\n\nOTS0071                   2017-12-14/10:28\n\n141028 Dez 17\n\n\n\n\nOTS0184 5 AI 0120 MAA0001                           Di, 12.Dez 2017\n\nText of press release 2\n\n\n\nOTS0184 2017-12-12/15:46\n\n121546 Dez 17\n\n\n\n\nOTS0018 5 AI 0206 MAA0002                    So, 10.Dez 2017\n\nText of press release 3\n\n\nOTS0018 2017-12-10/12:00\n\n101200 Dez 17\n"), row.names = c(NA, 
-1L), class = c("readtext", "data.frame"))

But how can I tell R to read in every single press release as a new observation with the following three variables: ID, date, text

id = the OTS number of the press release, in the example above it is OTS0071

date = the date of the press release, in the example above it is Do, 14.Dez 2017 (i.e., Thursday, 14 December 2017)

text = the rest of the text between the head and the tail

I managed to extract all press releases and save them into a list with the following command:

x <- str_extract_all(df$text, "(OTS[0-9]{4})((.|\n)*?)([[:digit:]]{6} [[:alpha:]]{3} [[:digit:]]{2})")

But how can I transform x (a list) into a data frame and add the variables id, date, and text?

score 1 · Answer 1 · answered Jul 30 '22 at 13:15

I think I finally solved it myself.

Required Packages:

library(pacman)   # library() stops with an error if pacman is missing; require() only warns

p_load(readtext,    # read files
       lubridate,   # work with date-times and time-spans
       plyr,        # Splitting, Applying and Combining Data
       tidyverse    # data manipulation and plotting
)

First, reading in the pdf:

df <- readtext("ots.pdf", encoding = "UTF8")

Or use the example data set:

df <- structure(list(doc_id = "ots.pdf", text = "OTS0071 5 AI 0339 MAA0001                            Do, 14.Dez 2017\n\nText of press release 1\n\n\n\nOTS0071                   2017-12-14/10:28\n\n141028 Dez 17\n\n\n\n\nOTS0184 5 AI 0120 MAA0001                           Di, 12.Dez 2017\n\nText of press release 2\n\n\n\nOTS0184 2017-12-12/15:46\n\n121546 Dez 17\n\n\n\n\nOTS0018 5 AI 0206 MAA0002                    So, 10.Dez 2017\n\nText of press release 3\n\n\nOTS0018 2017-12-10/12:00\n\n101200 Dez 17\n"), row.names = c(NA, 
-1L), class = c("readtext", "data.frame"))

Second, extracting the different press releases in the text:

# The tail (e.g. "141028 Dez 17") starts with six digits, so match {6} as in the question
x <- str_extract_all(df$text, "(OTS[0-9]{4})((.|\n)*?)([[:digit:]]{6} [[:alpha:]]{3} [[:digit:]]{2})")
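To see why the result needs unpacking before it can become a data frame, here is a minimal sketch on made-up toy text (the string `toy` and the name `releases` are only for illustration). It shows that `str_extract_all()` returns a list with one element per input string, and that this element is a character vector with one entry per match:

```r
library(stringr)

# Toy input: two fake releases inside ONE string (illustrative data only)
toy <- "OTS0001 head\nbody A\n010101 Jan 17\n\nOTS0002 head\nbody B\n020202 Feb 17\n"

x <- str_extract_all(
  toy,
  "(OTS[0-9]{4})((.|\n)*?)([[:digit:]]{6} [[:alpha:]]{3} [[:digit:]]{2})"
)

length(x)         # the list has one element because `toy` is a single string
releases <- x[[1]]
length(releases)  # one entry per matched press release
```

Because the PDF is read in as a single string, all press releases end up in `x[[1]]`.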

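Third, transform the resulting list into a tibble with a single column named "pressReleases":

# str_extract_all() returns a list with one character vector per input string;
# take the first (and only) element and store it as a one-column tibble
df_tibble <- tibble(pressReleases = x[[1]])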

Fourth, create the variables and remove the variable "pressReleases":

df_tibble <- df_tibble %>%
  mutate(
    # date in the head, e.g. "14.Dez 2017" (the dot is escaped so it matches literally)
    date = str_extract(pressReleases, "[[:digit:]]{2}\\.[[:alpha:]]{3} [[:digit:]]{4}"),
    # OTS number, e.g. "OTS0071"
    ots  = str_extract(pressReleases, "OTS[0-9]{4}"),
    # everything from the date onwards; note that this still includes the date and the tail lines
    text = str_extract(pressReleases, "([[:digit:]]{2}\\.[[:alpha:]]{3} [[:digit:]]{4})((.|\n)*)")
  ) %>%
  select(-pressReleases)
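A `text` pattern that starts at the date keeps the date and the tail lines inside the text. If only the body between head and tail is wanted, one possible alternative is a single pattern with three capture groups. This is a sketch, not a tested solution for the real PDF: it assumes every release ends with a footer that repeats the same OTS number followed by a timestamp like `2017-12-14/10:28`, as in the example data. The names `raw`, `pattern` and `releases` are illustrative.

```r
library(stringr)

# Shortened copy of the example data (two releases)
raw <- paste0(
  "OTS0071 5 AI 0339 MAA0001      Do, 14.Dez 2017\n\n",
  "Text of press release 1\n\n\n\n",
  "OTS0071       2017-12-14/10:28\n\n141028 Dez 17\n\n\n\n\n",
  "OTS0184 5 AI 0120 MAA0001     Di, 12.Dez 2017\n\n",
  "Text of press release 2\n\n\n\n",
  "OTS0184 2017-12-12/15:46\n\n121546 Dez 17\n"
)

# Group 1 = id, group 2 = date, group 3 = body.
# (?s) lets "." match newlines; \\1 requires the footer to repeat the id.
pattern <- paste0(
  "(?s)(OTS\\d{4})[^\\n]*?(\\d{2}\\.[[:alpha:]]{3} \\d{4})[^\\n]*\\n",
  "(.*?)",
  "\\1\\s+\\d{4}-\\d{2}-\\d{2}/\\d{2}:\\d{2}\\s+\\d{6} [[:alpha:]]{3} \\d{2}"
)

# str_match_all() returns one matrix per input string:
# column 1 is the full match, columns 2-4 are the capture groups
m <- str_match_all(raw, pattern)[[1]]

releases <- data.frame(
  id   = m[, 2],
  date = m[, 3],
  text = str_squish(m[, 4]),  # collapse line breaks and trim whitespace
  stringsAsFactors = FALSE
)
```

With this approach the newline cleanup happens in the same step via `str_squish()`, and the footer never reaches the `text` column.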

Finally, replace the newline characters ("\n") with spaces and convert the dates to Date format:

df_tibble$text <- gsub("\n"," ", df_tibble$text)
df_tibble$date <- dmy(df_tibble$date)
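One caveat: `dmy()` matches month names against the session's time locale, so a German abbreviation such as "Dez" may come back as NA on a non-German system. A locale-independent alternative is to translate the month with a lookup table. The helper `parse_german_date()` below is hypothetical (it is not part of lubridate), and the abbreviations in the table are assumptions that should be checked against the spellings actually used in the PDF (for example "Mrz" vs. "Mär", or the Austrian "Jän"):

```r
# Hypothetical helper: parse dates such as "14.Dez 2017" without relying on the locale
parse_german_date <- function(x) {
  # Assumed month abbreviations; extend or adjust to match the real data
  months_de <- setNames(
    1:12,
    c("Jan", "Feb", "Mrz", "Apr", "Mai", "Jun",
      "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")
  )
  # "14.Dez 2017" -> c("14", "Dez", "2017")
  parts <- strsplit(x, "[. ]+")
  day   <- as.integer(vapply(parts, `[`, character(1), 1))
  month <- unname(months_de[vapply(parts, `[`, character(1), 2)])
  year  <- as.integer(vapply(parts, `[`, character(1), 3))
  # Unknown abbreviations give an NA month and therefore an NA date
  as.Date(ISOdate(year, month, day))
}

dates <- parse_german_date(c("14.Dez 2017", "12.Dez 2017", "10.Dez 2017"))
```

The result is a regular `Date` vector, so it can be assigned to `df_tibble$date` in place of the `dmy()` call.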
