I have a PDF with several hundred pages. It contains press releases of varying length (from one page to several pages).
Each press release, however, starts and ends with a similar structure:
Example of the head of one press release: OTS0071 5 AI 0339 MAA0001 Do, 14.Dez 2017
Example of the tail of the respective press release: 141028 Dez 17
Reading the pdf file into R is easy:
library(readtext)
df <- readtext("ots.pdf", encoding = "UTF8")
Here is an example (the dput() output of the resulting object):
structure(list(doc_id = "ots.pdf", text = "OTS0071 5 AI 0339 MAA0001 Do, 14.Dez 2017\n\nText of press release 1\n\n\n\nOTS0071 2017-12-14/10:28\n\n141028 Dez 17\n\n\n\n\nOTS0184 5 AI 0120 MAA0001 Di, 12.Dez 2017\n\nText of press release 2\n\n\n\nOTS0184 2017-12-12/15:46\n\n121546 Dez 17\n\n\n\n\nOTS0018 5 AI 0206 MAA0002 So, 10.Dez 2017\n\nText of press release 3\n\n\nOTS0018 2017-12-10/12:00\n\n101200 Dez 17\n"), row.names = c(NA,
-1L), class = c("readtext", "data.frame"))
But how can I tell R to read in every single press release as a new observation with the following three variables?
id = the OTS number of the press release; in the example above it is OTS0071
date = the date of the press release; in the example above it is Do, 14.Dez 2017 (i.e., Thursday, 14 December 2017)
text = the rest of the text between the head and the tail
I managed to extract all press releases and save them into a list with the following command:
library(stringr)
x <- str_extract_all(df$text, "(OTS[0-9]{4})((.|\n)*?)([[:digit:]]{6} [[:alpha:]]{3} [[:digit:]]{2})")
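To make the current state concrete, here is a small sketch of what that extraction returns on the example text. The text is inlined as a plain string (the name txt is only for illustration) so the snippet runs without the PDF or readtext.

```r
library(stringr)

# Example text from the question, inlined so the snippet runs without the PDF.
txt <- paste0(
  "OTS0071 5 AI 0339 MAA0001 Do, 14.Dez 2017\n\nText of press release 1\n\n\n\n",
  "OTS0071 2017-12-14/10:28\n\n141028 Dez 17\n\n\n\n\n",
  "OTS0184 5 AI 0120 MAA0001 Di, 12.Dez 2017\n\nText of press release 2\n\n\n\n",
  "OTS0184 2017-12-12/15:46\n\n121546 Dez 17\n\n\n\n\n",
  "OTS0018 5 AI 0206 MAA0002 So, 10.Dez 2017\n\nText of press release 3\n\n\n",
  "OTS0018 2017-12-10/12:00\n\n101200 Dez 17\n"
)

pattern <- "(OTS[0-9]{4})((.|\n)*?)([[:digit:]]{6} [[:alpha:]]{3} [[:digit:]]{2})"
x <- str_extract_all(txt, pattern)

length(x)              # one list element per input string
length(x[[1]])         # one character string per matched press release
str_sub(x[[1]], 1, 7)  # the OTS number at the start of each match
```

So x is a list with a single character vector inside it, and each element of that vector is one complete press release from head to tail, with id, date, and body still glued together in one string.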
But how can I transform x (a list) into a data frame and add the variables id, date, and text?
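One direction that might work, sketched only against the example text and not tested on the full PDF: use str_match_all() instead of str_extract_all(), so that each capture group lands in its own column. The pattern below is my own assumption about the head layout (a two-letter weekday, a comma, then day.month year on the same line as the OTS number); the names txt, m, and releases are only illustrative.

```r
library(stringr)

# Example text from the question, inlined so the snippet runs without the PDF.
txt <- paste0(
  "OTS0071 5 AI 0339 MAA0001 Do, 14.Dez 2017\n\nText of press release 1\n\n\n\n",
  "OTS0071 2017-12-14/10:28\n\n141028 Dez 17\n\n\n\n\n",
  "OTS0184 5 AI 0120 MAA0001 Di, 12.Dez 2017\n\nText of press release 2\n\n\n\n",
  "OTS0184 2017-12-12/15:46\n\n121546 Dez 17\n\n\n\n\n",
  "OTS0018 5 AI 0206 MAA0002 So, 10.Dez 2017\n\nText of press release 3\n\n\n",
  "OTS0018 2017-12-10/12:00\n\n101200 Dez 17\n"
)

# Assumed layout: id, filler on the same line, date, lazy body, tail stamp.
# (?s) lets "." match newlines, so the (.|\n) alternation is not needed.
pattern <- paste0(
  "(?s)(OTS[0-9]{4})[^\\n]*?",                                # group 1: id
  "([[:alpha:]]{2}, [0-9]{1,2}\\.[[:alpha:]]{3} [0-9]{4})",   # group 2: date
  "(.*?)",                                                    # group 3: body
  "[0-9]{6} [[:alpha:]]{3} [0-9]{2}"                          # tail, not captured
)

# str_match_all() returns one matrix per input string:
# column 1 is the full match, columns 2..4 are the capture groups.
m <- str_match_all(txt, pattern)[[1]]

releases <- data.frame(
  id   = m[, 2],
  date = m[, 3],
  text = str_trim(m[, 4]),   # drop the blank lines around the body
  stringsAsFactors = FALSE
)

releases[, c("id", "date")]
```

With the real file, txt would be replaced by df$text. Note that text here is literally everything between head and tail, so it still includes the line such as OTS0071 2017-12-14/10:28; the date column is also left as a character string rather than converted to a Date.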