Extracting strings from a PDF with R

Question

I have this PDF file from the European Parliament, which you can download here. I have downloaded it and read it into R. It contains lists of names of Members of the European Parliament (MEPs) recorded after a voting session.

I want to extract only parts of these lists. Specifically, I want to extract the names located between "AVGIVNA RÖSTER" and "0" and put them in a table; see the text highlighted in this screenshot.

Similar series of names repeat throughout the PDF, each referring to a specific vote. I want them all in a table. The MEPs' names change but the structure stays the same: they are always located between "AVGIVNA RÖSTER" and "0".

I thought of using a startsWith function and a for loop, but I am struggling to write it.

Here is what I did so far:

library(pdftools)
library(tidyverse)

votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()

Cettt · Accepted Answer · 2020-01-21T12:34:22.913

You could try something like this:

votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()

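a <- which(grepl("AVGIVNA RÖSTER", votetext)) # lines where a block begins
b <- which(grepl("^\\s*0\\s*$", votetext)) # lines containing only "0", where a block ends

# for each start line, paste everything up to the next end line into one string
sapply(a, function(x){paste(votetext[x:(min(b[b > x]))], collapse = ". ")})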

Note that in the definition of b I use \\s* to allow for white space before and after the 0. Alternatively, you could first remove leading and trailing white space, see this question.
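To illustrate the two ways of handling white space, here is a small sketch on a made-up vector (the values in lines are hypothetical, not taken from the PDF). It uses only base R (grepl and trimws) and checks that the regex and the trim-then-compare approach pick out the same lines:

```r
# Hypothetical lines standing in for the pdf_text() output
lines <- c("   0  ", "0", "10", "0 votes", " + ")

# Regex approach: the line is only a 0, with optional surrounding white space
is_zero_regex <- grepl("^\\s*0\\s*$", lines)

# Trimming approach: strip leading/trailing white space, then compare exactly
is_zero_trim <- trimws(lines) == "0"

which(is_zero_regex)                     # positions of the lines that are just "0"
identical(is_zero_regex, is_zero_trim)   # both approaches agree on this vector
```

Lines such as "10" or "0 votes" are not matched by either approach, which is what you want for an end marker.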

In your case you could do:

votetext2 <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines() %>%
  str_remove("^\\s*") %>% #remove white space at the beginning
  str_remove("\\s*$") %>% #remove white space at the end
  str_replace_all("\\s+", " ") #replace multiple white spaces with a single white space

a2 <- which(votetext2 == "AVGIVNA RÖSTER")
b2 <- which(votetext2 == "0")

result <- sapply(a2, function(x){paste(votetext2[x:(min(b2[b2 > x]))], collapse = ". ")})

result then looks like this:

"AVGIVNA RÖSTER. Martin Hojsík, Naomi Long, Margarida Marques, Pedro Marques, Manu Pineda, Ramona Strugariu, Marie Toussaint,. + Dragoş Tudorache, Marie-Pierre Vedrenne. -. Agnès Evren. 0"
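If you want a table rather than one string per vote, something along these lines might work as a starting point. This is only a sketch under assumptions: the result vector below uses made-up names in the same shape as the output above, extract_names and vote_table are hypothetical names, and splitting on "," and "." would break names that themselves contain a period.

```r
# "AVGIVNA RÖSTER", written with a Unicode escape to avoid encoding problems
start_marker <- "AVGIVNA R\u00d6STER"

# Hypothetical stand-ins for elements of `result` (made-up names)
result <- c(
  paste0(start_marker, ". Ann Example, Bob Sample,. + Cara Demo, Eve-Marie Trial. -. Dan Test. 0"),
  paste0(start_marker, ". Fay Mock. 0")
)

extract_names <- function(block) {
  block <- sub(paste0("^", start_marker, "\\.\\s*"), "", block)  # drop the start marker
  block <- sub("\\.\\s*0$", "", block)                           # drop the end marker
  parts <- unlist(strsplit(block, "[,.]"))                       # split on commas and line joins
  parts <- trimws(sub("^\\s*[+-]\\s*", "", parts))               # drop leading "+" / "-" markers
  parts[parts != ""]                                             # drop empty pieces
}

names_per_vote <- lapply(result, extract_names)

# One row per name, with the index of the block it came from
vote_table <- data.frame(
  vote = rep(seq_along(names_per_vote), lengths(names_per_vote)),
  name = unlist(names_per_vote),
  stringsAsFactors = FALSE
)
vote_table
```

Note that this throws away the "+" and "-" markers. If you need to know which marker a name appeared under, you would have to carry that along as an extra column instead of deleting it.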

Thank you! It really helps me. It gives a very good result, easy to clean for further use. — hug, Jan 21 '20 at 12:28
