
I've been working on a web-scraping project for the political science department at my university.

The Danish parliament is very transparent about its democratic process and uploads all legislative documents to its website. I've been crawling all pages starting from 2008. Right now I'm parsing the information into a data frame, and I've run into an issue that I haven't been able to resolve so far.

If we look at the DOM, we can see that most of the objects are named div.tingdok-normal. The number of objects varies between 16 and 19 per page. To parse the information correctly for my data frame, I tried to grep out the necessary parts according to patterns. However, sometimes a pattern matches more than once, and I don't know how to tell R that I only want the first match.

For the sake of an example, I include some code:

library(RCurl)  # getURL()
library(rvest)  # read_html(), html_nodes(), html_text(); re-exports the %>% pipe

final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"

to.save <- getURL(final.url)

p <- read_html(to.save)

normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim = TRUE)

tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag",
             "Vedtaget regeringsforslag", "Vedtaget privat forslag")

type <- unique(grep(paste(tomatch, collapse = "|"), normal, value = TRUE))
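For what it's worth, grep() returns every element that matches, so one way to keep only the first hit is to index the result with [1]. A minimal sketch (the input vector here is a stand-in with assumed values, since the real scraped content varies by page):

```r
# Stand-in for the scraped spans; values assumed for illustration
normal <- c("21-03-2017", "Vedtaget regeringsforslag",
            "Vedtaget regeringsforslag")

tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag",
             "Vedtaget regeringsforslag", "Vedtaget privat forslag")

# grep() returns all matching elements; [1] keeps only the first one
type <- grep(paste(tomatch, collapse = "|"), normal, value = TRUE)[1]
type
# "Vedtaget regeringsforslag"
```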

Maybe you can help me with that.

Cyrus
  • I suggest you use an XML parser (e.g. [like so](https://stackoverflow.com/questions/17198658/how-to-parse-xml-to-r-data-frame)) [instead of regular expressions](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – mustaccio Dec 01 '18 at 16:02
  • Did exactly that: Stiller <- xpathSApply(doc = src, path = "//*[@id='ContentArea']/div[4]/div[1]/div/div[4]/span[2]", xmlValue, trim = TRUE). The problem is that the content of the div changes: for some pages it's the proposer of the bill, a few pages later it's the date. Since there is no fixed structure in the page, I still need to work with code that runs on matches rather than positions. – Tim Runck Dec 04 '18 at 09:13
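For what it's worth, the match-by-content logic can also be pushed into the XPath itself with contains(), instead of relying on fixed positions. A minimal sketch (the HTML fragment below is a hypothetical stand-in for the parliament page):

```r
library(rvest)  # also provides the %>% pipe

# Hypothetical fragment mimicking the page structure
page <- read_html('<div class="tingdok-normal"><span>Vedtaget regeringsforslag</span></div>
                   <div class="tingdok-normal"><span>21-03-2017</span></div>')

# Select spans by their text content, not by their position in the DOM
hits <- page %>%
  html_nodes(xpath = "//div[@class='tingdok-normal']/span[contains(., 'forslag')]") %>%
  html_text(trim = TRUE)
hits
# "Vedtaget regeringsforslag"
```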

1 Answer


My understanding is that you want to extract the text of the webpage, since the "tingdok-normal" divs contain that text. I was able to get the text of the page with the following code, which also identifies the position of the first regex hit for each of the patterns to match.

library(pagedown)
library(pdftools)
library(stringr)
pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm", 
                       "C:/.../danish.pdf")

text <- pdftools::pdf_text("C:/.../danish.pdf")

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locate the first hit of the regex
  # (to locate all hits, use stringr::str_locate_all)
  list_Position[[i]] <- stringr::str_locate(text, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = text,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
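For context, stringr::str_locate() always returns just the first match per string, which is what the loop above relies on; a quick illustration on a toy string:

```r
library(stringr)

s <- "aftalen og aftalen"

first <- str_locate(s, "(A|a)ftalen")      # start/end of the first hit only
every <- str_locate_all(s, "(A|a)ftalen")  # every hit, as a list of matrices

first
#      start end
# [1,]     1   7
```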

Here is another approach:

library(RDCOMClient)
library(stringr)
library(rvest)

url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)

Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locate the first hit of the regex
  # (to locate all hits, use stringr::str_locate_all)
  list_Position[[i]] <- stringr::str_locate(html_Content, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = html_Content,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
Emmanuel Hamel