I'm trying to scrape xml data into a dataframe from this website:

https://www.dmo.gov.uk/data/XmlDataReport?reportCode=D1A

I would like to end up with dt in the following format: [image]

However, my code keeps throwing errors, such as:

Error in open.connection(x, "rb") : Timeout was reached: [www.dmo.gov.uk] Connection timed out after 10001 milliseconds

Code is below:

library(data.table)
library(rvest)
library(xml2)

url <- read_html("https://www.dmo.gov.uk/data/XmlDataReport?reportCode=D1A")
dt <- rbindlist(lapply(url %>% html_nodes(css = "body > View_GILTS_IN_ISSUE > View_GILTS_IN_ISSUE") %>% 
                         xml_attrs(), 
                       function(x) as.data.table(t((x)))))
dt <- cbind(dt[,9, with = TRUE], 
            as.data.table(lapply(dt[,-9, with = TRUE], as.character)))
dt

Does anyone have any advice on how I can take this to completion?

Nad Pat
alec22

2 Answers

I was able to fix the issue by combining mkpt_uk's answer with the one available here: Package "rvest" for web scraping https site with proxy

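So, first download the file to disk:

# url is the report address as a string; destination is a local file path
download.file(url, destfile = destination)

and then parse the local copy:

content <- read_xml(destination)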
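
For anyone landing here later, this is roughly how the two steps fit together. It is only a sketch: fetch_report and attrs_to_dt are helper names I made up, the 120 second timeout is an arbitrary choice, and the sample XML uses invented attribute names so the parsing step can be tried without touching the network. Raising the timeout will not help if the connection is being blocked outright (for example by a proxy).

```r
library(xml2)
library(data.table)

# Hypothetical helper: download the report to a temp file, then parse it locally.
fetch_report <- function(url, timeout_sec = 120) {
  old <- options(timeout = timeout_sec)  # download.file() honours this option
  on.exit(options(old), add = TRUE)      # restore the previous setting on exit
  destination <- tempfile(fileext = ".xml")
  download.file(url, destfile = destination, mode = "wb", quiet = TRUE)
  read_xml(destination)
}

# Hypothetical helper: one row per child node, one column per attribute.
attrs_to_dt <- function(doc) {
  rows <- lapply(xml_attrs(xml_children(doc)), as.list)
  rbindlist(rows, fill = TRUE)
}

# Made-up sample with invented attribute names, standing in for the real report.
sample_xml <- '<Data>
  <View_GILTS_IN_ISSUE INSTRUMENT_NAME="Gilt A" ISIN_CODE="GB0000000001"/>
  <View_GILTS_IN_ISSUE INSTRUMENT_NAME="Gilt B" ISIN_CODE="GB0000000002"/>
</Data>'

dt <- attrs_to_dt(read_xml(sample_xml))
print(dt)
```

Against the live site the call would be attrs_to_dt(fetch_report("https://www.dmo.gov.uk/data/XmlDataReport?reportCode=D1A")), assuming the download itself goes through on your network.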
alec22

When I first tried, I fell at the first hurdle of actually downloading the file. I was consistently getting the error message: fatal SSL/TLS alert is received (e.g. handshake failed)

I eventually found a solution here: Error running weathercan package - fatal SSL/TLS alert (e.g. handshake failed), which worked like a dream. I was then able to do the following:

library(tidyverse)
library(xml2)

x <- read_xml("http://www.dmo.gov.uk/data/XmlDataReport?reportCode=D1A")

zz <- x %>% xml_children() %>% xml_attrs() %>% enframe() %>% unnest_wider(value)

I think this gives you exactly what you want, except that it is a tibble rather than a data.table.
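
If a data.table is what you need, the tibble can be converted afterwards. A minimal sketch, using a made-up XML sample with invented attribute names in place of the live report so it runs offline:

```r
library(xml2)
library(tibble)
library(tidyr)
library(data.table)

# Made-up sample standing in for the real report; attribute names are invented.
sample_xml <- '<Data>
  <View_GILTS_IN_ISSUE INSTRUMENT_NAME="Gilt A" ISIN_CODE="GB0000000001"/>
  <View_GILTS_IN_ISSUE INSTRUMENT_NAME="Gilt B" ISIN_CODE="GB0000000002"/>
</Data>'

x <- read_xml(sample_xml)

# Same steps as above, written without the pipe: one row per child node.
zz <- enframe(xml_attrs(xml_children(x)))
zz <- unnest_wider(zz, value)

# Convert the tibble to a data.table and drop the index column added by enframe().
dt <- as.data.table(zz)
dt[, name := NULL]
print(dt)
```

as.data.table() makes a copy; setDT(zz) would convert in place instead, which may be preferable for a large report.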

mkpt_uk
  • Curious: even when trying that method, I still get the "Error in open.connection(x, "rb") : Timeout was reached: [www.dmo.gov.uk] Connection timed out after 10000 milliseconds" message. – alec22 Mar 30 '22 at 10:56