I'm trying to parse the page at the link below using rvest, but when I run

library(rvest)

link <- "https://mer.markit.com/br-reg/public/project.jsp?project_id=104000000028782"
page <- link %>% read_html()

I get the following error:

Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Failed to parse text
Anthony W

1 Answer

I'm not sure why this fails. My best guess is that the page's content is served by JavaScript that runs at load time. rvest does not execute JavaScript, so that script never runs when you fetch the page with read_html().

Here's a workaround using RSelenium, which drives a real browser and returns the HTML after the page has loaded:

# load libraries
library(RSelenium)
library(rvest)

link <- "https://mer.markit.com/br-reg/public/project.jsp?project_id=104000000028782"


# start RSelenium ------------------------------------------------------------

rD <- rsDriver(browser="firefox", port=4549L, chromever = NULL)
remDr <- rD[["client"]]

# Navigate to webpage -----------------------------------------------------
remDr$navigate(link)


# pull the webpage html
html <- remDr$getPageSource()[[1]]

html %>% read_html()

{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Typ ...
[2] <body onunload="GUnload();">\n<script ...

Edit

Just for fun, here's how to click the "details" button and then pull the information from the popup window that appears:

# click on the more details button
remDr$findElement(using = "css", value = ".btn")$clickElement()

# pull the page's html a second time
html2 <- remDr$getPageSource()[[1]]


# pull the text from the popup 

html2 %>% 
  read_html() %>% 
  html_node(".modal-content") %>% 
  html_node("dl") %>% 
  html_text()
[1] "Total Net Area (ha)\n\n50.32\nActively Eroding Blanket Bog (hagg/gully)\n\n0.05\nActively Eroding Blanket Bog (flat/bare)\n\n1.16\nDrained Blanket Bog (artificial)\n\n10.18\nDrained Blanket Bog (hagg/gully)\n\n38.93\nModified Blanket Bog\n\n0\nNear Natural Blanket Bog\n\n0\nActively Eroding Raised Bog (hagg/gully)\n\n0\nActively Eroding Raised Bog (flat/bare)\n\n0\nDrained Raised Bog (artificial)\n\n0\nDrained Raised Bog (hagg/gully)\n\n0\nModified Raised Bog\n\n0\nNear Natural Raised Bog\n\n0\nProject duration (years)\n\n30\nTotal predicted emission reductions over project lifetime (tCO2e)\n\n3283\nPredicted contribution to buffer over project lifetime (tCO2e)\n\n575\nPredicted claimable emission reductions over project lifetime (tCO2e)\n\n2708\n"
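
The text above alternates between a label and its value, so it can be reshaped into a table. Here is a minimal base-R sketch of one way to do that. It assumes every label is followed by exactly one non-empty numeric value and that neither contains a newline. The names `txt`, `parts`, and `details` are made up for illustration, and `txt` is a shortened copy of the output rather than the live result.

```r
# Shortened sample of the html_text() output shown above (illustrative only)
txt <- "Total Net Area (ha)\n\n50.32\nModified Blanket Bog\n\n0\nProject duration (years)\n\n30\n"

# Split on runs of newlines, trim whitespace, and drop any empty pieces
parts <- trimws(strsplit(txt, "\n+")[[1]])
parts <- parts[nzchar(parts)]

# Assumption: odd positions are labels and even positions are their values
details <- data.frame(
  field = parts[c(TRUE, FALSE)],
  value = as.numeric(parts[c(FALSE, TRUE)]),
  stringsAsFactors = FALSE
)

details
```

With the real page you would pass the result of html_text() in place of `txt`. If the popup ever contains an empty value, the label/value pairing would shift, so it is worth checking that `length(parts)` is even before building the data frame.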
Russ
  • Thanks. RSelenium is a nightmare to set-up on Linux! – Anthony W Mar 26 '23 at 17:50
  • I'm running it on Ubuntu! Did the code not work? – Russ Mar 26 '23 at 18:20
  • No, I get `[1] "Connecting to remote server" Could not open firefox browser. Client error message: Undefined error in httr call. httr output: Failed to connect to localhost port 8090 after 0 ms: Connection refused` – Anthony W Mar 26 '23 at 18:35
  • ehmm yes, it seems to always have issues haha. Not sure if it'll help, but this is what helped me get it running: https://stackoverflow.com/a/74735571/16502170 – Russ Mar 26 '23 at 18:41