I am trying to get the contents of href using XPath, as described in these two posts. Unfortunately, the code returns the literal text "href" and several spaces in addition to the URL. How can I avoid that?

library(XML)

html <- readLines("http://www.msu.edu")
html.parse <- htmlParse(html)
Node <- getNodeSet(html.parse, "//div[@id='MSU-top-utilities']//a/@href")
Node[[1]]

# > Node[[1]]
#                  href 
# "students/index.html" 
# attr(,"class")
# [1] "XMLAttributeValue"
Kevin M

1 Answer

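Node[[1]] is just a named character vector whose single element is named "href", which is why the name is printed above the value. You can drop the name with: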

as.character(Node[[1]])

which will give you

## [1] "students/index.html"
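
To get every href rather than only the first, the same conversion can be applied across the whole node set. The following is a minimal sketch in base R that runs without network access: `make_attr` and `node_set` are hypothetical stand-ins that imitate the structure printed in the question (a named character vector with class "XMLAttributeValue"), not objects produced by the XML package.

```r
# Hypothetical helper: builds a value shaped like the one printed in the
# question (a character vector named "href" with class "XMLAttributeValue").
make_attr <- function(url) {
  structure(c(href = url), class = "XMLAttributeValue")
}

# Stand-in for the result of getNodeSet(); the URLs are sample data.
node_set <- list(
  make_attr("students/index.html"),
  make_attr("alumni/index.html")
)

# as.character() drops the name and the class attribute of one element.
first_href <- as.character(node_set[[1]])

# vapply() applies the same conversion to every element and returns
# a plain, unnamed character vector.
all_hrefs <- vapply(node_set, as.character, character(1))
```

With the real `Node` object from the question, the last line would presumably become `vapply(Node, as.character, character(1))`.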

Alternatively, here's a much better idiom using the xml2 package, where xml_attr() returns a plain character vector for all matched nodes:

library(xml2)

doc <- read_html("http://www.msu.edu")
nodes <- xml_find_all(doc, "//div[@id='MSU-top-utilities']//a")
xml_attr(nodes, "href")

## [1] "students/index.html"      "faculty-staff/index.html" "alumni/index.html"       
## [4] "businesses/index.html"    "visitors/index.html"   
hrbrmstr