htmlTreeParse to vector R

Question

I'm scraping data from the web. I used readLines(), but now I have to switch to getURL() and htmlTreeParse().

    library(RCurl)  # provides getURL()
    library(XML)    # provides htmlTreeParse()
    a <- getURL(URL)
    b <- htmlTreeParse(a, encoding = "UTF-8")

The problem is that b$children$html$body returns NULL for me. Now I'm stuck trying to get each line of the parsed HTML into a vector.

I'd be grateful for any ideas.

//edit

I am trying to scrape this site:

    url <- "http://www.registeruz.sk/cruz-public/domain/accountingentity/show/1545622"

When I print b, the page's HTML looks readable and everything seems fine.

//edit2

    b$children$html['body']$body

seems closest to the solution

To be clearer: I would like the same output that readLines() gives, so that each line of HTML is one element of the vector.

//final edit

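    b <- htmlTreeParse(url, useInternalNodes = TRUE)
    html <- b["//body"][[1]]               # first <body> node
    html <- as(html, "character")          # serialize the node to one string
    vectors <- strsplit(html, "\n")[[1]]   # [[1]] gives a plain character vector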

This seems to create the same result. Thanks, everyone, for your help.
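
One detail that may be worth checking in the last step: strsplit() returns a list with one element per input string, which is probably why an earlier attempt gave "just a list". The sketch below uses a made-up string (`sample_body` is a hypothetical stand-in, not the real page) and a pattern that also tolerates Windows line endings; it is only an illustration under those assumptions.

```r
# Hypothetical stand-in for the serialized <body> string; not the real page.
sample_body <- "<body>\r\n<p>first</p>\n<p>second</p>\n</body>"

# strsplit() returns a list with one element per input string.
parts <- strsplit(sample_body, "\r?\n")
class(parts)

# Taking the first element yields a plain character vector, one item per line,
# which is the same shape that readLines() produces.
html_lines <- parts[[1]]
length(html_lines)
```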

This really isn't much to go on. It would be better if you included a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). I would guess that your HTML page has a different structure than you are expecting but you haven't shown anything to confirm or deny that. — MrFlick, Oct 24 '15 at 20:44
Please post more of your code, including a value for `URL` so that we can even attempt to run it. Or shall we guess :) — Shawn Mehan, Oct 24 '15 at 20:47
OK, I edited it. But I think the solution to my problem doesn't depend on the site, or at least I hope so, and the page should be fine. — P.Belai, Oct 24 '15 at 20:49
You might want `as(b$children$html[["body"]], "character")` if you want a character vector. I'm not sure why `$body` doesn't work. Also you may need to `strsplit()[[1]]` that by `"[\t\r\n]+"` — Rich Scriven, Oct 24 '15 at 20:53
@jlhoward tried it but it still doesn't return vector, just a list — P.Belai, Oct 24 '15 at 20:56
`> b$children$html['body'] $body
3
attr(,"class") [1] "XMLNodeList"` — Shawn Mehan, Oct 24 '15 at 21:01
also, just as an aside, you can `b<-htmlTreeParse(URL, encoding = "UTF-8", asText = TRUE)` and avoid the local file, methinks. — Shawn Mehan, Oct 24 '15 at 21:02
Not clear what you mean by "get each line of parsed html into a vector". What do you mean by each line? Each element (tag)? Each instance of a certain tag? — jlhoward, Oct 24 '15 at 21:04
I also get very different behavior with `target <- url(url) readLines(target) readHTMLTable(url)` where I get different output. Finally, `curl http://www.registeruz.sk/cruz-public/domain/accountingentity/show/1545622 Request RejectedThe requested URL was rejected. Please consult with your administrator.

Your support ID is: 17677329063826983315` which is again different. — Shawn Mehan, Oct 24 '15 at 21:09
I am trying to get the same output that readLines() generates. That means each line of HTML as an element of a vector. I would like to get everything: tags, text, etc. — P.Belai, Oct 24 '15 at 21:09
why aren't you using the xml/html processing capabilities of the XML pkg or xml2/rvest packages? — hrbrmstr, Oct 24 '15 at 21:11
@hrbrmstr I am pretty new to R and scraping and all my attempts failed — P.Belai, Oct 24 '15 at 21:21
You should post a question with what you are really trying to accomplish. There's ample evidence on SO that folks here are willing to explain scraping. — hrbrmstr, Oct 24 '15 at 21:32

score 1 · Accepted Answer · answered Oct 24 '15 at 20:55

Either of these should work:

    url <- "http://www.registeruz.sk/cruz-public/domain/accountingentity/show/1545622"

    b <- htmlTreeParse(url)
    class(b)
    # [1] "XMLDocumentContent"
    b$children$html["body"]

Or:

    b <- htmlTreeParse(url, useInternalNodes = TRUE)
    class(b)
    # [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"  "XMLAbstractDocument"
    b["//body"]

In the latter example, b is a parsed XML document and so can be indexed using XPath.
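
As a rough, offline illustration of XPath indexing on an internal document, the sketch below parses a small made-up string instead of the real site (`sample_html` is hypothetical). It uses getNodeSet(), which I believe is the explicit form of the `b["//body"]` shorthand, and saveXML() as one way to turn a node back into text; treat it as a sketch under those assumptions rather than the only way to do it.

```r
library(XML)

# Hypothetical sample markup standing in for a downloaded page.
sample_html <- "<html><body>\n<p>first</p>\n<p>second</p>\n</body></html>"

# asText = TRUE tells the parser the string is markup, not a file name or URL.
doc <- htmlTreeParse(sample_html, asText = TRUE, useInternalNodes = TRUE)

# An XPath query returns a node set (a list of nodes), hence the [[1]].
body_node <- getNodeSet(doc, "//body")[[1]]

# saveXML() serializes a node back to a single character string.
body_html <- saveXML(body_node)

# xpathSApply() applies a function (here xmlValue) to every matching node.
paragraphs <- xpathSApply(doc, "//p", xmlValue)
```

From here, `body_html` can be split on newlines if a line-by-line vector is needed, while `paragraphs` shows how to pull out just the text of specific tags.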

answered Oct 24 '15 at 20:55

jlhoward


OK, with a few changes it seems to work. I used `b["//body"][[1]]`, then saved it as character, and then split it into vectors. Thanks for your help – P.Belai Oct 24 '15 at 21:17
@P.Belai: can you please share how you saved it as character and then split it into vectors? Thank you! – hoang tran Jun 19 '19 at 14:58
