I'm scraping data from the web. I used readLines(), but now I have to switch to getURL() and htmlTreeParse().

    a <- getURL(URL)
    b <- htmlTreeParse(a, encoding = "UTF-8")

The problem is that b$children$html$body returns NULL for me. Now I'm stuck trying to get each line of the parsed HTML into a vector.

I'll be thankful for every idea.

//edit

I am trying to scrape from this site:

    url <- "http://www.registeruz.sk/cruz-public/domain/accountingentity/show/1545622"

When I print the variable b, the code of the site looks readable and everything seems fine.

//edit2

    b$children$html['body']$body

seems closest to the solution

To be clearer, I would like the same output as I get from readLines(), so that each line of HTML is an element of the vector.

//final edit

    b <- htmlTreeParse(url, useInternalNodes = TRUE)
    html <- b["//body"][[1]]
    html <- as(html, "character")
    vectors <- strsplit(html, "\n")

This seems to create the same result. Thanks, everyone, for your help.
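One detail about the last step: `strsplit()` returns a list with one element per input string, so the vector of lines is its first element. A minimal sketch, assuming a hypothetical `html` string standing in for the serialised body node (the string and the `"\r?\n"` pattern are illustrative choices, not taken from the real page):

```r
# Hypothetical stand-in for the serialised <body> node.
html <- "<body>\n<p>first</p>\n</body>"

# strsplit() returns a list of length 1 for a single input string.
pieces <- strsplit(html, "\r?\n")

# [[1]] unwraps it into a plain character vector, one element per line,
# which is the same shape of result that readLines() gives.
lines <- pieces[[1]]

print(lines)
```

Splitting on `"\r?\n"` rather than `"\n"` is only a precaution against pages served with Windows line endings.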

P.Belai
  • This really isn't much to go on. It would be better if you included a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). I would guess that your HTML page has a different structure than you are expecting but you haven't shown anything to confirm or deny that. – MrFlick Oct 24 '15 at 20:44
  • Please post more of your code, including a value for `URL` so that we can even attempt to run it. Or shall we guess :) – Shawn Mehan Oct 24 '15 at 20:47
  • Try `b$children$html["body"]`. – jlhoward Oct 24 '15 at 20:49
  • OK, edited it. But I think the solution to my problem doesn't depend on the site, or at least I hope so, and the page should be fine. – P.Belai Oct 24 '15 at 20:49
  • You might want `as(b$children$html[["body"]], "character")` if you want a character vector. I'm not sure why `$body` doesn't work. Also you may need to `strsplit()[[1]]` that by `"[\t\r\n]+"` – Rich Scriven Oct 24 '15 at 20:53
  • @jlhoward I tried it, but it still doesn't return a vector, just a list – P.Belai Oct 24 '15 at 20:56
  • `> b$children$html['body']` gives `$body 3 attr(,"class") [1] "XMLNodeList"` – Shawn Mehan Oct 24 '15 at 21:01
  • also, just as an aside, you can `b<-htmlTreeParse(URL, encoding = "UTF-8", asText = TRUE)` and avoid the local file, methinks. – Shawn Mehan Oct 24 '15 at 21:02
  • Not clear what you mean by "get each line of parsed html into a vector". What do you mean by each line? Each element (tag)? Each instance of a certain tag? – jlhoward Oct 24 '15 at 21:04
  • I also get very different behavior with `target <- url(url); readLines(target); readHTMLTable(url)`, where I get different output. Finally, `curl http://www.registeruz.sk/cruz-public/domain/accountingentity/show/1545622` returns "Request Rejected. The requested URL was rejected. Please consult with your administrator. Your support ID is: 17677329063826983315", which is again different. – Shawn Mehan Oct 24 '15 at 21:09
  • I am trying to get the same output that readLines() generates. That means each line of HTML as an element of a vector. I would like to get everything: tags, text, etc. – P.Belai Oct 24 '15 at 21:09
  • why aren't you using the xml/html processing capabilities of the XML pkg or xml2/rvest packages? – hrbrmstr Oct 24 '15 at 21:11
  • @hrbrmstr I am pretty new to R and scraping and all my attempts failed – P.Belai Oct 24 '15 at 21:21
  • You should post a question with what you are really trying to accomplish. There's ample evidence on SO that folks here are willing to explain scraping. – hrbrmstr Oct 24 '15 at 21:32

1 Answer

Either of these should work:

    library(XML)

    url <- "http://www.registeruz.sk/cruz-public/domain/accountingentity/show/1545622"

    b <- htmlTreeParse(url)
    class(b)
    # [1] "XMLDocumentContent"
    b$children$html["body"]

Or:

    b <- htmlTreeParse(url, useInternalNodes = TRUE)
    class(b)
    # [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"  "XMLAbstractDocument"
    b["//body"]

In the latter example, b is a parsed XML document held in internal nodes, and so can be indexed using XPath.
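To illustrate the XPath indexing, here is a minimal sketch that parses a small inline page instead of the real site, so it runs without network access. The `page` string and the `//body/p` query are hypothetical and only stand in for whatever the real page contains:

```r
library(XML)

# Hypothetical inline HTML, used in place of the real URL.
page <- "<html><body><p>first</p><p>second</p></body></html>"

# htmlParse() returns an internal document, like
# htmlTreeParse(..., useInternalNodes = TRUE).
doc <- htmlParse(page, asText = TRUE)

# Indexing the document with an XPath string returns the matching nodes.
body_nodes <- doc["//body"]

# xpathSApply() applies a function to every node matched by the query;
# xmlValue extracts the text content of each node.
paragraphs <- xpathSApply(doc, "//body/p", xmlValue)

print(paragraphs)
```

If the goal is the raw markup rather than the text, the matched node can be converted with `as(body_nodes[[1]], "character")`, as suggested in the comments on the question.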

jlhoward
  • OK, with a few changes it seems to work. I used `b["//body"][[1]]`, then converted it to character and split it into a vector. Thanks for your help – P.Belai Oct 24 '15 at 21:17
  • @P.Belai: can you please share how you save it as character and then split it into vectors? Thank you! – hoang tran Jun 19 '19 at 14:58