How can I read a specific line or lines from HTML in R?

I have an "HTMLInternalDocument" object as a result of the following code:

library(XML)  # provides htmlTreeParse
url <- myURL
html <- htmlTreeParse(url, useInternalNodes = TRUE)

Now I need to get specific lines from this HTML object as text, for example to count the number of characters in each line.

How can I do that in R?

MrFlick
  • This question is so generic that it's impossible to answer precisely. What lines do you want to extract? How can you identify them in the HTML source? You really should include sample data and desired output. See [how to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more tips on how to make it possible for us to help you. – MrFlick Jul 27 '14 at 19:28
  • Once you have parsed the document, it's a parse tree, so there are no lines. Read it in using `readLines` if you want it as lines. – G. Grothendieck Jul 27 '14 at 21:37

2 Answers

Seeing that you are using the XML library, you will need to use one of the library's XPath query functions, such as getNodeSet or xpathApply. This requires some knowledge of XPath, which these functions use to query the HTMLInternalDocument. You can learn more by running ?xpathApply
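As a rough illustration of the XPath approach, here is a minimal sketch. It assumes the text of interest sits in <p> elements, which the question never states; the inline HTML string, the "//p" expression, and the variable names are all hypothetical stand-ins for the real page.

```r
library(XML)

# Hypothetical inline HTML standing in for the page at myURL
page <- "<html><body><p>First paragraph</p><p>Second one</p></body></html>"
doc <- htmlParse(page, asText = TRUE)

# Pull the text content of every <p> node into a character vector
# ("//p" is an assumed XPath; adjust it to match the nodes you need)
paragraphs <- xpathSApply(doc, "//p", xmlValue)

# Count the characters in each extracted string
char_counts <- nchar(paragraphs)
print(data.frame(text = paragraphs, chars = char_counts))
```

Note that this counts characters per node, not per line of the source file, since the parsed tree no longer has lines.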

Evan Kaminsky

Using the XML library over-complicates the problem. As Grothendieck pointed out, readLines, a base function, will do the job. Something like this:

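x <- 10                 ## or any other index (or vector of indices) you want to subset on
html <- readLines(url)  ## url must be a character string; each element is one line of the page source
html[x]                 ## the x-th line as text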
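To get the character counts the question asks for, nchar can be applied to the selected lines. The sketch below uses a small temporary file in place of the real URL so that it runs offline; the file contents and the indices in `wanted` are made up for illustration.

```r
# Hypothetical sample file standing in for the remote page
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>", "<p>Hello</p>", "</body>", "</html>"), tmp)

# Read the raw source, one element per line
html_lines <- readLines(tmp)

# Pick the lines of interest by index (assumed indices, for illustration)
wanted <- c(1, 3)
selected <- html_lines[wanted]

# Number of characters in each selected line
counts <- nchar(selected)
print(data.frame(line = wanted, text = selected, chars = counts))
```

With a real page, replacing `tmp` with the URL string should behave the same way, provided the connection can be opened.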
Conner M.