
I'm trying to scrape HTML with Nokogiri. This is the HTML source:

<span id="J_WlAreaInfo" class="wl-areacon">
    <span id="J-From">山东济南</span>
    至
    <span id="J-To">
        <span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
            全国
            <s></s>
        </span>
    </span>
</span> 

I need to get the following text: 山东济南

I checked the shortest XPath with Firebug:

//*[@id="J-From"]

Here is my Ruby code:

doc = Nokogiri::HTML(open("http://foo.html"), "UTF-8")
area = doc.xpath('//*[@id="J-From"]')
puts area.text

However, it returns nothing. What am I doing wrong?

Zoru
  • Maybe you can give us a link to the page? Also, can you look at the page source as originally served? Possibly the page creates the element in JavaScript after it loads, and Nokogiri doesn't see such things. – LarsH Jun 07 '15 at 03:57
  • You may want to look at the top-voted answer here regarding open-uri: http://stackoverflow.com/questions/2572396/nokogiri-open-uri-and-unicode-characters – jvnill Jun 08 '15 at 08:54
  • Thank you guys a thousand times, it is a JS problem. – Zoru Jun 10 '15 at 11:25

1 Answer


"However, it returns nothing. What am I doing wrong?"

xpath() returns a Nokogiri::XML::NodeSet, an array-like collection of every matching node, so iterate over it to read each match:

require 'nokogiri'


html = %q{
<span id="J_WlAreaInfo" class="wl-areacon">
    <span id="J-From">山东济南</span>
    至
    <span id="J-To">
        <span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
            全国
            <s></s>
        </span>
    </span>
</span> 
}

doc = Nokogiri::HTML(html)
target_tags = doc.xpath('//*[@id="J-From"]')

target_tags.each do |target_tag|
  puts target_tag.text
end

--output:--
山东济南

Edit: You can call text() directly on the NodeSet, but it returns the text of every match concatenated into one string, which I have never found useful. Because there is only one match here, you should still have gotten 山东济南. Nothing in your post indicates why you didn't get that result.

If you only want a single result from your xpath, i.e. the first match, then you can use at_xpath():

target_tag = doc.at_xpath('//*[@id="J-From"]')
puts target_tag.text

--output:--
山东济南
7stud
  • I can reproduce the same issue as the OP. I'm also getting a blank string, so it may be a local machine issue. What's weird is that Nokogiri can get the node, but its text is a blank string. – jvnill Jun 08 '15 at 08:55
  • Thank you for the answers, the issue was with JavaScript. Nokogiri returned nothing because there was nothing there. – Zoru Jun 10 '15 at 11:25