1

For example:

<p>
<b>Member Since:</b> Aug. 07, 2010<br><b>Time Played:</b> <span class="text_tooltip" title="Actual Time: 15.09:37:06">16 days</span><br><b>Last Game:</b>
<span class="text_tooltip" title="07/16/2011 23:41">1 minute ago</span>
<br><b>Wins:</b> 1,017<br><b>Losses / Quits:</b> 883 / 247<br><b>Frags / Deaths:</b> 26,955 / 42,553<br><b>Hits / Shots:</b> 690,695 / 4,229,566<br><b>Accuracy:</b> 16%<br>
</p>

I want to get 1,017. It is the text that follows the <b> tag containing the text Wins:.
If I used regex, it would be [/<b>Wins:<\/b> ([^<]+)/,1], but how do I do it with Nokogiri and XPath? Or would it be better to parse this part of the page with regex?
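
For reference, the regex approach mentioned in the question can be sketched in plain Ruby as follows. The `html` string here is only a trimmed copy of the sample markup, and the variable names `wins_text` and `wins` are illustrative, not part of the original code.

```ruby
# Trimmed copy of the sample markup from the question.
html = '<br><b>Wins:</b> 1,017<br><b>Losses / Quits:</b> 883 / 247<br>'

# String#[] with a regex and a capture index returns that capture group,
# or nil when the pattern does not match.
wins_text = html[/<b>Wins:<\/b> ([^<]+)/, 1]

# Drop the thousands separator before converting to an Integer.
wins = wins_text.delete(',').to_i

puts wins_text  # the captured string
puts wins       # the numeric value
```

This works only as long as the markup keeps exactly this shape (a single space after the closing tag, no attributes on the `<b>` tag).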

Kirill Polishchuk
Nakilon
  • Regex is fine when the task is extremely simple, and/or, when you control the generation of the HTML or XML. When the generation leaves your control it becomes more risky, because the file can change unexpectedly, leading to more complicated regex and/or supporting code. A parser tends to keep that from occurring, making the long term support an easier task. From my own experience, having to clean and maintain other people's code, I have been able to drastically reduce regex-based code by switching to a good parser, while simplifying it, both very desirable in production environments. – the Tin Man Jul 17 '11 at 22:42
  • While it is possible to write a sophisticated regex to handle more situations, it also becomes more of a development and maintenance task, which leads to entropy setting in. It is important to remember that though something can be done using a particular tool, it might be better done using another. That is often the case with regex; it's sexy and macho to use, but those aren't good reasons to pick it. Instead, use regex when it is clearly the shorter and simpler path to the desired result, weighing in the need for long-term support. – the Tin Man Jul 17 '11 at 22:55
  • @the Tin Man, next time I write a question about parsing, I'll add *pleeease don't start holywar, SO is full of it, we don't need more copypaste of emptysense debates* to prevent it in answers. But anyway thanks for your thoughts. – Nakilon Jul 18 '11 at 10:05
  • "Empty sense"? "Holy war"? Curious choices in words. – the Tin Man Jul 18 '11 at 22:30

4 Answers

3

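Here:

require 'nokogiri'

doc = Nokogiri::HTML(html)  # html is the page source as a String
# Find the <b> whose text is "Wins:", then take the text node right after it.
puts doc.at('b[text()="Wins:"]').next.text.strip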
akuhn
1

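You can use this XPath: //*[*/text() = 'Wins:']/text() Note that it returns every text node that is a direct child of the enclosing <p>, with 1,017 among them. To get only that value, narrow it to //b[text() = 'Wins:']/following-sibling::text()[1]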

About regex: RegEx match open tags except XHTML self-contained tags

Community
Kirill Polishchuk
1

I would use pure XPath like:

"//b[.='Wins:']/following::node()[1]"

I've heard thousands of times (and from gurus) "never use regex to parse XML". Can you provide some "shocking" reference demonstrating that this sentence is no longer valid?

Emiliano Poggi
  • I've heard thousands of times (and from gurus) *"if regexes are enough and are the easiest solution, use them"*. Can you provide some "shocking" reference demonstrating that I can't use regex in, for example, my current task from the question? – Nakilon Jul 17 '11 at 11:53
  • 1
    That's a general suggestion, and in your specific case you are right: you can probably stay with regex without worrying too much. However, I think XPath becomes indispensable when you have more complex node selections. – Emiliano Poggi Jul 17 '11 at 12:00
  • 1
    Another consideration: if you are thinking of using Nokogiri just for this small task, you should indeed use regex. If you are already using Nokogiri in your application, or if your selection will grow in complexity, you should definitely exploit XPath and CSS selectors. – Emiliano Poggi Jul 17 '11 at 12:04
  • I totally agree with you. Each tool is better suited to its own tasks. That's the right answer to this holy war. And in my current task I'm going to use XPath because the page will have a lot of data to get, not only one number. – Nakilon Jul 17 '11 at 14:01
  • Be careful, your HTML is not well formed because of the unclosed `br` tags. And maybe you will be able to fix those using regex ;-) – Emiliano Poggi Jul 17 '11 at 14:36
  • @empo: It's valid HTML (but not XML). And there's no need to use regexps to fix those, a regular string replacement is enough. – You Jul 23 '11 at 21:46
  • @You I was referring to any possible HTML unclosed tag. Like `img`, _find and replace_ will not be enough. – Emiliano Poggi Jul 25 '11 at 08:41
0

Use:

//*[. = 'Wins:']/following-sibling::node()[1]

In case this is ambiguous (selects more than one node), more strict expressions can be specified:

//*[. = 'Wins:']/following-sibling::node()[self::text()][1]

Or:

(//*[. = 'Wins:'])[1]/following-sibling::node()[1]

Or:

(//*[. = 'Wins:'])[1]/following-sibling::node()[self::text()][1]
Dimitre Novatchev