How to get the text after a tag that contains specific text

Question

For example:

<p>
<b>Member Since:</b> Aug. 07, 2010<br><b>Time Played:</b> <span class="text_tooltip" title="Actual Time: 15.09:37:06">16 days</span><br><b>Last Game:</b>
<span class="text_tooltip" title="07/16/2011 23:41">1 minute ago</span>
<br><b>Wins:</b> 1,017<br><b>Losses / Quits:</b> 883 / 247<br><b>Frags / Deaths:</b> 26,955 / 42,553<br><b>Hits / Shots:</b> 690,695 / 4,229,566<br><b>Accuracy:</b> 16%<br>
</p>

I want to get 1,017. It is the text that follows the tag containing the text Wins:.
With a regex it would be [/<b>Wins:<\/b> ([^<]+)/,1], but how can I do it with Nokogiri and XPath? Or would it be better to parse this part of the page with a regex?
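
For reference, the regex approach mentioned above can be tried on its own with plain Ruby. This is only a sketch: the `html` string is a trimmed copy of the sample markup, and the variable names are illustrative.

```ruby
# Trimmed copy of the markup from the question (assumption: the real page
# keeps the same "<b>Wins:</b> value<br>" layout).
html = '<br><b>Wins:</b> 1,017<br><b>Losses / Quits:</b> 883 / 247<br>'

# String#[] with a regex and a capture index returns that capture group,
# or nil when the pattern does not match.
wins = html[/<b>Wins:<\/b> ([^<]+)/, 1]
puts wins
```

The pattern depends on the exact spacing and tag layout, so a change such as an extra attribute on the `<b>` tag would make it return nil.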

Regex is fine when the task is extremely simple and/or when you control the generation of the HTML or XML. When the generation leaves your control it becomes riskier, because the file can change unexpectedly, leading to more complicated regex and/or supporting code. A parser tends to keep that from occurring, making long-term support easier. From my own experience of cleaning up and maintaining other people's code, I have been able to drastically reduce regex-based code by switching to a good parser while also simplifying it, both very desirable in production environments. – the Tin Man, Jul 17 '11 at 22:42
While it is possible to write a sophisticated regex to handle more situations, it also becomes more of a development and maintenance task, which leads to entropy setting in. It is important to remember that though something can be done using a particular tool, it might be better done using another. That is often the case with regex: it's sexy and macho to use, but those aren't good reasons to pick it. Instead, use regex when it is clearly the shorter and simpler path to the desired result, weighing in the need for long-term support. – the Tin Man, Jul 17 '11 at 22:55
@the Tin Man, next time I write a question about parsing, I'll add *pleeease don't start a holy war; SO is full of them, and we don't need more copy-pasted, pointless debates* to prevent it in the answers. But thanks for your thoughts anyway. – Nakilon, Jul 18 '11 at 10:05

akuhn · Accepted Answer · 2011-07-23T21:42:37.513

3

Here

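require 'nokogiri'
doc = Nokogiri::HTML(html) # html: the page source as a String
# Find the <b> whose text is "Wins:"; its next sibling is the text node " 1,017".
puts doc.at('b[text()="Wins:"]').next.text.strip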

edited Jul 23 '11 at 21:42

answered Jul 17 '11 at 07:52

akuhn


Add a trailing `.text` to your `next` and this would be my recommended answer. – the Tin Man Jul 17 '11 at 23:00

score 1 · Answer 2 · edited May 23 '17 at 11:53


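You can use this XPath: //*[*/text() = 'Wins:']/text()

It selects the text-node children of whichever element contains the <b>Wins:</b> tag. In the sample markup that element is the <p>, so the result includes 1,017 along with the other values. To select only 1,017, narrow the expression to //b[text() = 'Wins:']/following-sibling::text()[1].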

About regex: RegEx match open tags except XHTML self-contained tags

edited May 23 '17 at 11:53

Community


answered Jul 17 '11 at 06:40

Kirill Polishchuk


You are not right about regexes. Mention about regexes are not suitable for XML is outdated. Read about recursive regexes for more info. – Nakilon Jul 17 '11 at 07:58
@Nakilon, What is "for XML is outdated"? – Kirill Polishchuk Jul 17 '11 at 08:17
"regexes are not suitable for XML" is outdated. – Nakilon Jul 17 '11 at 09:16

Emiliano Poggi · Answer 3 · 2011-07-17T14:35:36.487

1

I would use pure XPath like:

"//b[.='Wins:']/following::node()[1]"

I've heard thousands of times (and from gurus) "never use regex to parse XML". Can you provide some "shocking" reference demonstrating that this sentence is no longer valid?

edited Jul 17 '11 at 14:35

answered Jul 17 '11 at 11:46

Emiliano Poggi


I've heard thousands of times (and from gurus) *"if regexes are enough and are the easiest solution, use them"*. Can you provide some "shocking" reference demonstrating that I can't use regex in, for example, my current task from the Question? – Nakilon Jul 17 '11 at 11:53
That's a general suggestion, and in your specific case you are right: you can probably stay with regex without worrying too much. However, I think XPath becomes indispensable when you have more complex node selections. – Emiliano Poggi Jul 17 '11 at 12:00
Another consideration: if you are thinking of using Nokogiri just for this small task, you should indeed use regex. If you are already using Nokogiri in your application, or if your selection will grow in complexity, you should definitely use XPath and CSS selectors. – Emiliano Poggi Jul 17 '11 at 12:04
I totally agree with you. Each tool is better suited to its own tasks. That's the right answer to this holy war. In my current task I'm going to use XPath because the page has a lot of data to extract, not only one number. – Nakilon Jul 17 '11 at 14:01
Be careful: your HTML is not well-formed because of the unclosed `br` tags. And you may be able to fix those using regex ;-) – Emiliano Poggi Jul 17 '11 at 14:36
@empo: It's valid HTML (but not XML). And there's no need to use regexps to fix those, a regular string replacement is enough. – You Jul 23 '11 at 21:46
@You I was referring to any possible unclosed HTML tag. For tags like `img`, _find and replace_ will not be enough. – Emiliano Poggi Jul 25 '11 at 08:41

score 0 · Answer 4 · answered Jul 17 '11 at 14:53

Use:

//*[. = 'Wins:']/following-sibling::node()[1]

In case this is ambiguous (selects more than one node), stricter expressions can be specified:

//*[. = 'Wins:']/following-sibling::node()[self::text()][1]

Or:

(//*[. = 'Wins:'])[1]/following-sibling::node()[1]

Or:

(//*[. = 'Wins:'])[1]/following-sibling::node()[self::text()][1]
