1

I'm using:

<ul class="ont-bd-phone">[\s\S]+<li>[\s\S]+T:&nbsp;([^$]+?)[\s\S]+<\/li>[\s\S]+<\/ul>

to pick up 020 3514 0019 from:

 <ul class="ont-bd-phone">


          <li>



                T:&nbsp;020 3514 0019


          </li>



    </ul>

But the only match group being returned is '0' instead of the whole number. I'm not sure how to capture up to the end of the line other than by using $. As a newcomer, how can I deal with HTML that has a lot of whitespace/newlines?

I'm using Rubular to check my work.

the Tin Man
exlo
  • Sorry to repeat that, but use an HTML parser to extract the `li` text content first (Nokogiri: http://www.nokogiri.org/tutorials/parsing_an_html_xml_document.html), and then deal with this text content (with a regex if you want). – Casimir et Hippolyte Sep 29 '15 at 19:30
  • While it might seem like a smart thing to do to use a regular expression, experience, a lot of it, has taught us that patterns are very seldom the right path. Instead, they generally end up with wailing and gnashing of teeth, followed by frustrated attempts to patch the pattern, followed by rewriting the code multiple times until you give up. The simple, easy, go-with-the-flow approach is to use a parser. It truly is that much easier. See http://stackoverflow.com/q/1732348/128421. – the Tin Man Sep 29 '15 at 23:01

2 Answers

6

Definitely use something that can read HTML/XML before you start throwing regexes around. It's trivial to find the content in those list items using something like Nokogiri. After that, the regex (if you even really need it) is easy.

To get that text, something like this will work:

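require 'nokogiri'

# Sample markup from the question; replace with however you are getting the page content.
page = '<ul class="ont-bd-phone"><li>T:&nbsp;020 3514 0019</li></ul>'
doc = Nokogiri::HTML(page)
# css returns every matching <li>; text joins their text content.
li = doc.css('ul.ont-bd-phone li')
text = li.text.strip
# => "T: 020 3514 0019" (the gap after "T:" is the non-breaking space decoded from &nbsp;)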

If there are multiple list items you are looking for, you can map/each over them to get everything out. Nokogiri's documentation is great and covers a lot of uses.
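For the regex step that comes after the parser, here is a minimal sketch in plain Ruby. The `texts` array is hypothetical stand-in data for what mapping over the list items might give you, and the second entry is made up for illustration. It also assumes the `&nbsp;` has already been decoded to a non-breaking space (U+00A0), which neither `strip` nor `\s` treats as whitespace, so the pattern targets the digits directly instead.

```ruby
# Hypothetical stand-in for the text of each matching <li>,
# e.g. roughly what doc.css('ul.ont-bd-phone li').map(&:text) could return.
texts = [
  "\n\n    T:\u00A0020 3514 0019\n\n  ",
  "\n    F:\u00A0020 3514 0020\n" # made-up second item
]

# A run that starts and ends with a digit, with digits or spaces in between.
phone_pattern = /\d[\d ]*\d/

# String#[] with a regexp returns the matched portion (or nil if nothing matches).
numbers = texts.map { |t| t[phone_pattern] }
p numbers
```

Because the pattern begins and ends on a digit, the surrounding newlines and indentation never end up in the result, so no extra `strip` is needed.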

Nick Veys
-1

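The lazy +? is satisfied with a single character, which is why only '0' comes back. Either remove the ? from the group, giving ([^$]+), and strip the trailing whitespace it also picks up, or just write (.*), which stops at the end of the line: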

<ul class="ont-bd-phone">[\s\S]+<li>[\s\S]+T:&nbsp;(.*)[\s\S]+<\/li>[\s\S]+<\/ul>
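To see how the three variants of the capture group behave, here is a small sketch in plain Ruby run against the markup from the question. The variable names are only for illustration. Note that inside a character class `$` is a literal dollar sign, so `[^$]` means "anything except `$`", newlines included.

```ruby
# The markup from the question, whitespace and all.
html = <<~HTML
  <ul class="ont-bd-phone">


        <li>



              T:&nbsp;020 3514 0019


        </li>



  </ul>
HTML

# Shared parts of the pattern; only the capture group changes.
prefix = '<ul class="ont-bd-phone">[\s\S]+<li>[\s\S]+T:&nbsp;'
suffix = '[\s\S]+</li>[\s\S]+</ul>'

lazy   = Regexp.new(prefix + '([^$]+?)' + suffix) # original pattern
greedy = Regexp.new(prefix + '([^$]+)' + suffix)  # with the ? removed
dot    = Regexp.new(prefix + '(.*)' + suffix)     # . does not match newlines by default

lazy_capture   = html[lazy, 1]
greedy_capture = html[greedy, 1]
dot_capture    = html[dot, 1]

p lazy_capture         # => "0"
p greedy_capture.strip # => "020 3514 0019" (the raw capture runs on into the whitespace before </li>)
p dot_capture          # => "020 3514 0019"
```

The lazy group gives up after one character because the following `[\s\S]+` is happy to consume the rest. The greedy group overshoots into the newlines and has to be stripped, while `(.*)` stops at the line break on its own.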
Abdoo Dev
  • While this targets the OP's question, the OP should be asking what the best way to process HTML/XML is, which would totally remove the need for such a regular expression. Part of answering questions is to advise when a particular programming choice isn't good. – the Tin Man Sep 29 '15 at 20:22