
Using Ruby: ruby 1.9.3dev (2011-09-23 revision 33323) [i686-linux]

I have the following string:

str = 'Message relates to activity <a href="/activities/35">TU4 Sep 5 Activity 1</a> <img src="/images/layout/placeholder.png" width="222" height="149"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1.'

I want to match the following:

35 (the number at the end of the href attribute value)
TU4 Sep 5 Activity 1 (the link text of the <a> tag)
First question from Manager on TU4 Sep 5 Activity 1. (the text remaining after the last <br/><br/> tags)

To achieve this, I wrote the following regex:

result = str.match(/<a href="\/activities\/(?<activity_id>\d+)">(?<activity_title>.*)<\/a>.*<br\/><br\/>(?<message>.*)/)

This produces the following result:

#<MatchData "<a href=\"/activities/35\">TU4 Sep 5 Activity 1</a> <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1." 
         activity_id:"35" 
         activity_title:"TU4 Sep 5 Activity 1" 
         message:"First question from Manager on TU4 Sep 5 Activity 1.">

But I suspect this is not efficient. Is it possible to have only the required values (the three listed above under what I want to match) returned in the match result, with the following full-match value excluded?

"<a href=\"/activities/35\">TU4 Sep 5 Activity 1</a> <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1."

Thanks,

Jignesh

Jignesh Gohel
    You should never attempt to parse HTML with regular expressions. This is almost guaranteed to fail. Use a proper (XML) parser instead. See also http://stackoverflow.com/a/1732454/421705 – Holger Just Sep 10 '12 at 07:28

1 Answer

The appropriate way to do this is NOT to use regexen. Instead, use the Nokogiri library to parse your HTML:

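require 'nokogiri'

# Parse the fragment; Nokogiri wraps it in <html><body> automatically.
doc = Nokogiri::HTML.parse(str)

# First <a> whose href starts with "/activities/".
link = doc.at_css('a[href^="/activities/"]')
activity_id = link['href'][/\d+$/]          # trailing digits of the href, "35" here
activity_title = link.text                  # link text, "TU4 Sep 5 Activity 1" here

# Last text node in the document, i.e. the text after the final <br/> tags.
# .text converts the Nokogiri text node into a plain String.
message = doc.search('//text()').last.text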
This will do exactly what your regexp was attempting, with much lower chance of random failure.
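
On the literal question of leaving the full match out of the result: a MatchData object always carries the whole matched string as element 0, so it cannot be suppressed, but the named groups can be pulled out on their own. The following is a minimal sketch, assuming the same str as in the question. The variable names pattern and captures are only illustrative, and the title group is made non-greedy (.*?) so that it stops at the first </a>, which is a small change from the original regex.

```ruby
# Same sample string as in the question.
str = 'Message relates to activity <a href="/activities/35">TU4 Sep 5 Activity 1</a> <img src="/images/layout/placeholder.png" width="222" height="149"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1.'

# Non-greedy title group so it stops at the first </a>.
pattern = /<a href="\/activities\/(?<activity_id>\d+)">(?<activity_title>.*?)<\/a>.*<br\/><br\/>(?<message>.*)/

result = str.match(pattern)

# MatchData#names and MatchData#captures both leave out the full match
# (result[0]), so zipping them yields a hash of just the named groups.
# If nothing matched, result is nil and we fall back to an empty hash.
captures = result ? Hash[result.names.zip(result.captures)] : {}

puts captures["activity_id"]
puts captures["activity_title"]
puts captures["message"]
```

Individual groups are also reachable directly as result[:activity_id], result[:activity_title] and result[:message]. This only changes how the values are read out; it does not make the regex any more robust against changes in the HTML.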

marcus erronius