Parsing & replacing multiple links, but not when one contains another

Question

I can't figure out how to (easily) keep link (2) from replacing the beginning of link (1). I'd appreciate an answer in Ruby, but if you can explain the logic, that's fine too.

The output should be:

    message = "For Last Minute rentals, please go to:
    <span class='external_link' href-web='http://www.mydomain.com/thepage'>http://www.mydomain.com/thepage</span> (1)

    For more information about our events, please visit our website: 
    <span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span> (2)"

But it is:

    message = "For Last Minute rentals, please go to:
    <span class='external_link' href-web='<span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span>/thepage'><span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span>/thepage</span> (1)

    For more information about our events, please visit our website: 
    <span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span> (2)"

Here's the code (edited: took out the spans):

    require 'uri'

    message = "For Last Minute rentals, please go to:
    http://www.mydomain.com/thepage

    For more information about our events, please visit our website:
    http://www.mydomain.com"

    links_found = URI.extract(message, ['http', 'https'])

    # Wrap every extracted link in a span (this is the loop that misbehaves)
    for link_found in links_found
      message.gsub!(link_found, "<span class='external_link' href-web='#{link_found}'>#{link_found}</span>")
    end

Thoughts?

@theTinMan no it was just to make the example easier to explain — Alextoul, Apr 25 '13 at 16:46

user2276204 · Answer 1 · 2013-04-25T00:29:26.423

0

I would guess that your problem is related to URI.extract. When it goes through message, it pulls out every instance of "http", which, for the first line, would be both the "http" inside the <span> tag and the one outside it.

To clarify further, links_found would be an array with both <span...href-web:... and http...</span>. Since you're passing only link_found to gsub as the pattern to match, it will replace every occurrence of each entry in the links_found array.

edited Apr 25 '13 at 00:29

answered Apr 25 '13 at 00:24

user2276204


Thanks! It definitely helps understanding the WHY but I can't think of a solution right now... – Alextoul Apr 25 '13 at 00:39
Instead of passing `link_found` as the pattern to match, I would either be more specific when passing to `URI.extract`, or add to `link_found` when you pass it to gsub. As an example (double quotes, so that the interpolation works): `message.gsub("href-web='#{link_found}'", ...)` – user2276204 Apr 25 '13 at 00:49
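
A minimal sketch of one way to make the pattern more specific, assuming the plain-text `message` from the edited question (no spans yet). Instead of running one `gsub!` per link, it builds a single alternation from the extracted links, longest first, and substitutes in one pass, so markup that has just been inserted is never scanned again. The names `pattern` and `wrapped` are only illustrative, and `URI.extract` is marked obsolete in newer Ruby versions, so treat this as a sketch rather than a definitive fix.

```ruby
require 'uri'

message = "For Last Minute rentals, please go to:
http://www.mydomain.com/thepage

For more information about our events, please visit our website:
http://www.mydomain.com"

# Same extraction as in the question; uniq avoids duplicate alternatives.
links_found = URI.extract(message, %w[http https]).uniq

# Longest link first, so ".../thepage" is tried before its own prefix.
pattern = Regexp.union(links_found.sort_by { |link| -link.length })

# One pass over the original text: replaced regions are not rescanned.
wrapped = message.gsub(pattern) do |link|
  "<span class='external_link' href-web='#{link}'>#{link}</span>"
end

puts wrapped
```

The ordering matters because a regexp alternation takes the first alternative that matches at a given position; without the sort, the shorter link could still win at the start of the longer one.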

score 0 · Answer 2 · edited May 23 '17 at 10:25

First, rule one, don't bother with string manipulation or regular expressions for anything but the most trivial things when dealing with HTML or XML. Doing otherwise is a sure recipe for madness.

Instead, save your sanity and go for a real parser. For Ruby I strongly suggest you look at Nokogiri only - it just works.

Consider this code:

    require 'nokogiri'

    message = "For Last Minute rentals, please go to:
    <span class='external_link' href-web='http://www.mydomain.com/thepage'>http://www.mydomain.com/thepage</span> (1)

    For more information about our events, please visit our website:
    <span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span> (2)"

    doc = Nokogiri::HTML(message)

    external_spans = doc.search('span.external_link')

    url1 = external_spans[0]['href-web'] # => "http://www.mydomain.com/thepage"
    text1 = external_spans[0].text       # => "http://www.mydomain.com/thepage"
    url2 = external_spans[1]['href-web'] # => "http://www.mydomain.com"
    text2 = external_spans[1].text       # => "http://www.mydomain.com"

url1 and text1 are the URL and text from span 1, and url2 and text2 are the ones from span 2.

I'm not sure what you want to do with them, because, after a more-than-cursory glance I couldn't see a difference in your source and desired output, but, once you have them you're pretty much free to do anything. A parser, like Nokogiri, lets you retrieve information from the HTML or XML DOM, replace it, move things around, or even splice in new stuff.

Thanks! I will try playing around with Nokogiri tomorrow. I actually edited the 'message' variable I have as an input. It doesn't have the spans initially. Sorry! – Alextoul Apr 25 '13 at 09:43
Ok, with that edit it becomes understandable. The problem doesn't require Nokogiri though it can help. I'll add to my answer when I get near a computer. Basically though, `URI.extract` is a good starting tool but it doesn't give us enough positional information about the URLs it finds, and without that you can't pinpoint where a substitution should occur. — the Tin Man, Apr 25 '13 at 12:36
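
To illustrate the point about positions, here is a rough sketch, assuming the plain-text `message` from the edited question. It records the character offsets of each URL first and then substitutes at exactly those offsets, working from the last match backwards. The pattern comes from the standard library's `URI::RFC2396_Parser#make_regexp`, which, as far as I know, is what `URI.extract` scans with internally. The names `url_pattern`, `offsets` and `wrapped` are made up for the example.

```ruby
require 'uri'

message = "For Last Minute rentals, please go to:
http://www.mydomain.com/thepage

For more information about our events, please visit our website:
http://www.mydomain.com"

# Pattern for absolute http/https URLs (RFC 2396 parser from the stdlib).
url_pattern = URI::RFC2396_Parser.new.make_regexp(%w[http https])

# Record the [start, end) character offsets of every URL in the original text.
offsets = []
message.scan(url_pattern) { offsets << Regexp.last_match.offset(0) }

# Substitute from the last match backwards so the earlier offsets stay valid.
wrapped = message.dup
offsets.reverse_each do |start, finish|
  link = wrapped[start...finish]
  wrapped[start...finish] = "<span class='external_link' href-web='#{link}'>#{link}</span>"
end

puts wrapped
```

Because each substitution targets a specific range rather than "every occurrence of this string", the shorter URL can no longer be replaced inside the longer one.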
