I have a string in Rails that contains HTML. For example,

<p>01/28/2016 Green RED Horse!!123 456</p>
<a href="http://greenredhorse.com" style="margin-left:283px;margin-
top:50px;margin-bottom:150px;overflow:auto;position:absolute;">
<img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png" 
style="width:266px" /> </a>
<p>01/28/2017 RED Horse!!123 456</p>

How would I go about removing the link tag and everything between its beginning and end from the string?

The end result should look like this.

<p>01/28/2016 Green RED Horse!!123 456</p>
<p>01/28/2017 RED Horse!!123 456</p>

In short: how can I delete everything from <a through </a>, inclusive, without changing the rest of the string?

the Tin Man
WhyEnBe
  • The tick character in "Logo"` - is it really there, or is it a typo? – Wand Maker Jan 28 '16 at 16:26
  • Thanks, that was just a typo. – WhyEnBe Jan 28 '16 at 16:28
  • You should use a parser when dealing with XML/HTML rather than regular expressions. Parsers are very robust and regex are very fragile when dealing with tags, especially when you don't own and control the generation of that data. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags, which covers the issues nicely. – the Tin Man Jan 28 '16 at 18:12

3 Answers

Update: Better regex than the older one below.

string = <<HTML
<a-tag atr="attr">hi<a>atag</a></a-tag>
<a sdf="</a>"> hola</ a>
HTML
pattern = /<a(?:\s*>|\s+(?:(?:[^=\s]*?(?:=(?:(?:"[^"]*?")|(?:'[^']*?')))?)\s*)*>).*?<\/\s*a>/mi

string.gsub!(pattern, '')
puts string #=> <a-tag atr="attr">hi</a-tag>

Older answer

Something like this, assuming that html is the string you want to parse:

html.gsub!(/<a\s?.+?a>/m, '')

You can use this if you have small sets of data similar to the one you posted. If you want a more robust and bug-free solution, use Nokogiri; see the Tin Man's answer.
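For the small, predictable input described in the question, a middle ground between the two patterns above might look like the sketch below. The helper name strip_anchors and the constant ANCHOR are hypothetical, and the pattern is a simplified variant rather than either regex from this answer. It assumes no attribute value contains a ">" character; the longer pattern above handles that case and this one does not.

```ruby
# Hypothetical constant and helper, for illustration only.
# "<a" must be followed by whitespace or ">", so "<a-tag>" or "<abbr>" is not matched.
# Assumption: attribute values never contain ">".
ANCHOR = %r{<a(?=[\s>])[^>]*>.*?</a\s*>\n?}mi

def strip_anchors(html)
  # Non-destructive gsub: returns a new string and leaves the input untouched.
  html.gsub(ANCHOR, '')
end

html = <<~HTML
  <p>01/28/2016 Green RED Horse!!123 456</p>
  <a href="http://greenredhorse.com" style="position:absolute;">
  <img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png" style="width:266px" /> </a>
  <p>01/28/2017 RED Horse!!123 456</p>
HTML

result = strip_anchors(html)
puts result
```

The trailing \n? in the pattern also removes the line break left behind by the link, so the two paragraphs end up on adjacent lines.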

mtkcs

I wouldn't use regex. Regular expressions might work, but the odds of them breaking when the HTML layout changes are very high.

Instead I'd use:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>01/28/2016 Green RED Horse!!123 456</p>
<a href="http://greenredhorse.com" style="margin-left:283px;margin-
top:50px;margin-bottom:150px;overflow:auto;position:absolute;">
<img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png" 
style="width:266px" /> </a>
<p>01/28/2017 RED Horse!!123 456</p>
EOT

doc.at('a').remove

puts doc.to_html
# >> <p>01/28/2016 Green RED Horse!!123 456</p>
# >> 
# >> <p>01/28/2017 RED Horse!!123 456</p>

This uses at, which means "find the first occurrence of the desired selector"; 'a' is a CSS selector that matches anchor elements.

Nokogiri is the de facto standard for HTML/XML parsing in Ruby. If you're going to be doing regular work with XML/HTML, it is well worth learning to use it.

the Tin Man
  • Thanks, I am going to look into this. In my specific case I am dealing with a small data set, and the tag format will almost always be the same. – WhyEnBe Jan 28 '16 at 18:21
  • "almost always" is enough to break a regex and your code. – the Tin Man Jan 28 '16 at 18:27
  • @theTinMan I have seen you make edits to my answers to fix grammatical errors and add clarity. You are like WALL-E of StackOverflow :-) – Wand Maker Jan 28 '16 at 18:59

You could use XPath to look up elements of interest.

require 'rexml/document'
include REXML

snippet = <<-eos
<p>01/28/2016 Green RED Horse!!123 456</p>
<a href="http://greenredhorse.com" style="margin-left:283px;margin-
top:50px;margin-bottom:150px;overflow:auto;position:absolute;">
<img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png" 
style="width:266px" /> </a>
<p>01/28/2017 RED Horse!!123 456</p>
eos

well_formed_snippet = "<html>#{snippet}</html>"

xmldoc = Document.new(well_formed_snippet)
p XPath.match(xmldoc, "//p").map(&:to_s)
#=> ["<p>01/28/2016 Green RED Horse!!123 456</p>", "<p>01/28/2017 RED Horse!!123 456</p>"]
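The code above selects the p elements rather than deleting the link. If the goal is to remove the a elements themselves, the same XPath approach could be sketched as follows. This assumes, as above, that the fragment is well-formed XML once wrapped in a root element, and that rexml is available (it ships as a bundled gem in recent Ruby versions). The sample is shortened for illustration.

```ruby
require 'rexml/document'

snippet = <<~XML
  <p>01/28/2016 Green RED Horse!!123 456</p>
  <a href="http://greenredhorse.com"><img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png" /> </a>
  <p>01/28/2017 RED Horse!!123 456</p>
XML

# REXML needs a single root element, so wrap the fragment first.
xmldoc = REXML::Document.new("<html>#{snippet}</html>")

# Collect every a element, then detach each one from its parent.
REXML::XPath.match(xmldoc, '//a').each(&:remove)

# Serialize only the remaining element children of the wrapper,
# which skips the whitespace-only text nodes between them.
result = xmldoc.root.elements.to_a.map(&:to_s).join("\n")
puts result
```

Note that REXML is a strict XML parser, so real-world HTML with unclosed tags such as a bare img would raise a parse error here.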
Wand Maker