I have these two HTML strings:

a="<div> foo: <span>bar</span> </div>"
b="<div> foo: bar <br> </div>"

I want to find foo: bar from each string.

The way I want to do it is to find from the word 'foo' until I come across a '<' character.

I can do this with the regular expression:

foo([^(<)]+)

This only finds "foo: bar" from string b but not from string a because the <span> tag is in the way. So I want to write the regex to look from foo until it finds a < character ignoring the <span> tag.

These are just some of the strings that this has to work on, so it has to work as stated, i.e. I cannot start removing tags before or after, etc.

Basically, all I need to know is how to find all characters in a string until I come across a certain character, unless that character is followed by a set of specified characters, i.e. find until < but if < is followed by span> then look for the next <.

Does anyone know how to do this?

the Tin Man
Rick Moss
  • You should avoid using regex to parse HTML. But if you really want this I could whip up a solution for you. – Firas Dib Nov 27 '13 at 13:47
  • Basically, you should just strip down all `` tags ? – HamZa Nov 27 '13 at 13:48
  • I would first remove the tags with `.gsub(/<.+?>/, '').strip.squeeze(' ')`. – spickermann Nov 27 '13 at 13:52
  • Unfortunately, the closest I can get to a solution right now would be something like this: http://regex101.com/r/uH2sT1 - which is far from perfect. I would just avoid using regex for this problem really. – Firas Dib Nov 27 '13 at 13:54
  • @Lindrian in this case a regex can't be avoided, but I agree with you in general. Please can you give me a solution in regex? For all other help, thanks, but as stated I can solve this problem by splitting / removing just the span tag – Rick Moss Nov 27 '13 at 13:57
  • Obligatory link: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Andrew Grimm Nov 28 '13 at 22:25

2 Answers

Although using regexp to get things out of HTML is usually bad, you could solve the problem in this way:

foo, bar = string.gsub(/<.*?>/, '').strip.split

Edit: Well, then you might want to look into negative lookahead for regexp: (?!regpattern)

string[/(foo.*?)<(?!\/?span)/, 1]
# match foo followed by as few characters as possible, up to the first < that is not the start of <span or </span
# (any <span> tags in between stay in the captured text)
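
Another way to express "read from foo up to the next <, unless that < opens or closes a span" is to consume, at each step, either one character that is not < or one whole span tag. The following is only a minimal sketch for the two example strings, not a general HTML solution; the method name text_after_foo and the tag stripping afterwards are assumptions added for illustration.

```ruby
# Hypothetical helper (not part of the original answer): returns the text that
# starts at "foo" and runs up to the first "<" that is not part of a
# <span> or </span> tag, with those span tags removed afterwards.
def text_after_foo(html)
  # [^<]       consumes any single character that is not "<"
  # <\/?span>  consumes a whole <span> or </span> tag
  raw = html[/foo(?:[^<]|<\/?span>)*/]
  return nil unless raw

  # Drop the span tags that were skipped over and tidy the whitespace.
  raw.gsub(/<\/?span>/, '').strip.squeeze(' ')
end

a = "<div> foo: <span>bar</span> </div>"
b = "<div> foo: bar <br> </div>"

puts text_after_foo(a)
puts text_after_foo(b)
```

Because the two alternatives can never match at the same position, the pattern stops at the first < that is not a plain span tag. A span tag with attributes, such as <span class="x">, would not be skipped by this sketch.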
the Tin Man
hirolau
  • Thanks for the help, but I can't do this as my actual strings contain more text than just foo: bar, so I must stop on the '<' character but I need to ignore the string in the regex. Do you know how to do it this way? – Rick Moss Nov 27 '13 at 13:59
There are many, many reasons why you don't want to use regex to process HTML. Your example text is very simple, but in real-world use the HTML is likely to be much more complex and variable, which makes a regular-expression-based solution very fragile.

Instead, start with the right tool and use a parser:

require 'nokogiri'

[
  "<div> foo: <span>bar</span> </div>",
  "<div> foo: bar <br> </div>"
].each do |str|
  doc = Nokogiri::HTML::DocumentFragment.parse(str)
  puts doc.at('div').text
end

Which outputs:

 foo: bar
 foo: bar

This uses Nokogiri, which is a very capable XML/HTML parser and the de facto standard XML/HTML parser for Ruby.

the Tin Man