regular expression to find until characters?

Question

I have these two HTML strings:

a="<div> foo: <span>bar</span> </div>"
b="<div> foo: bar <br> </div>"

I want to find foo: bar from each string.

The way I want to do it is to find from the word 'foo' until I come across a '<' character.

I can do this with the regular expression:

foo([^(<)]+)

This finds "foo: bar" in string b but not in string a, because the <span> tag is in the way. So I want the regex to match from foo until it finds a < character, ignoring the <span> tag.

These are just some of the strings this has to work on, so it has to work as described, i.e. I cannot start by removing tags before or after the match.

Basically, all I need to know is how to find all characters in a string until I come across a certain character, unless that character is followed by a set of specified characters, i.e. find until < but if < is followed by span> then look for the next <.

Does anyone know how to do this?

You should avoid using regex to parse HTML. But if you really want this I could whip up a solution for you. — Firas Dib, Nov 27 '13 at 13:47
I would first remove the tags with `.gsub(/<.+?>/, '').strip.squeeze(' ')`. — spickermann, Nov 27 '13 at 13:52
Unfortunately, the closest I can get to a solution right now would be something like this: http://regex101.com/r/uH2sT1 - which is far from perfect. I would just avoid using regex for this problem really. — Firas Dib, Nov 27 '13 at 13:54
@Lindrian In this case a regex can't be avoided, but I agree with you in general. Please can you give me a solution in regex? Thanks for all the other help, but as stated I can already solve this problem by splitting on / removing just the span tag. – Rick Moss, Nov 27 '13 at 13:57
Obligatory link: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Andrew Grimm, Nov 28 '13 at 22:25

score 1 · Accepted Answer · edited Nov 28 '13 at 12:06

Although using regexp to get things out of HTML is usually bad, you could solve the problem in this way:

foo, bar = string.gsub(/<.*?>/, '').strip.split
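
As a quick check of the tag-stripping approach, here is a minimal sketch run against both strings from the question. The helper name `strip_tags` is made up for illustration, and the `squeeze(' ')` step is borrowed from spickermann's comment rather than from the line above; treat it as one possible way to tidy the whitespace, not a general HTML sanitizer.

```ruby
# Hypothetical helper: remove anything that looks like a tag, then tidy whitespace.
def strip_tags(html)
  html.gsub(/<.*?>/, '')  # lazy match, so each tag is removed on its own
      .strip              # drop leading/trailing whitespace
      .squeeze(' ')       # collapse runs of spaces left behind by removed tags
end

a = "<div> foo: <span>bar</span> </div>"
b = "<div> foo: bar <br> </div>"

puts strip_tags(a)  # foo: bar
puts strip_tags(b)  # foo: bar
```

This removes every tag, so it only fits when the surrounding text contains nothing else you need to keep.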

Edit: Well, then you might want to look into negative lookahead in regular expressions: (?!pattern)

string[/(foo.*?)<(?!\/?span)/, 1]
# match foo followed by all characters up to the first < that is not followed by span or /span
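
To see the negative lookahead in action on both strings from the question, here is a minimal sketch. It uses a lazy `.*?` so the match stops at the first `<` that does not start a span tag. The match still contains the span tags it skipped over, so a second step removes them. The helper name `extract_foo` and the cleanup step are assumptions added for illustration, not part of the one-liner above.

```ruby
# Hypothetical helper: match from "foo" up to the first "<" that does not
# open or close a span, then drop the span tags that were skipped over.
def extract_foo(html)
  # (?=<(?!\/?span\b)) : the next char is "<" and it is not followed by span or /span
  raw = html[/foo.*?(?=<(?!\/?span\b))/]
  return nil if raw.nil?

  raw.gsub(/<\/?span\b[^>]*>/, '')  # remove <span ...> and </span>
     .strip
     .squeeze(' ')
end

a = "<div> foo: <span>bar</span> </div>"
b = "<div> foo: bar <br> </div>"

puts extract_foo(a)  # foo: bar
puts extract_foo(b)  # foo: bar
```

If more tags than span need to be ignored, the alternation inside the lookahead would have to grow with them, which is where this approach starts to get fragile.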

edited Nov 28 '13 at 12:06 by the Tin Man
answered Nov 27 '13 at 13:50 by hirolau

Thanks for the help, but I can't do this as my actual strings contain more text than just foo: bar, so I must stop on the '<' character while ignoring the span tag in the regex. Do you know how to do it this way? – Rick Moss Nov 27 '13 at 13:59

score 1 · Answer 2 · answered Nov 28 '13 at 12:11

There are many, many reasons why you don't want to use regex to process HTML. Your example text is very simple, but real-world HTML is likely to be far more complex and variable, which makes a regular-expression-based solution very fragile.

Instead, start with the right tool and use a parser:

require 'nokogiri'

[
  "<div> foo: <span>bar</span> </div>",
  "<div> foo: bar <br> </div>"
].each do |str|
  doc = Nokogiri::HTML::DocumentFragment.parse(str)
  puts doc.at('div').text
end

Which outputs:

 foo: bar
 foo: bar

This uses Nokogiri, a very capable XML/HTML parser and the de facto standard XML/HTML parser for Ruby.
