
I have a small problem with a simple tokenizer regex:

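def test_tokenizer_regex_limit
  # 400 repetitions of markup that is not a "tag:" element
  string = '<p>a</p>' * 400
  # Each match is a pair: [matched_tag, other_text]
  tokens = string.scan(/(<\s*tag:.*?\/?>)|((?:[^<]|\<(?!\s*tag:.*?\/?>))+)/)
end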

Basically it runs through the text and gets pairs of [ matched_tag , other_text ]. Here's an example: http://rubular.com/r/f88JBjfzFh
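To make the pair structure concrete, here is a small sketch that runs the same pattern over a short input. The sample string and the constant name TOKEN are made up for illustration; only the pattern comes from the test above.

```ruby
# The pattern from the test above, unchanged.
TOKEN = /(<\s*tag:.*?\/?>)|((?:[^<]|\<(?!\s*tag:.*?\/?>))+)/

# Hypothetical input: one "tag:" element surrounded by ordinary markup.
sample = 'a<tag:x/>b<p>c</p>'

# scan returns one [matched_tag, other_text] pair per match;
# the group that did not participate in the match is nil.
pairs = sample.scan(TOKEN)

pairs.each { |tag, text| puts "tag=#{tag.inspect} text=#{text.inspect}" }
# pairs == [[nil, "a"], ["<tag:x/>", nil], [nil, "b<p>c</p>"]]
```

Note that `<p>` and `</p>` land in the second group because the negative lookahead only rejects a `<` that starts a `tag:` element.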

It works fine for smaller inputs. If you run it under Ruby 1.8.7 it blows up; under 1.9.2 it works fine.

Any ideas on how to simplify or improve this? My regex-fu is weak.

Grocery

1 Answer


This is a bit simpler, but not by much:

(<[^<]*:[^<]*>)|((?:[^<]|<[^:]*>)+)

(<.*?>|[^<>]+)
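Another option, offered only as a rough sketch, is to drop the repeated alternation group and let String#split do the work. This assumes the repeated `(?:...|...)+` group is what fails on long input under 1.8.7, which has not been verified here, and the sketch itself has not been tested on 1.8.7. The method name tokenize is made up for illustration. When the split pattern contains a capture group, Ruby includes the captured text in the result, so the pieces alternate between plain text (even indices) and tags (odd indices).

```ruby
# Hypothetical helper: returns [matched_tag, other_text] pairs,
# in the same shape as the scan in the question.
def tokenize(string)
  pairs = []
  # The capture group makes split keep the tags in the result:
  # even indices are the text between tags, odd indices are the tags.
  string.split(/(<\s*tag:.*?\/?>)/).each_with_index do |piece, i|
    next if piece.empty?  # empty text before a leading tag or between adjacent tags
    pairs << (i.odd? ? [piece, nil] : [nil, piece])
  end
  pairs
end

tokenize('a<tag:x/>b<p>c</p>')
# => [[nil, "a"], ["<tag:x/>", nil], [nil, "b<p>c</p>"]]

tokenize('<p>a</p>' * 400).length
# => 1 (no "tag:" element, so the whole string is one text token)
```

Only the tag pattern is ever matched, so the second capture group and its lookahead are no longer needed.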

tinifni
  • Kinda, but not really. I need two capture groups, like so: (<.*?>)|([^<>]+), and it's almost there. But it will match '' into the first group. I need only tags of this format in the first group; everything else should be in the second capture group. – Grocery Oct 28 '10 at 20:34
  • I updated my solution, but it's not much better than what you already have. I hope you find your answer! – tinifni Oct 28 '10 at 22:16