RegexpError: Stack overflow in regexp matcher

Question

I have a small problem with a simple tokenizer regex:

def test_tokenizer_regex_limit
  string = '<p>a</p>' * 400
  tokens = string.scan(/(<\s*tag:.*?\/?>)|((?:[^<]|\<(?!\s*tag:.*?\/?>))+)/)
end

Basically it runs through the text and collects pairs of [matched_tag, other_text]. Here's an example: http://rubular.com/r/f88JBjfzFh

It works fine for smaller inputs. If you run it under Ruby 1.8.7 it will blow up; 1.9.2 works fine.

Any ideas on how to simplify or improve this? My regex-fu is weak.

I'm not really (x)html parsing though. Just need to tokenize text like this: text | | more text | . As a matter of fact it could be a string like this: '' Not really parsable. — Grocery, Oct 28 '10 at 20:39

tinifni · Answer 1 · 2010-10-28T22:15:57.197


This is a bit simpler, but not by much:

(<[^<]*:[^<]*>)|((?:[^<]|<[^:]*>)+)

~~(<.*?>|[^<>]+)~~
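
Another option, as a rough sketch only: split on the tag pattern instead of scanning with a repeated alternation. The repeated group `(?:...|...)+` is the part that advances one character at a time, so dropping it should avoid the deep recursion, though I have not verified this on 1.8.7. The `TAG` constant and the `tokenize` helper are names made up for this example; the pattern itself is copied from the first capture group in the question.

```ruby
# Tag pattern copied from the question's first capture group.
# TAG and tokenize are illustrative names, not part of the original code.
TAG = /<\s*tag:.*?\/?>/

# Split on the tag pattern. Wrapping it in a capture group makes
# String#split keep the matched tags in the result, so the output is a
# flat list of text and tag tokens in document order.
def tokenize(string)
  string.split(/(#{TAG})/).reject { |token| token.empty? }
end

tokens = tokenize('text <tag:a /> more <p>x</p>')
# tokens is ["text ", "<tag:a />", " more <p>x</p>"]
```

Note that this returns a flat list rather than the [matched_tag, other_text] pairs that `scan` produces, so the calling code would need to check each token against `TAG` to tell tags from text.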

edited Oct 28 '10 at 22:15

answered Oct 28 '10 at 20:21

tinifni

kinda, but not really. I need two capture groups like so: (<.*?>)|([^<>]+) and it's almost there. But! It will match '' into the first group. I need to put tags only of this format everything else should be in the second capture group. – Grocery Oct 28 '10 at 20:34
I updated my solution, but it's not much better than what you already have. I hope you find your answer! – tinifni Oct 28 '10 at 22:16
