
I was working on a parser that reads HTML; however, the code that splits the input inserts an "l" into every other entry of the resulting array.

The regexp is this:

textarea.value.split(/(?=<(.|\n)+>)/)

It is supposed to split the input at opening, closing, and self-closing HTML/XML tags while ignoring tabs and line terminators (those simply stay attached to the tag they were split with).

May I have some insight into what's happening? You can see the code in action and edit it here: http://jsfiddle.net/termtm/ew7Mt/2/ Just look in the console for the result it produces.

EDIT: MaxArt is right, the "l" in the last <html> tag is why the stray entries are "l"s.

TERMtm
  • [HTML and regexps don't mix](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) - 'nuff said – Alnitak May 30 '12 at 08:25
  • @Alnitak thanks for the link, I enjoyed that. – TERMtm May 30 '12 at 09:54

1 Answer


Try this:

textarea.value.split(/(?=<[^>]+>)/);

But... what Alnitak said. A fully-fledged HTML parser based on regexps, especially with the poor regexp feature support in JavaScript, would be a terrible (and slow) mess.

I still have to find out the reason for the odd behaviour you found. Notice that "l" (ell) is the last letter of "<html>", i.e., the first tag of your HTML code. Change it to something else and you'll notice the letters change.
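To make the effect easy to reproduce outside the fiddle, here is a minimal sketch. It assumes a short hard-coded string in place of `textarea.value`; the variable names (`sampleHtml`, `sampleXml`, and so on) are made up for illustration and are not from the original code.

```javascript
// Hypothetical sample input standing in for textarea.value.
const sampleHtml = "<html><body>text</body></html>";

// Pattern from the question: the capturing group (.|\n) ends up in the result.
const withCapture = sampleHtml.split(/(?=<(.|\n)+>)/);
// → ["<html>", "l", "<body>text", "l", "</body>", "l", "</html>"]

// Pattern suggested above: a character class, no capturing group.
const withClass = sampleHtml.split(/(?=<[^>]+>)/);
// → ["<html>", "<body>text", "</body>", "</html>"]

// Same experiment with different tag names: the stray letter changes with them.
const sampleXml = "<root><item>text</item></root>";
const withCaptureXml = sampleXml.split(/(?=<(.|\n)+>)/);
// → ["<root>", "t", "<item>text", "t", "</item>", "t", "</root>"]

console.log(withCapture);
console.log(withClass);
console.log(withCaptureXml);
```

In both inputs the stray entry is the character just before the final `>` of the string, because the greedy `(.|\n)+` runs to the end of the input and the group keeps only its last iteration.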

MaxArt
  • nice find, know what causes it yet? – TERMtm May 30 '12 at 09:51
  • @TERMtm Yes, it's caused by the fact that the sequence `(.|\n)` is a capturing group. Change it to `(?:.|\n)` and it should be fine. I still have to understand **why** a capturing group in a regexp used in `split` causes this issue (try `"foobarbaz".split(/(b)/)` too), but maybe it's standard behaviour that I don't know about, and combined with a zero-length separator it causes the effect described. Pro-tip: never use capturing groups in `split`. – MaxArt May 30 '12 at 10:15
  • Thanks for the help. I read up on regexps and saw how that non-greedy example of yours works. Wish I could +1 more, you've been a great help. – TERMtm May 30 '12 at 18:56
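On the open question in the comments above: this is standard behaviour. When the separator passed to `String.prototype.split` is a regexp with capturing groups, the captured text is spliced into the result array after each piece. A small sketch using the `"foobarbaz"` example from the comment (the variable names are only illustrative):

```javascript
// No group: the separator is dropped.
const plain = "foobarbaz".split(/b/);
// → ["foo", "ar", "az"]

// Capturing group: each captured "b" is inserted into the result.
const captured = "foobarbaz".split(/(b)/);
// → ["foo", "b", "ar", "b", "az"]

// Non-capturing group: behaves like the plain pattern.
const nonCapturing = "foobarbaz".split(/(?:b)/);
// → ["foo", "ar", "az"]

// Zero-length separator (lookahead) with a capturing group inside:
// nothing is consumed, yet the capture is still inserted.
const lookahead = "foobarbaz".split(/(?=(b))/);
// → ["foo", "b", "bar", "b", "baz"]

console.log(plain, captured, nonCapturing, lookahead);
```

The last case mirrors the pattern in the question: the lookahead consumes nothing, so the tags stay intact, but whatever the group captured is still added between the pieces.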