
So far the only approach I have working is matching on whitespace. I would like to be able to match arbitrary HTML tags and punctuation as separate elements too.

var text = "<div>The Quick brown fox ran through it's forest darkly!</div>"

//this one uses spaces only but will match "darkly!</div>" as 1 element
console.log(text.match(/\S+/g));

//outputs: ["<div>The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly!</div>"]

I want a matching expression that will output:

["<div>", "The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly", "!", "</div>"]

Here is a fiddle: https://jsfiddle.net/scottpatrickwright/og0bd0xj/2/

Ultimately I am going to store all of the matches in an array, do some processing (wrap every whole word in a span tag with a conditional data attribute) and re-output the original string in an altered form. I mention this because solutions that don't leave the string more or less intact wouldn't work.

I am finding lots of near-miss solutions online; however, my regex skills are not good enough to adapt them.

Scott Wright
  • This path is fraught with peril. You'd be better off in the long run using a dedicated HTML parser. – Palpatim May 27 '15 at 15:34
  • Is there a reason for matching expressions with HTML? Could you not obtain the nodeValue or textContent property and match against that? – MinusFour May 27 '15 at 15:35
  • I heard that [regex cannot be used to parse html](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) – Ryan Wheale May 27 '15 at 15:38
  • The toughest part of what you are trying to do is split `darkly!` into `darkly` and `!`, but not `it's` into `it`, `'`, and `s`. I'm pretty sure that @Palpatim is right . . . I'm honestly not sure that there is a reasonable regex solution. – talemyn May 27 '15 at 15:41
  • Not to mention other exceptions . . . what about something that contains "day-to-day" or "12:00 AM" . . . my guess is that you would want those to stay grouped as well? – talemyn May 27 '15 at 15:46
  • Thanks for the input. I have been hearing that there is a fundamental issue with parsing HTML with regex. Apparently it's not possible to do with pure regular expressions (due to their fundamental limitations, see: http://en.wikipedia.org/wiki/Chomsky_hierarchy). In this case Eric has found something that works pretty well. – Scott Wright May 27 '15 at 20:44

3 Answers


How about:

/(<\/?)?[\w']+>?|[!\.,;\?]/g

Demonstrated here.
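
As a rough check, the sketch below runs this pattern against the sample string from the question. The names `tokenPattern` and `wrapWords` are made up here for illustration, and the span wrapping is only an assumption about the processing the question describes.

```javascript
var text = "<div>The Quick brown fox ran through it's forest darkly!</div>";

// The pattern from this answer: a simple tag or a word (apostrophes allowed),
// or a single punctuation character.
var tokenPattern = /(<\/?)?[\w']+>?|[!\.,;\?]/g;

var tokens = text.match(tokenPattern);
console.log(tokens);
// → ["<div>", "The", "Quick", "brown", "fox", "ran", "through",
//    "it's", "forest", "darkly", "!", "</div>"]

// Hypothetical helper: wrap whole words in a span and leave tags,
// punctuation and whitespace untouched, so the string stays intact.
function wrapWords(input) {
  return input.replace(tokenPattern, function (token) {
    var isWord = /^[\w']+$/.test(token);
    return isWord ? "<span>" + token + "</span>" : token;
  });
}

console.log(wrapWords(text));
```

Using `replace` with a callback rather than joining the matched array keeps the original whitespace. Note that the pattern only recognizes simple tags without attributes; something like `<a href="...">` would be broken into several pieces.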

Eric Leibenguth

You could just add a space before and after the HTML tags like so:

var text = "<div>The Quick brown fox ran through it's forest darkly!</div>"
text = text.replace(/\<(.*?)\>/g, ' <$1> ');
console.log(text.match(/\w+|\S+/g)); // ## Credit to George Lee ##
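
One detail worth checking against the sample: `\w` does not include the apostrophe, so `\w+|\S+` yields `it` and `'s` as separate elements. A possible tweak, assuming apostrophes should stay inside words, is to add `'` to the word class. The `tokenize` helper below is hypothetical and only there to compare the two patterns.

```javascript
// Hypothetical helper: pad tags with spaces, then match with the given pattern.
function tokenize(input, pattern) {
  var padded = input.replace(/<(.*?)>/g, " <$1> ");
  return padded.match(pattern);
}

var text = "<div>The Quick brown fox ran through it's forest darkly!</div>";

// Pattern as posted: "it's" comes back as "it" and "'s".
var original = tokenize(text, /\w+|\S+/g);
console.log(original);

// Assumed tweak: allow apostrophes inside words so "it's" stays whole.
var tweaked = tokenize(text, /[\w']+|\S+/g);
console.log(tweaked);
// → ["<div>", "The", "Quick", "brown", "fox", "ran", "through",
//    "it's", "forest", "darkly", "!", "</div>"]
```

Because `\S+` grabs any run of non-space characters, a tag with attributes separated by spaces would still be split into several elements.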
Jamie Barker
  • @talemyn perhaps, however OP says "I want `X` to turn into `Y`" and said code above does just that. You could spend hours _speculating_ what he might and might not need. As for me, I'll happily carry on with my life until the off chance that he says "Oh... what about `Z`?" – Jamie Barker May 27 '15 at 15:51

My suggestion would be:

console.log(text.match(/(<.+?>|[^\s<>]+)/g));

The regex (<.+?>|[^\s<>]+) has two alternatives:

<.+?> matches every <text> string, i.e. each tag
[^\s<>]+ matches every run of characters that contains no whitespace, < or >

In the second one you can add any other characters you want to exclude.
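
One way to act on that last point, sketched under the assumption that only `! . , ; ?` count as punctuation: exclude those characters from the word class and add a third alternative that matches each of them on its own.

```javascript
var text = "<div>The Quick brown fox ran through it's forest darkly!</div>";

// 1) a whole tag, 2) a run of characters that are not whitespace, angle
// brackets or the listed punctuation, 3) a single punctuation character.
var pattern = /<.+?>|[^\s<>!.,;?]+|[!.,;?]/g;

var parts = text.match(pattern);
console.log(parts);
// → ["<div>", "The", "Quick", "brown", "fox", "ran", "through",
//    "it's", "forest", "darkly", "!", "</div>"]
```

The apostrophe is not in the excluded set, so `it's` stays whole. Values such as `12:00` also stay together, while anything containing a listed character (for example `3.14`) would be split.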

Pęgaz
  • Thanks for the help and the explanation, appreciated. This works well except that it keeps the punctuation attached to the adjacent word. So you get: ["<div>", "The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly!", "</div>"] instead of: ["<div>", "The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly", "!", "</div>"] – Scott Wright May 27 '15 at 20:47