Use regex to separate any string into an array of whole words, punctuation & html tags

Question

All I have found that works at the moment is using spaces to match on. I would like to be able to match arbitrary HTML tags and punctuation.

var text = "<div>The Quick brown fox ran through it's forest darkly!</div>"

//this one uses spaces only but will match "darkly!</div>" as 1 element
console.log(text.match(/\S+/g));

//outputs: ["<div>The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly!</div>"]

I want a matching expression that will output:

["<div>", "The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly", "!", "</div>"]

Here is a fiddle: https://jsfiddle.net/scottpatrickwright/og0bd0xj/2/

Ultimately I am going to store all of the matches in an array, do some processing (add span tags with a conditional data attribute around every whole word) and re-output the original string in an altered form. I mention this because solutions that don't leave the string more or less intact wouldn't work.

I am finding lots of near-miss solutions online; however, my regex skills are not good enough to adapt them.

This path is fraught with peril. You'd be better off in the long run using a dedicated HTML parser. — Palpatim, May 27 '15 at 15:34
Is there a reason for matching expressions with HTML? Could you not obtain the nodeValue or textContent property and match against that? — MinusFour, May 27 '15 at 15:35
I heard that [regex cannot be used to parse html](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) — Ryan Wheale, May 27 '15 at 15:38
The toughest part of what you are trying to do is split `darkly!` into `darkly` and `!`, but not `it's` into `it`, `'`, and `s`. I'm pretty sure that @Palpatim is right . . . I'm honestly not sure that there is a reasonable regex solution. — talemyn, May 27 '15 at 15:41
Not to mention other exceptions . . . what about something that contains "day-to-day" or "12:00 AM" . . . my guess is that you would want those to stay grouped as well? — talemyn, May 27 '15 at 15:46
Thanks for the input. I have been hearing that there is a fundamental issue with parsing HTML with regex. Apparently it's not possible to do with pure regular expressions (due to their fundamental limitations, see: http://en.wikipedia.org/wiki/Chomsky_hierarchy). In this case Eric has found something that works pretty well. — Scott Wright, May 27 '15 at 20:44

Eric Leibenguth · Accepted Answer · 2015-05-27T15:43:30.173

2

How about:

/(<\/?)?[\w']+>?|[!\.,;\?]/g

Demonstrated here.
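
As a quick check, the sketch below runs the expression against the sample string from the question and then reuses it to wrap each whole word in a span, which is roughly the processing the question describes. The helper names `tokenize` and `wrapWords` are made up for illustration, and the plain `<span>` wrapper (with no data attribute) is an assumption. The tag part of the pattern only covers bare tags such as <div> and </div>; tags with attributes would be broken apart, in line with the caveats raised in the comments.

```javascript
// The expression from the answer, unchanged.
var re = /(<\/?)?[\w']+>?|[!\.,;\?]/g;

// Hypothetical helper: split a string into tags, words and punctuation.
function tokenize(text) {
  return text.match(re) || [];
}

// Hypothetical helper: wrap every whole word in a span, leaving tags,
// punctuation and whitespace exactly where they were in the input.
function wrapWords(text) {
  return text.replace(re, function (token) {
    // Only tokens made purely of word characters/apostrophes count as words.
    return /^[\w']+$/.test(token) ? '<span>' + token + '</span>' : token;
  });
}

var text = "<div>The Quick brown fox ran through it's forest darkly!</div>";
var tokens = tokenize(text);
var wrapped = wrapWords(text);
console.log(tokens);
console.log(wrapped);
```

Doing the rewrite with `replace` and a callback, rather than joining the array back together, keeps the original spacing without having to track it separately.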

edited May 27 '15 at 15:43

answered May 27 '15 at 15:36

Eric Leibenguth


Jamie Barker · Answer 2 · 2015-05-27T15:38:03.417

0

You could just add a space before and after the HTML tags like so:

var text = "<div>The Quick brown fox ran through it's forest darkly!</div>"
text = text.replace(/\<(.*?)\>/g, ' <$1> ');
console.log(text.match(/\w+|\S+/g)); // ## Credit to George Lee ##
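
Run against the sample string, this gets close, but since `\w` does not include the apostrophe, `it's` comes back as `it` and `'s`. The sketch below shows that behaviour next to one possible tweak that treats the apostrophe as a word character. The tweak is an assumption borrowed from the `[\w']+` idea in the accepted answer and is not part of this answer; the variable names are illustrative.

```javascript
var text = "<div>The Quick brown fox ran through it's forest darkly!</div>";

// Pad every tag with spaces so each tag becomes its own whitespace-separated chunk.
var padded = text.replace(/<(.*?)>/g, ' <$1> ');

// Pattern as posted: \w+ stops at the apostrophe, so "it's" is split in two.
var asPosted = padded.match(/\w+|\S+/g);

// Assumed tweak: allow the apostrophe inside words.
var tweaked = padded.match(/[\w']+|\S+/g);

console.log(asPosted);
console.log(tweaked);
```

Note that the `\S+` fallback grabs any run of non-space characters, so adjacent punctuation marks such as `?!` would come back as a single element.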

edited May 27 '15 at 15:38

answered May 27 '15 at 15:34

Jamie Barker


@talemyn perhaps, however OP says "I want `X` to turn into `Y`" and said code above does just that. You could spend hours _speculating_ what he might and might not need. As for me, I'll happily carry on with my life until the off chance that he says "Oh... what about `Z`?" – Jamie Barker May 27 '15 at 15:51

score 0 · Answer 3 · answered May 27 '15 at 15:53

0

My suggestion would be:

console.log(text.match(/(<.+?>|[^\s<>]+)/g));

In the regex (<.+?>|[^\s<>]+) we specify two kinds of strings to capture:

<.+?> matches every <text> string, i.e. a tag
[^\s<>]+ matches every run of characters that contains no whitespace, < or >

In the second one you could add the characters you want to exclude from a match.
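
Following that last hint, one possible variant keeps a handful of punctuation characters out of the second alternative and matches each of them on its own in a third. The punctuation set `!.,;?` is an assumption chosen to fit the sample string; extend it as needed.

```javascript
var text = "<div>The Quick brown fox ran through it's forest darkly!</div>";

// <.+?>          -> a tag
// [^\s<>!.,;?]+  -> a run with no whitespace, angle brackets or listed punctuation
// [!.,;?]        -> a single punctuation character as its own element
var parts = text.match(/<.+?>|[^\s<>!.,;?]+|[!.,;?]/g);
console.log(parts);
```

Because the apostrophe is not in the excluded set, `it's` stays in one piece.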

answered May 27 '15 at 15:53

Pęgaz


Thanks for the help & the explanation - appreciated. This works well except that it includes the punctuation with the word adjacent to it. So you get: ["<div>", "The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly!", "</div>"] instead of: ["<div>", "The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly", "!", "</div>"] – Scott Wright May 27 '15 at 20:47
