
I created a regex that's supposed to move text inside of an adjoining <span> tag.

const fix = (string) => string.replace(/([\S]+)*<span([^<]+)*>(.*?)<\/span>([\S]+)*/g, "<span$2>$1$3$4</span>")

fix('<p>Given <span class="label">Butter</span>&#39;s game, the tree counts as more than one input.</p>')
// Results in:
'<p>Given <span class="label">Butter&#39;s</span> game, the tree counts as more than one input.</p>'

But if I pass it a string where there is no text touching a <span> tag, it takes a few seconds to run.

I'm testing this on Chrome and Electron.

demux
  • HTML parsing with regex? Hmm. – Darin Dimitrov Apr 24 '16 at 08:31
  • If you are concerned only with `span`, use this: `(.*?)<\/span>` (https://regex101.com/r/fL9rG0/1) – rock321987 Apr 24 '16 at 08:31
  • Also, I see an extra `*` in `([^<]+)*`, which I don't think is needed – rock321987 Apr 24 '16 at 08:33
  • If you don't have inner elements, replace `(.*?)` with `([^<]*)`. This will be much faster – Denys Séguret Apr 24 '16 at 08:33
  • One more thing: your regex suffers from catastrophic backtracking if `</span>` is not present – rock321987 Apr 24 '16 at 08:34
  • "Don't do this" is the best answer. Use any of the [methods for parsing HTML in JavaScript](http://stackoverflow.com/questions/10585029/parse-a-html-string-with-js). – tadman Apr 24 '16 at 08:46
  • @tadman, can you prove that it's faster to parse the html, manipulate it, and compile it into a string again? – demux Apr 24 '16 at 09:16
  • @demux The performance characteristics of a regular expression of this sort are wildly unpredictable. On some strings it might be faster, but on others it might jam up and take literally forever. I guarantee that the DOMParser solution will produce *consistent* results even if they're not as performant. If this is only running hundreds of times, that cost is utterly irrelevant. If this is running frequently, then I'd be extremely concerned about using that regular expression. – tadman Apr 24 '16 at 09:21

1 Answer


([\S]+)* and ([^<]+)* are the culprits: these nested quantifiers cause catastrophic backtracking when there is no </span>. You need to modify your regex to

([\S]*)<span([^<]*)>(.*?)<\/span>([\S]*)

It will work, but it's still not efficient.
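A rough way to see why the nested quantifiers are so costly: a run of n non-whitespace characters can be cut into consecutive non-empty chunks in 2^(n-1) ways, and a backtracking engine may try each of them for a pattern like (\S+)* before giving up. The helper below, `countSplits`, is a hypothetical illustration and not part of the original code; it only counts those splits.

```javascript
// Hypothetical helper: counts the ways to cut a run of `n` characters
// into consecutive non-empty chunks. Each way is one candidate that a
// backtracking engine may explore for a pattern like (\S+)*.
const countSplits = (n) => {
  // ways[i] = number of ways to split the first i characters
  const ways = [1];
  for (let i = 1; i <= n; i++) {
    let total = 0;
    // the last chunk can start at any earlier position j
    for (let j = 0; j < i; j++) total += ways[j];
    ways.push(total);
  }
  return ways[n];
};

console.log(countSplits(5));  // 16
console.log(countSplits(20)); // 524288
```

The count doubles with every extra character, which is consistent with the multi-second run times reported in the question once the input contains a long stretch of text with no match.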

Why use a character class for \S? The above reduces to

(\S*)<span([^<]*)>(.*?)<\/span>(\S*)
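As a minimal sketch, here is the `fix` function from the question with that reduced pattern dropped in. The replacement string is unchanged from the question, and the `input` variable is only there for illustration.

```javascript
// Same replacement as in the question, but without the nested quantifiers.
const fix = (string) =>
  string.replace(/(\S*)<span([^<]*)>(.*?)<\/span>(\S*)/g, "<span$2>$1$3$4</span>");

const input = '<p>Given <span class="label">Butter</span>&#39;s game, the tree counts as more than one input.</p>';

console.log(fix(input));
// <p>Given <span class="label">Butter&#39;s</span> game, the tree counts as more than one input.</p>
```

Note that \S* also captures an adjacent tag when there is no whitespace in between, so `<p><span>x</span>` would have its `<p>` moved inside the span. That behaviour is inherited from the original pattern.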

If you are concerned only with the content of the span, use this instead

<span([^<]*)>(.*?)<\/span>

Check here (see the reduction in the number of steps)

NOTE: Finally, don't parse HTML with regex if there are tools that can do it much more easily.

rock321987