I'm trying to extract attributes from an HTML string with a regex and collect them into an array of {key: '...', value: '...'} objects. I'm close, but when a value contains =, the match is split at that = instead of keeping the value whole.

Here's my current regex:

/(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?/g

Here's the test string:

url="https://www.youtube.com/watch?v=123456"

And here's the result: https://regex101.com/r/H2kqVp/1

As you can see, it's splitting after the = in the value, which it should ideally escape. Any idea how to get around that?
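
For reference, the same behaviour can be reproduced outside regex101 with a few lines of JavaScript. This is only a sketch: the names `attrPattern`, `input` and `pairs` are illustrative and are not taken from gulp-html-partial.

```javascript
// The pattern from the question, unchanged.
const attrPattern = /(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?/g;

// The test string from the question.
const input = 'url="https://www.youtube.com/watch?v=123456"';

// Turn every match into a {key, value} pair (group 1 and group 2).
const pairs = [...input.matchAll(attrPattern)].map(([, key, value]) => ({ key, value }));

console.log(pairs);
// key:   url="https://www.youtube.com/watch?v
// value: 123456
```

Because `\S+` is greedy and also matches `=` and quotes, group 1 runs up to the last `=` in the string, so the split happens inside the URL rather than after `url`.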

Edit: I'm utilizing gulp-html-partial which uses the regex above. I was hoping there was a simple way to modify the pattern to solve this issue.

scferg5
  • [Do not parse HTML with RegExp](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use DOMParser on your HTML string and the DOM APIs (Element.attributes in your case). – Touffy May 01 '18 at 13:19
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Phylogenesis May 01 '18 at 13:22
  • If you want to avoid matching `=` into Group 1 why do you use `\S+` then? `\S` matches `=`. And that is not the only issue with your pattern (e.g. the tempered greedy token is malformed). Please consider using appropriate tools to parse HTML/URL/etc. – Wiktor Stribiżew May 01 '18 at 13:22
  • If you really need to parse out YouTube URLs, there are a million posts for how to do that. If you need to do generic URL parameter ones, there are many posts for those, too. Regex is not needed and it's probably the worst approach – VLAZ May 01 '18 at 13:22
  • Long story short, I'm using [gulp-html-partial](https://github.com/xkxd/gulp-html-partial) and it uses this regex pattern to gather attributes to replace. I forked it and was hoping I could just modify the regex pattern to avoid this issue. – scferg5 May 01 '18 at 13:52
  • If gulp-html-partial is parsing HTML with regexes, then don't use it. You can't expect that to work reliably. It is bug-prone and possibly a security risk if user-generated content is involved. Also, URLs can be parsed with regexes (after you safely extract the URL), but the [URL API](https://url.spec.whatwg.org/#api) will do it for you and it's available both in browsers and Node.js now. – Touffy May 01 '18 at 14:02
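
A minimal sketch of the URL API suggestion from the comments above, assuming the URL has already been extracted from the attribute value. The names `videoUrl` and `params` are illustrative.

```javascript
// URL is a global in browsers and in current Node.js versions.
const videoUrl = new URL("https://www.youtube.com/watch?v=123456");

// searchParams iterates as [name, value] pairs, so no regex is needed
// to split the query string.
const params = [...videoUrl.searchParams].map(([key, value]) => ({ key, value }));

console.log(params);                          // one entry: key "v", value "123456"
console.log(videoUrl.searchParams.get("v"));  // "123456"
```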

1 Answer

To pull the query parameters out of the URL, you should be good with this regex pattern:

/(?:[?&])(\w+)(?:=)(\w+)/g

It returns two groups for each match: the parameter name and its value.


See the example at https://regex101.com/r/u9Z6AJ/1 and check "Match Information" on the right side of the screen, where the groups are listed.
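
As a rough illustration, here is the pattern above applied to the test string from the question. Note that it captures the query parameter (`v`), not the `url` attribute itself. For the attribute case, the second pattern is one possible alternative, not the one gulp-html-partial uses: it keeps `=` out of the key and reads a quoted value up to its closing quote. `queryPattern`, `attrPattern` and `parseAttributes` are hypothetical names, and the sketch assumes attribute values do not contain their own quote character.

```javascript
const input = 'url="https://www.youtube.com/watch?v=123456"';

// Pattern from this answer: group 1 is the parameter name, group 2 its value.
const queryPattern = /(?:[?&])(\w+)(?:=)(\w+)/g;
const queryParams = [...input.matchAll(queryPattern)].map(([, key, value]) => ({ key, value }));
console.log(queryParams); // key "v", value "123456"

// Alternative sketch for attributes (an assumption, not the library's pattern):
//   key   = run of characters that are neither whitespace nor "="
//   value = "double quoted" | 'single quoted' | bare token up to whitespace or ">"
const attrPattern = /([^\s=]+)=(?:"([^"]*)"|'([^']*)'|([^\s>]+))/g;

function parseAttributes(html) {
  return [...html.matchAll(attrPattern)].map(([, key, dq, sq, bare]) => ({
    key,
    // Exactly one of the three value groups is defined for each match.
    value: dq ?? sq ?? bare,
  }));
}

console.log(parseAttributes(input));
// key "url", value "https://www.youtube.com/watch?v=123456"
```

This still inherits the usual limits of matching HTML with a regex, as the comments on the question point out, so treat it as a stopgap for simple, well-formed attribute lists.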

Ωmega