Separate Matches in RegEx

Question

I am parsing HTML in JavaScript with this regex to select the attribute values on HTML elements:

/(\".+\")/g

It works fine when there is a single attribute, but when there are multiple attributes, like so:

<a href="#" class="button">See How</a>

it is matching from the first quote on the first attribute to the last quote on the second. How can I get the regex to identify the attribute values as separate matches?

*"I am parsing HTML with this regex"* - But why?! – Tomalak Mar 31 '16 at 20:07

score 1 · Answer 1 · answered Mar 31 '16 at 20:08

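Quantifiers are greedy by default, so .+ matches as much as it can, up to the last quote on the line. Adding ? after the quantifier makes it lazy. Try this: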

/(\".+?\")/g
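
A quick way to see the difference is to run both patterns against the anchor tag from the question. This is only a sketch; the variable names are illustrative and not from the original post:

```javascript
// Sample markup from the question.
const html = '<a href="#" class="button">See How</a>';

// Greedy: .+ runs to the last quote, so both attributes merge into one match.
const greedy = html.match(/(".+")/g);

// Lazy: .+? stops at the first closing quote, giving one match per attribute.
const lazy = html.match(/(".+?")/g);

console.log(greedy); // [ '"#" class="button"' ]
console.log(lazy);   // [ '"#"', '"button"' ]
```

The backslashes in the original pattern are harmless but not needed: a double quote does not have to be escaped inside a regex literal.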

Brad C

Rajaprabhu Aravindasamy · Answer 2 · 2016-03-31T20:10:38.020

You have to stop the greediness of that regular expression by placing ? after the quantifier:

/(\".*?\")/g

Also, you should use * instead of + here. With +, an empty attribute value such as "" cannot match on its own, so the match would extend into the next attribute, including its name.
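
The empty-attribute case can be checked with a small sketch. The markup below is a hypothetical variation on the question's example (the `href` value has been emptied), and the variable names are assumptions for illustration:

```javascript
// Hypothetical markup with an empty attribute value (not from the question).
const html = '<a href="" class="button">See How</a>';

// With +, at least one character must sit between the quotes, so the match
// starts at the empty value and runs on through the next attribute's name.
const withPlus = html.match(/(".+?")/g);

// With *, zero characters are allowed, so "" is matched on its own.
const withStar = html.match(/(".*?")/g);

console.log(withPlus); // [ '"" class="' ]
console.log(withStar); // [ '""', '"button"' ]
```

Note that the + version also loses the "button" value entirely, because its opening quote has already been consumed by the first match.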

edited Mar 31 '16 at 20:10

answered Mar 31 '16 at 20:07

Rajaprabhu Aravindasamy

I thought we would not recommend regular expressions to parse HTML on this site, especially not to people who don't know enough about regular expressions to be aware that they can't be used to parse HTML? – Tomalak Mar 31 '16 at 20:09
Or use a negated character class: `/("[^"]*")/g` – Arif Burhan Mar 31 '16 at 20:09
@Tomalak TBH, I just started learning regex yesterday and am practicing it now. Can you explain to me why this is not a good context to use regex? Along with the OP, I will also learn about it. Thanks. – Rajaprabhu Aravindasamy Mar 31 '16 at 20:12
This has been discussed to death. Seriously, it's not even funny anymore. Search for it. One hit, among many: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. Regular expressions can understand regular languages (hence the name). HTML is not a regular language and is therefore outside the domain of problems that regular expressions can solve. If regular expressions really could parse HTML, we would not have the complex HTML parsers that we have (look at the source code of an HTML parser to get an idea of how complex this is). – Tomalak Mar 31 '16 at 20:17
@Tomalak, I am taking HTML snippets that are to be delivered to a client and adding markup to them so they will display nicely as an HTML page for the client. Can you suggest a better method of doing this? – Alex Mar 31 '16 at 20:23
@Alex Yes, use a parser and a DOM API. This is the method of handling HTML. If your JavaScript runs on the server: there are HTML parser/DOM API packages for node.js, use one. If it runs on the client: web browsers are the most advanced HTML parsers in existence, so you can parse HTML in them basically for free. Either way, you will be a lot better off with a parser. Take it from someone with pretty solid regex skills. – Tomalak Mar 31 '16 at 20:31
@Tomalak, Yes, it would be on the server and run as a grunt task. I did a quick search for node modules that would do what I wanted, but couldn't find anything. I refined my query and found something that would have saved me a lot of time messing around with these regexes, which, obviously, are not my specialty. Thanks for the help. – Alex Mar 31 '16 at 20:56
@Alex See http://stackoverflow.com/questions/7372972/how-do-i-parse-a-html-page-with-node-js, among other search hits of the "how to parse HTML with node.js" variety. I know that regex seems quick and easy but the main error starts with the thought that HTML is "just a string". It's not. It's a complex data structure that has been serialized. If you use string tools on it (most prominently regex search&replace), you *will* mess up at some point and create run-time bugs at best, security holes at worst and maintenance nightmares in any case. – Tomalak Mar 31 '16 at 21:20
