How to extract html attributes via regex

Question

I would like to see how a regex can be used to get attribute names and values from an HTML tag. Yes, I know that an XML/HTML parser can be used, but this is for testing my regex ability. For example, given these HTML elements:

<input name=dir value=">">
<input value=">" name=dir >

How would I extract:

(?<name>...) and (?<value>...)

Is it possible once you have matched something to go "back" to the start of the match? For example:

<(?P<element>\w+).+(?:value="(?P<value>[^"])")@@@@.+(?:name="(?P<name>[^"])")

Where @@@@ basically means "go back to the start of the previous match/capture group" (so that I don't have to spell out every possible ordering of the attributes). How could this be done?

Pretty sure this question touches on the exact reason why this doesn't work. The whole "go back to the start of the previous string" idea is what a pushdown automaton does, something regular languages (i.e. regex) cannot handle. — AER, Nov 07 '19 at 00:00
@AER I see. So basically you'd need to do some sort of combinatorics for this to work for all the possible positions? — samuelbrody1249, Nov 07 '19 at 00:37
It's a bit of a hunch, I'm not a CS student. But it's something that requires memory. Regex only uses what it's presented with. It doesn't go back and remember what it did on the previous step. I'd turn this into an answer if I knew more, because I suspect this is actually a great example to highlight why it doesn't work. — AER, Nov 07 '19 at 00:44
See this Wikipedia article: https://en.wikipedia.org/wiki/Chomsky_hierarchy And note the memory used by all the different machines that create the different languages. — AER, Nov 07 '19 at 00:45
You can't, see [... FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆](https://stackoverflow.com/a/1732454/7505395) - thats why parsers exist. If you want to train regex, try [https://regexcrossword.com/](https://regexcrossword.com/) — Patrick Artner, Nov 07 '19 at 06:30

score 0 · Accepted Answer · answered Nov 07 '19 at 06:17

Yes, using a parser is the best way.
As stated in the comments, you cannot (easily) extract all information in one sweep.
You can achieve what you want with several regexes:

input.*?name=(?'name'[^ ]+)

input.*?value="(?'value'[^"]+)"

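As a rough illustration, the two-pattern approach could be sketched in Python like this. Python spells named groups `(?P<name>...)` rather than `(?'name'...)`, and the helper name `extract_attrs` is made up for this example. The name pattern stops at whitespace or `>` instead of only at a space, which is an assumption about where an unquoted value ends. `COMBINED_RE` is a possible single-pattern variant that is not part of the answer above: each lookahead scans forward from the same position, just after the element name, which approximates the "go back to the start" behaviour asked about, so attribute order does not matter.

```python
import re

# Two separate patterns, adapted from the ones above to Python's (?P<name>...) syntax.
# Assumption: the name value is unquoted and ends at whitespace or '>'.
NAME_RE = re.compile(r'input.*?\bname=(?P<name>[^\s>]+)')
# Assumption: the value attribute is always double-quoted and non-empty.
VALUE_RE = re.compile(r'input.*?\bvalue="(?P<value>[^"]+)"')

# Hypothetical single-pattern variant: both lookaheads start at the same
# position (right after the element name), so either attribute may come first.
COMBINED_RE = re.compile(
    r'<(?P<element>\w+)'
    r'(?=.*?\bname=(?P<name>[^\s>]+))'
    r'(?=.*?\bvalue="(?P<value>[^"]+)")'
)


def extract_attrs(tag: str) -> dict:
    """Run each pattern independently and collect whatever matched."""
    result = {}
    for key, pattern in (("name", NAME_RE), ("value", VALUE_RE)):
        match = pattern.search(tag)
        result[key] = match.group(key) if match else None
    return result


if __name__ == "__main__":
    for tag in ('<input name=dir value=">">', '<input value=">" name=dir >'):
        print(extract_attrs(tag))                     # {'name': 'dir', 'value': '>'}
        print(COMBINED_RE.search(tag).groupdict())    # {'element': 'input', 'name': 'dir', 'value': '>'}
```

Both versions assume one tag per line and a double-quoted value. With several tags on the same line, `.*?` can run into the next tag, which is one more reason a parser is the safer tool.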
