
I have a form with a number of input and select HTML elements. Every input and select has a name, but not all of them have a value (select elements have none by default), and at least one select element has no quotation marks (") around its name attribute.

I want to get all the names and values with one expression.

Here is the form (with the \n and \r characters removed):

http://pastebin.com/QaXNqcHH

And here is my code:

MatchCollection mtches;

mtches = Regex.Matches(registerForm, "(?:(?:<input)|(?:<select))[^>]*?name=\"?(?<name>.+?)(?:(?:\")|(?:>))[^>]*?(?:value=\"(?<value>.*?)\")?[^>]*?> ");

I successfully got the name of every input and select, but the expression doesn't extract the values: the value group never matches.

newfurniturey
Matan Givoni
  • The value attribute can legally come before the name attribute. Single quotes are legal substitutes for double quotes. [Greater-than symbols can legally appear inside quoted attribute values](http://stackoverflow.com/q/94528/211627). Tag/attribute tokens may contain mixed case (not always legal, but browsers won't complain). Whitespace may appear between tokens (e.g. `name = 'firstname'`). Newlines may appear in an attribute value (`.*` does not match newlines by default). `name` and `value` are optional attributes. There are so many problems with using regex, it's easier to use a parser. – JDB Dec 31 '12 at 14:48

1 Answer


Don't use regex to parse HTML. Here's an SO member who was driven to the brink of insanity by this very subject: https://stackoverflow.com/a/1732454/1548853

Find yourself an HTML parser that you like and that is easy to work with.
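
As a rough sketch of what that could look like in C#, the snippet below uses HtmlAgilityPack. This is only one possible choice: it is a third-party NuGet package that the question does not mention. The `FormFields` class, the `Extract` method and the sample markup are hypothetical names and data made up for illustration. Only `registerForm` comes from the question. The sketch collects every named input and select, tolerates unquoted attributes, and takes a select's value from the option marked `selected`, if there is one.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: third-party package, not part of the .NET base library

static class FormFields
{
    // Returns a name/value pair for every <input> and <select> that has a name attribute.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var fields = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//input[@name] | //select[@name]");
        if (nodes == null)
            return fields;

        foreach (HtmlNode node in nodes)
        {
            string name = node.GetAttributeValue("name", "");
            string value;
            if (node.Name == "select")
            {
                // A select carries no value attribute itself; use the selected option, if any.
                HtmlNode selected = node.SelectSingleNode(".//option[@selected]");
                value = selected == null ? "" : selected.GetAttributeValue("value", "");
            }
            else
            {
                // Inputs without a value attribute yield an empty string.
                value = node.GetAttributeValue("value", "");
            }
            fields.Add(new KeyValuePair<string, string>(name, value));
        }
        return fields;
    }

    static void Main()
    {
        // Hypothetical sample markup; in the question this would be the registerForm string.
        string registerForm =
            "<form><input type=\"text\" name=\"user\" value=\"bob\">" +
            "<select name=country><option value=\"us\" selected>US</option></select></form>";

        foreach (var field in Extract(registerForm))
            Console.WriteLine(field.Key + " = " + field.Value);
    }
}
```

Because the parser handles attribute order, quoting style and letter case, none of those cases needs special treatment in the calling code.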

Community
Firas Dib
  • I like to point out to new or inexperienced programmers that their gut feeling is correct... [it is possible to parse HTML with regex, but very, very hard](http://stackoverflow.com/a/4234491/211627). It is much, much easier to use a proper HTML parser. – JDB Dec 31 '12 at 14:33
  • And I would add that if you have to ask how to parse HTML with regexes, then you do not know regexes well enough to parse HTML. It's like asking how to strap on skis so that you can participate in an Olympic ski jump. – JDB Dec 31 '12 at 14:38
  • In theory, you cannot fully parse HTML with regex, since regular expressions describe regular languages and HTML is not a regular language. I don't want to dig too deep into this, since language discussions are so vast. Just know that only a few regular expression engines let you parse, or almost parse, HTML, and they are not strictly regular anymore. – Firas Dib Dec 31 '12 at 14:57
  • I would like to say this is my first time using Regex, and you are right: it is not a good tool for parsing HTML. – Matan Givoni Dec 31 '12 at 16:07