Possible Duplicate:
How to ignore whitespace in a regular expression subject string?

I am using the following:

"<a href=\"(.+?)\">(.+?)</a>"

to match:

"<a href="x">xx</a>"

However, sometimes my users enter the following:

"<a   href="x" >xx</a>"
"<a href="x">xx</a>"
"<a href="x"   >xx</a>"

How can I modify the regex so that it matches the three strings above, however many spaces they contain?

Community
  • Please, don't post a link to that dont-parse-html-with-regex answer... – BlackBear Dec 20 '12 at 15:24
  • It is a simple answer if you think only about regex, but using regex to process HTML is usually not a good idea, because in your case the user may add a line break, add four spaces, add another attribute, etc. – Giedrius Dec 20 '12 at 15:24
  • Why do you want to use a regex? Why not use something like string.Replace()? – Liath Dec 20 '12 at 15:24
  • @BlackBear - Why not? There's good reasons for it. – Bobson Dec 20 '12 at 15:26
  • Check this one out: http://stackoverflow.com/questions/206717/how-do-i-replace-multiple-spaces-with-a-single-space-in-c – Davin Tryon Dec 20 '12 at 15:28
  • Or this one: http://stackoverflow.com/questions/1981349/regex-to-replace-multiple-spaces-with-a-single-space – Davin Tryon Dec 20 '12 at 15:28

4 Answers

One solution would be to add \s* where whitespace is legal but not required, and \s+ where whitespace is required, like this:

<a\\s+href\\s*=\\s*\"([^\"]*)\"\\s*>([^<]*)</a>
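
As a rough sketch of how this pattern behaves in C#, here it is written as a verbatim string (backslashes stay literal, embedded quotes are doubled) and run against the inputs from the question. The class name `AnchorMatcher` and the method `Parse` are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorMatcher
{
    // Same pattern as above, as a verbatim string:
    // \s+ where whitespace is required, \s* where it is optional.
    private static readonly Regex Anchor = new Regex(
        @"<a\s+href\s*=\s*""([^""]*)""\s*>([^<]*)</a>");

    // Hypothetical helper: returns (href, text) for the first matching
    // anchor in the input, or null when nothing matches.
    public static Tuple<string, string> Parse(string input)
    {
        Match m = Anchor.Match(input);
        return m.Success
            ? Tuple.Create(m.Groups[1].Value, m.Groups[2].Value)
            : null;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a href=\"x\">xx</a>",
            "<a   href=\"x\" >xx</a>",
            "<a href=\"x\"   >xx</a>",
        };

        foreach (string s in samples)
        {
            var result = Parse(s);
            Console.WriteLine(result == null
                ? "no match: " + s
                : "href=" + result.Item1 + " text=" + result.Item2);
        }
    }
}
```

Each of the three samples should print `href=x text=xx`. Because the second group is `[^<]*`, link text that contains nested tags will not match.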

On the other hand, this is precisely an example of why one shouldn't attempt to parse XML or HTML with regex: it is simply the wrong tool for the job. Using one of the several XML parsing techniques available in .NET would be a much better alternative.
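
For illustration, here is a minimal sketch of the parser route using LINQ to XML (`XElement.Parse` from `System.Xml.Linq`). It assumes each input is a single, well-formed XML element, which real-world HTML often is not; for arbitrary HTML a dedicated HTML parser would be the safer choice. The names `AnchorParser` and `Parse` are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class AnchorParser
{
    // Hypothetical helper: parses one well-formed <a> element and returns
    // (href, text). XElement.Parse throws System.Xml.XmlException when the
    // input is not well-formed XML.
    public static Tuple<string, string> Parse(string input)
    {
        XElement a = XElement.Parse(input);
        if (a.Name.LocalName != "a")
        {
            throw new ArgumentException("Expected an <a> element.", "input");
        }

        // The explicit cast yields null when the attribute is absent.
        string href = (string)a.Attribute("href");
        return Tuple.Create(href, a.Value);
    }

    public static void Main()
    {
        var result = Parse("<a   href=\"x\" >xx</a>");
        Console.WriteLine("href=" + result.Item1 + " text=" + result.Item2);
    }
}
```

The XML parser already tolerates extra whitespace between the tag name, the attributes and the closing `>`, so no special handling is needed for the inputs in the question.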

Sergey Kalinichenko
  • +1 Using a verbatim string would make it much more readable though – BlackBear Dec 20 '12 at 15:27
  • @BlackBear I thought about it, but it would require me to double the quotes, so it's a tradeoff either way (although I agree that there are only three quotes and four `\s`s, so verbatim wins on points). – Sergey Kalinichenko Dec 20 '12 at 15:29
  • @dasblinkenlight - Can you tell me the names of some XML parsing techniques so I can google them and look into them? Thanks very much. –  Dec 20 '12 at 15:31
  • @Anne Look up "XmlReader" and `LINQ2XML`. There's also [this answer](http://stackoverflow.com/q/56107/335858) if you are interested in parsing HTML. – Sergey Kalinichenko Dec 20 '12 at 15:34

You can use the negative lookahead assertion (?!\s), which makes the match fail if there is whitespace at that position:

<a (?!\s)href=\"(?!\s)(.+?)\"(?!\s)>(?!\s)(.+?)</a>

But just from the number of times this needs to be added, you can see that using a regex for this is probably not the correct approach.

Blachshma

The symbol you want is +. A space followed by + will match one or more spaces, and a space followed by * will match zero or more:

<a +href=\"(.+?)\" *>(.+?)</a>

However, parsing html via regular expressions is generally a bad idea.

Bobson

This is a little funky and probably not the best, but here it goes:

string.Join(" ", s.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries))

edit: (I know it's not regex)
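
For what it's worth, here is the one-liner wrapped in a small runnable sketch so its effect on the question's input is visible. The names `WhitespaceNormalizer` and `CollapseSpaces` are invented for this example.

```csharp
using System;

public static class WhitespaceNormalizer
{
    // Splits on spaces, drops the empty entries produced by consecutive
    // spaces, and re-joins the pieces with a single space.
    public static string CollapseSpaces(string s)
    {
        return string.Join(" ",
            s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    }

    public static void Main()
    {
        Console.WriteLine(CollapseSpaces("<a   href=\"x\" >xx</a>"));
    }
}
```

This prints `<a href="x" >xx</a>`: runs of spaces collapse to one, but the single space before `>` remains, so the pattern from the question would still need an optional space there. It also collapses spaces inside the attribute value and the link text, which may or may not be wanted.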

snurre