How can I ignore additional spaces in a string using regex and C#?

Question

Possible Duplicate:
How to ignore whitespace in a regular expression subject string?

I am using the following:

"<a href=\"(.+?)\">(.+?)</a>"

to match:

"<a href="x">xx</a>"

However sometimes my users are entering the following:

"<a   href="x" >xx</a>"
"<a href="x">xx</a>"
"<a href="x"   >xx</a>"

How can I modify the regex so that it matches on one or many spaces in the three strings above?

Please, don't post a link to that dont-parse-html-with-regex answer... – BlackBear, Dec 20 '12 at 15:24
It is a simple answer if you think only about regex, but using regex to process HTML is usually not a good idea, because in your case the user may add a line break, add four spaces, add another attribute, etc. – Giedrius, Dec 20 '12 at 15:24
Why do you want to use a regex? Why not use something like string.Replace()? – Liath, Dec 20 '12 at 15:24
Check this one out: http://stackoverflow.com/questions/206717/how-do-i-replace-multiple-spaces-with-a-single-space-in-c – Davin Tryon, Dec 20 '12 at 15:28
Or this one: http://stackoverflow.com/questions/1981349/regex-to-replace-multiple-spaces-with-a-single-space – Davin Tryon, Dec 20 '12 at 15:28

score 2 · Accepted Answer · answered Dec 20 '12 at 15:24

One solution would be to add \s* where whitespace is legal but not required, and \s+ where whitespace is required, like this:

<a\\s+href\\s*=\\s*\"([^\"]*)\"\\s*>([^<]*)</a>

On the other hand, this is precisely an example of why one shouldn't attempt to parse XML or HTML with regex: it is simply a wrong tool for the job. Using one of several XML parsing techniques available in .NET would provide a much better alternative.
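To check the pattern against the three strings from the question, a minimal sketch could look like the following. The class name `AnchorMatcher` and the method `Extract` are made up for illustration; only the pattern itself comes from the answer. The verbatim-literal spelling is included as an assumed equivalent for readability.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class AnchorMatcher
{
    // Pattern from the answer, written as a regular (escaped) C# string literal.
    public const string Pattern = "<a\\s+href\\s*=\\s*\"([^\"]*)\"\\s*>([^<]*)</a>";

    // The same pattern as a verbatim literal: single backslashes, doubled quotes.
    public const string VerbatimPattern = @"<a\s+href\s*=\s*""([^""]*)""\s*>([^<]*)</a>";

    // Returns (href, link text) for the first matching anchor, or null if none matches.
    public static Tuple<string, string> Extract(string input)
    {
        Match m = Regex.Match(input, Pattern);
        if (!m.Success)
        {
            return null;
        }
        return Tuple.Create(m.Groups[1].Value, m.Groups[2].Value);
    }

    public static void Main()
    {
        // The three inputs from the question.
        string[] samples =
        {
            "<a   href=\"x\" >xx</a>",
            "<a href=\"x\">xx</a>",
            "<a href=\"x\"   >xx</a>"
        };

        foreach (string s in samples)
        {
            Tuple<string, string> r = Extract(s);
            Console.WriteLine(r == null ? "no match" : r.Item1 + " / " + r.Item2);
        }
    }
}
```

All three samples should yield the href `x` and the text `xx`. Note that `[^"]*` and `[^<]*` also accept an empty href or empty link text, unlike the `.+?` groups in the question.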

Sergey Kalinichenko


+1 Using a verbatim string would make it much more readable though – BlackBear Dec 20 '12 at 15:27
@BlackBear I thought about it, but it would require me to double the quotes, so it's a tradeoff either way (although I agree that there are only three quotes and four `\s`s, so verbatim wins on points). – Sergey Kalinichenko Dec 20 '12 at 15:29
@dasblinkenlight - Can you tell me the names of some XML parsing techniques so I can google them and look into them? Thanks very much. – Dec 20 '12 at 15:31
@Anne Look up "XmlReader" and `LINQ2XML`. There's also [this answer](http://stackoverflow.com/q/56107/335858) if you are interested in parsing HTML. – Sergey Kalinichenko Dec 20 '12 at 15:34
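As a rough illustration of the LINQ to XML route mentioned in the comment, the sketch below reads the same values with `XElement.Parse`. It assumes the input is a single well-formed XML element, which real-world HTML often is not. The names `AnchorParser` and `Parse` are hypothetical.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper showing the LINQ to XML approach.
public static class AnchorParser
{
    // Parses one well-formed <a> element and returns (href, link text).
    // XElement.Parse throws System.Xml.XmlException if the markup is not well-formed XML.
    public static Tuple<string, string> Parse(string markup)
    {
        XElement a = XElement.Parse(markup);
        // The explicit cast yields null when the attribute is absent.
        string href = (string)a.Attribute("href");
        return Tuple.Create(href, a.Value);
    }

    public static void Main()
    {
        Tuple<string, string> r = Parse("<a   href=\"x\" >xx</a>");
        Console.WriteLine(r.Item1 + " / " + r.Item2);
    }
}
```

Whitespace inside the tag is handled by the parser, so no pattern tuning is needed for extra spaces, line breaks, or additional attributes.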

score 0 · Answer 2 · answered Dec 20 '12 at 15:25

You can use the negative lookahead assertion (?!\s) so that the pattern won't match if there is whitespace at those positions...

<a (?!\s)href=\"(?!\s)(.+?)\"(?!\s)>(?!\s)(.+?)</a>

But just from the number of times this needs to be added, you can see that using a regex for this is probably not the correct approach.
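A quick way to see what the lookahead version does with the question's inputs is sketched below (the class name `LookaheadCheck` and the method `Matches` are invented for this example). With this pattern, strings containing extra whitespace are rejected rather than tolerated, so it seems suited to validating strict input, not to accepting the looser forms.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper that wraps the lookahead pattern from this answer.
public static class LookaheadCheck
{
    // (?!\s) fails the match whenever the next character is whitespace.
    public const string Pattern = "<a (?!\\s)href=\"(?!\\s)(.+?)\"(?!\\s)>(?!\\s)(.+?)</a>";

    public static bool Matches(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    public static void Main()
    {
        Console.WriteLine(Matches("<a href=\"x\">xx</a>"));     // True
        Console.WriteLine(Matches("<a   href=\"x\" >xx</a>"));  // False: extra spaces after "<a"
        Console.WriteLine(Matches("<a href=\"x\"   >xx</a>"));  // False: spaces before ">"
    }
}
```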

edited Dec 20 '12 at 15:30

answered Dec 20 '12 at 15:25

Blachshma


score 0 · Answer 3 · answered Dec 20 '12 at 15:25


The symbol you want is +. Placed after a space, it matches one or more spaces; * after a space matches zero or more.

<a +href=\"(.+?)\" *>(.+?)</a>

However, parsing HTML via regular expressions is generally a bad idea.
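One difference from a `\s`-based pattern is that a literal space followed by `+` or `*` only covers the space character, not tabs or line breaks. The sketch below compares the pattern from this answer with an assumed `\s` variant; `AnyWhitespace`, `SpaceVsWhitespace`, and the tab-separated sample are illustrative additions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical comparison of a literal-space quantifier with \s.
public static class SpaceVsWhitespace
{
    // Pattern from this answer: literal space with + and *.
    public const string SpacesOnly = "<a +href=\"(.+?)\" *>(.+?)</a>";

    // Assumed variant for comparison: \s also matches tabs and newlines.
    public const string AnyWhitespace = "<a\\s+href=\"(.+?)\"\\s*>(.+?)</a>";

    public static void Main()
    {
        string spaced = "<a   href=\"x\" >xx</a>";
        string tabbed = "<a\thref=\"x\">xx</a>";

        Console.WriteLine(Regex.IsMatch(spaced, SpacesOnly));    // True
        Console.WriteLine(Regex.IsMatch(tabbed, SpacesOnly));    // False: a tab is not a space
        Console.WriteLine(Regex.IsMatch(tabbed, AnyWhitespace)); // True
    }
}
```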


Bobson


score 0 · Answer 4 · answered Dec 20 '12 at 15:25


This is a little funky and probably not the best, but here it goes:

string.Join(" ", s.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries))

edit: (I know it's not regex)
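For reference, here is a small sketch of what the split-and-join line produces for the question's inputs. `SpaceCollapser` and `Collapse` are hypothetical names wrapped around the one-liner. It collapses each run of spaces to a single space, but a single space before `>` is left in place, so the question's original pattern would still need adjusting. Spaces inside the link text are collapsed too.

```csharp
using System;

// Hypothetical wrapper around the one-liner from this answer.
public static class SpaceCollapser
{
    // Splits on spaces, drops empty pieces, and rejoins with single spaces.
    // Leading and trailing spaces disappear as a side effect.
    public static string Collapse(string s)
    {
        return string.Join(" ", s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    }

    public static void Main()
    {
        Console.WriteLine(Collapse("<a   href=\"x\" >xx</a>"));  // <a href="x" >xx</a>
        Console.WriteLine(Collapse("<a href=\"x\"   >xx</a>"));  // <a href="x" >xx</a>
    }
}
```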


snurre

