Regex to extract URLs from a string fails when the string contains multiple double quotes

Question

I am using a regex to extract URLs from a string, and it mostly works:

var regex=new Regex("<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")",RegexOptions.IgnoreCase);

The following strings work fine:

"This is Test page <a href='test.aspx'>test page</a>"
"This is Test page <a href='test1.aspx'>test</a> another one <a href='test2.aspx'>test</a>"
"This is Tests\"s page <a href='test1.aspx'>test</a> another one <a href='test2.aspx'>test</a>"
"This is Test page"
"This is Test page\"s without problem"

But sometimes it does not return the correct result. The following code returns a bad result (the string contains two double quotes):

var inputString = "This string create \"problem\" for me";
var regex = new Regex("<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")", RegexOptions.IgnoreCase);
var urls = regex.Matches(inputString).OfType<Match>().Select(m => m.Groups["href"].Value);
// Prints "problem", although the input contains no href at all.
foreach (var url in urls) {
  Console.WriteLine(url);
}

Demo with problem

Could anyone help me to solve this problem?

Should be using DOM parser, not regex, to get href from anchors — Drakes, Jun 05 '15 at 05:39
I'm not sure what you have a problem with - regex should *mostly* work to parse HTML. Make sure to carefully read the actual answers in http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - it will help you build plausible regular expressions... — Alexei Levenkov, Jun 05 '15 at 05:45
@Alexei Levenkov: When you run the code with the problematic string (i.e. a string with double quotes), it does not return the correct result. (In the above example there is no "href" in the string, but the code returns the text "problem".) — Ishan Jain, Jun 05 '15 at 05:51
Yes, you should expect that there are plenty of versions of valid HTML that will not be handled by basically any regular expression you can come up with. Assuming you can't use an existing HTML parser, you should come up with all the cases you care about and start carefully testing multiple regular expressions. You should be asking *specific* questions (like "detect 3 double quotes in a row"), as "parse HTML with regex" is generally too broad/duplicate. — Alexei Levenkov, Jun 05 '15 at 05:59
Have you considered using an HTML parser? If so, [see here](http://stackoverflow.com/a/30630282/3832970). You will need to remove the constraints on extensions, and it will collect all links and even more. — Wiktor Stribiżew, Jun 05 '15 at 06:04
HTML parsing in C#: http://www.codeproject.com/Tips/804660/How-to-Parse-HTML-using-Csharp — Drakes, Jun 05 '15 at 06:04
possible duplicate of [c# regular expression for finding links in <a> with specific ending](http://stackoverflow.com/questions/30629793/c-sharp-regular-expression-for-finding-links-in-a-with-specific-ending) — Drakes, Jun 05 '15 at 06:05
@Drakes: I don't think my question is a duplicate (you can see I also updated my question title). I don't want to use any external HTML parser library because my senior prefers not to. So I only want to ask here whether a solution exists with regex. — Ishan Jain, Jun 05 '15 at 06:16
@Drakes: Yes, you are right about using an HTML parser, because it is safe when working with HTML elements. But really, I don't want to use an external library if possible. — Ishan Jain, Jun 05 '15 at 06:22

Carey Tzou · Answer 1 · 2015-06-05T06:10:57.043

1

Maybe you can change your regex like this:

<a .*?href=(?:['"](?<href>[^'"]*?)['"])

In C#:

"<a .*?href=(?:['\"](?<href>[^'\"]*?)['\"])"

edited Jun 05 '15 at 06:10

answered Jun 05 '15 at 05:45

Carey Tzou


This will fail with `"text 'text' and more text"`. – Wiktor Stribiżew Jun 05 '15 at 06:00
[Do not use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Drakes Jun 05 '15 at 06:06
you can list the text which is possible – Carey Tzou Jun 05 '15 at 06:10
Can you list the possible inputs? @Drakes Nobody said this is an HTML string. – Carey Tzou Jun 05 '15 at 06:16
Oh my, we should call Sir Tim and let him know that this isn't an HTML fragment after all: "This is Test page <a href='test.aspx'>test page</a>" – Drakes Jun 05 '15 at 07:05

score 0 · Answer 2 · edited May 23 '17 at 12:05

Solution:

You should use an HTML parser to avoid current and future headaches. A tested and working example can be found here.

Regex explanation:

As for your regex, it currently fails because the alternation is not enclosed in a group. The top-level | makes the whole second branch, \"(?<href>.*?)\", an alternative to everything before it, so the regex can return strings that have no <a ... href inside them. There are also other issues you may run into with your current regex.
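
To make the alternation issue concrete, here is a minimal sketch (written with C# 9 top-level statements, which is an assumption about the compiler in use) that runs the pattern from the question next to a variant where both quoted forms sit inside one group. The names ungrouped, grouped and Hrefs are only illustrative, and the grouped variant fixes nothing except the alternation.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Pattern from the question: the top-level | splits it into
//   <a [^>]*href='...'     OR     "..."
const string ungrouped = "<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")";

// Illustrative variant: both quoted forms are inside one group,
// so the "<a ... href=" prefix is required by either branch.
const string grouped = "<a [^>]*href=(?:'(?<href>.*?)'|\"(?<href>.*?)\")";

// Hypothetical helper: collects the "href" group of every match.
string[] Hrefs(string pattern, string text) =>
    Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
         .Cast<Match>()
         .Select(m => m.Groups["href"].Value)
         .ToArray();

var problemInput = "This string create \"problem\" for me";

Console.WriteLine(string.Join(", ", Hrefs(ungrouped, problemInput))); // prints: problem
Console.WriteLine(Hrefs(grouped, problemInput).Length);               // prints: 0
```

With the grouped variant, the first sample string from the question still yields test.aspx, because the single-quoted branch is unchanged.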

A "fixed" regex (meaning it will be capable of handling escaped entities and both double and single quotes) would look like:

(?i)<a\b[^<]*href=(?:(?:'(?<href>[^'\\]*(?:\\.[^'\\]*)*)')|(?:\"(?<href>[^\"\\]*(?:\\.[^\"\\]*)*)\"))
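
As a rough usage sketch, a pattern of this shape could be applied as follows. It is written as a C# verbatim string, so "" stands for one double quote, and the double-quoted branch here uses the character class [^"\\] so that it stops at the closing double quote. The helper name ExtractHrefs and the sample strings are assumptions for illustration, and C# 9 top-level statements are assumed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Verbatim string: backslashes are literal, "" is a single double quote.
const string hrefPattern =
    @"(?i)<a\b[^<]*href=(?:(?:'(?<href>[^'\\]*(?:\\.[^'\\]*)*)')|(?:""(?<href>[^""\\]*(?:\\.[^""\\]*)*)""))";

// Hypothetical helper: returns the href value of every match in the text.
string[] ExtractHrefs(string text) =>
    Regex.Matches(text, hrefPattern)
         .Cast<Match>()
         .Select(m => m.Groups["href"].Value)
         .ToArray();

var samples = new[]
{
    // One single-quoted and one double-quoted href, plus a stray double quote.
    "This is Tests\"s page <a href='test1.aspx'>test</a> another one <a href=\"test2.aspx\">test</a>",
    // The problematic input from the question: quotes but no anchor.
    "This string create \"problem\" for me",
};

foreach (var sample in samples)
{
    var hrefs = ExtractHrefs(sample);
    Console.WriteLine(hrefs.Length == 0 ? "(no links)" : string.Join(", ", hrefs));
}
// prints: test1.aspx, test2.aspx
// prints: (no links)
```

This only covers quoted href values; unquoted attributes, comments, and other valid HTML forms are not handled.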

But it is unlikely you can fully rely on regex when parsing HTML. Use the solution, not a workaround.
