
I am using a regex to extract URLs from a string, and it mostly works:

var regex=new Regex("<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")",RegexOptions.IgnoreCase);

The following strings work fine:

"This is Test page <a href='test.aspx'>test page</a>"
"This is Test page <a href='test1.aspx'>test</a> another one <a href='test2.aspx'>test</a>"
"This is Tests\"s page <a href='test1.aspx'>test</a> another one <a href='test2.aspx'>test</a>"
"This is Test page"
"This is Test page\"s without problem"

But sometimes it does not return a good result. The following code returns a bad result (the string contains two double quotes):

var inputString = "This string create \"problem\" for me";
var regex = new Regex("<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")", RegexOptions.IgnoreCase);
var urls = regex.Matches(inputString).OfType<Match>().Select(m => m.Groups["href"].Value);
foreach (var url in urls)
{
  Console.WriteLine(url);
}

Demo with problem

Could anyone help me solve this problem?

Ishan Jain
  • You should be using a DOM parser, not regex, to get the href from anchors – Drakes Jun 05 '15 at 05:39
  • I'm not sure what you have a problem with - regex should *mostly* work to parse HTML. Make sure to carefully read the actual answers in http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - it will help you build plausible regular expressions... – Alexei Levenkov Jun 05 '15 at 05:45
  • @Alexei Levenkov: When you run the code with the problematic string (i.e. a string with double quotes), it does not return the correct result. (In the example above the string contains no "href", but the code returns the text "problem".) – Ishan Jain Jun 05 '15 at 05:51
  • @Drakes: Actually I want C# code; that's why I used regex. – Ishan Jain Jun 05 '15 at 05:58
  • Yes, you should expect that there are plenty of versions of valid HTML that will not be handled by basically any regular expression you can come up with. Assuming you can't use an existing HTML parser, you should list all the cases you care about and carefully test multiple regular expressions. You should be asking *specific* questions (like "detect 3 double quotes in a row"), as "parse HTML with regex" is generally too broad/duplicate. – Alexei Levenkov Jun 05 '15 at 05:59
  • Would you consider using an HTML parser? If yes, then [see here](http://stackoverflow.com/a/30630282/3832970). You will need to remove the constraints on extensions, and it will collect all links and even more. – Wiktor Stribiżew Jun 05 '15 at 06:04
  • HTML parsing in C#: http://www.codeproject.com/Tips/804660/How-to-Parse-HTML-using-Csharp – Drakes Jun 05 '15 at 06:04
  • @stribizhev: No, until now I have not used any HTML parser. – Ishan Jain Jun 05 '15 at 06:08
  • @Drakes: I don't think my question is a duplicate (you can see I also updated my question title). I don't want to use an external HTML parser library because my senior does not prefer that. So I only want to ask here whether a solution exists with regex. – Ishan Jain Jun 05 '15 at 06:16
  • @Drakes: Yes, you are right about using an HTML parser, because it is safe when working with HTML elements. But really, I don't want to use an external library if possible. – Ishan Jain Jun 05 '15 at 06:22

2 Answers


Maybe you can change your regex like this:

<a .*?href=(?:['"](?<href>[^'"]*?)['"])

In C#:

"<a .*?href=(?:['\"](?<href>[^'\"]*?)['\"])"

Carey Tzou

Solution:

You should use an HTML parser to get rid of current and future headaches. A tested, working example can be found here.

Regex explanation:

As for your regex, it currently fails because the alternation is not enclosed in a group. The second alternative therefore matches on its own, so the regex can return strings that have no <a... href in them. Your current regex can also run into other issues.
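To make the grouping problem concrete, here is a minimal sketch that runs the pattern from the question next to a variant where both quote styles sit inside one group after href=. The class name HrefDemo, the method ExtractHrefs and the Grouped pattern are illustrative assumptions, not part of the original code, and this is not a complete HTML-safe solution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HrefDemo
{
    // Pattern from the question: the top-level | splits the whole expression,
    // so the right-hand branch matches any double-quoted text on its own.
    public const string Ungrouped =
        "<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")";

    // Assumed correction: the alternation is confined to one group after href=.
    public const string Grouped =
        "<a [^>]*href=(?:'(?<href>.*?)'|\"(?<href>.*?)\")";

    // Hypothetical helper: returns the value of the "href" group for every match.
    public static List<string> ExtractHrefs(string pattern, string input)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Groups["href"].Value)
                    .ToList();
    }

    public static void Main()
    {
        string plain = "This string create \"problem\" for me";
        string links = "See <a href=\"test.aspx\">one</a> and <a class='x' href='test2.aspx'>two</a>";

        // Ungrouped: the quoted word is reported as if it were a URL.
        Console.WriteLine(string.Join(", ", ExtractHrefs(Ungrouped, plain))); // problem

        // Grouped: no anchor tag in the input, so nothing is returned.
        Console.WriteLine(ExtractHrefs(Grouped, plain).Count); // 0

        // Grouped: both quote styles are still picked up inside real anchors.
        Console.WriteLine(string.Join(", ", ExtractHrefs(Grouped, links))); // test.aspx, test2.aspx
    }
}
```

The only difference between the two patterns is where the group closes, which is why the quoted word stops being reported once the | is scoped to the attribute value.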

A "fixed" regex (meaning it will be capable of handling escaped entities and both double and single quotes) would look like:

(?i)<a\b[^<]*href=(?:'(?<href>[^'\\]*(?:\\.[^'\\]*)*)'|\"(?<href>[^\"\\]*(?:\\.[^\"\\]*)*)\")
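As a rough illustration of how such a pattern could be used from C#, the sketch below writes it as a verbatim string (where "" stands for a literal double quote) in a form where each branch excludes its own quote character. The class name EscapedHrefDemo and the sample inputs are assumptions made for this example only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapedHrefDemo
{
    // Verbatim string: backslashes reach the regex engine unchanged, "" is one double quote.
    // Each branch accepts any run of non-quote, non-backslash characters, plus backslash escapes.
    public const string Pattern =
        @"(?i)<a\b[^<]*href=(?:'(?<href>[^'\\]*(?:\\.[^'\\]*)*)'|""(?<href>[^""\\]*(?:\\.[^""\\]*)*)"")";

    // Hypothetical helper: collects the "href" group of every match.
    public static List<string> ExtractHrefs(string input)
    {
        return Regex.Matches(input, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["href"].Value)
                    .ToList();
    }

    public static void Main()
    {
        // A backslash-escaped single quote inside a single-quoted value is kept in the capture.
        Console.WriteLine(ExtractHrefs("<a href='it\\'s.aspx'>x</a>")[0]); // it\'s.aspx

        // Quoted text after the anchor does not leak into the captured value.
        Console.WriteLine(ExtractHrefs("<a href=\"a.aspx\">x</a> \"q\"")[0]); // a.aspx

        // Plain text with quotes but no anchor yields no matches.
        Console.WriteLine(ExtractHrefs("This string create \"problem\" for me").Count); // 0
    }
}
```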

But it is unlikely you can fully rely on regex when parsing HTML. Use the solution, not a workaround.

Community
Wiktor Stribiżew