c# Regex problems

Question

I'm working on this regex:

string addressstart = Regex.Escape("<a href=\"/url?q=");
string addressend = Regex.Escape("&amp");
string regAdd = addressstart + @"(.*?)" + addressend;

I'd like it to give me the URL from this HTML:

<a href="/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw">

So it should return "https://www.google.com/".

Any ideas why it isn't working? Thanks!

Does my answer below help? – Wiktor Stribiżew Mar 13 '17 at 21:40

score 2 · Accepted Answer · edited May 28 '20 at 16:11

The following regex worked for me. Make sure that you select group 1, since group 0 is always the full match.

@"<a href=\"\/url\?q=(.*?)&amp"
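To illustrate the point about groups, here is a minimal sketch that builds the pattern the same way the question does and prints both group 0 and group 1. The class and method names (`GroupDemo`, `ExtractUrl`) are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Hypothetical helper: returns the text captured by group 1,
    // or an empty string when the pattern does not match.
    public static string ExtractUrl(string html)
    {
        string addressStart = Regex.Escape("<a href=\"/url?q=");
        string addressEnd = Regex.Escape("&amp");
        Match m = Regex.Match(html, addressStart + "(.*?)" + addressEnd);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string html = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
        string pattern = Regex.Escape("<a href=\"/url?q=") + "(.*?)" + Regex.Escape("&amp");
        Match m = Regex.Match(html, pattern);

        // Group 0 is the whole match, including the prefix and the trailing &amp
        Console.WriteLine(m.Groups[0].Value); // <a href="/url?q=https://www.google.com/&amp
        // Group 1 is only what the parentheses captured
        Console.WriteLine(m.Groups[1].Value); // https://www.google.com/
    }
}
```

Reading `Match.Value` is the same as reading group 0, which may be why the result looks wrong even when the pattern itself matches.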

edited May 28 '20 at 16:11

Jeremy Caney

answered Mar 13 '17 at 00:35

Braedon Wooding

Thanks! When I try to use this, though, I get errors because of the quotation marks. Any idea why this could be? – Darth123 Mar 13 '17 at 00:47
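A likely cause of the quotation-mark errors: in a C# verbatim string (`@"..."`), a backslash does not escape a quote, so `\"` ends the literal early. A quote inside a verbatim string is written as `""`. The sketch below shows the same pattern written both ways. The class name `QuoteDemo` is made up for this example, and the `\/` from the answer is written as a plain `/`, since .NET does not require the slash to be escaped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteDemo
{
    // Verbatim literal: the quote is doubled; the backslash before ? is passed to the regex engine unchanged.
    public const string VerbatimPattern = @"<a href=""/url\?q=(.*?)&amp";

    // Regular literal: the quote is written as \" and the regex backslash is doubled.
    public const string RegularPattern = "<a href=\"/url\\?q=(.*?)&amp";

    public static void Main()
    {
        string html = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";

        // Both literals produce the same pattern text
        Console.WriteLine(VerbatimPattern == RegularPattern); // True
        Console.WriteLine(Regex.Match(html, VerbatimPattern).Groups[1].Value); // https://www.google.com/
    }
}
```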

score 1 · Answer 2 · edited May 23 '17 at 12:17

It appears you are looking for the Google URL inside your string. You might find the following pattern useful, as it will match it:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}

Note that this is a small tweak of the general regex found at: What is a good regular expression to match a URL?

Edit: See the code below for how to apply this regex and find the value you are looking for:

string input = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
var regex = new Regex(@"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}");
var output = regex.Match(input).Value; // https://www.google.com

If you are only matching the full thing, then using `https?:\/\/www\.[-a-zA-Z0-9@:%._\+~#=]{2,256}` and choosing group 0 will work too. — Braedon Wooding, Mar 13 '17 at 01:25

score 1 · Answer 3 · answered Mar 13 '17 at 08:25

The problem is in the "<a href=\"/url?q=" part of the regular expression. The ? is not escaped, so it makes the preceding l optional. Hence that part of the regular expression matches either <a href="/urlq= or <a href="/urq=. Neither includes the ? character.
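The effect of an unescaped ? can be checked with a short sketch. It compares the raw prefix used as a pattern with the same prefix passed through `Regex.Escape`, which is what the question's code does and which escapes the ? character. The class and method names here are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    public const string Prefix = "<a href=\"/url?q=";

    // Uses the prefix as a pattern as-is, so ? acts as a quantifier on the l
    public static bool MatchesRaw(string input)
    {
        return Regex.IsMatch(input, Prefix);
    }

    // Escapes the prefix first, so ? is matched literally
    public static bool MatchesEscaped(string input)
    {
        return Regex.IsMatch(input, Regex.Escape(Prefix));
    }

    public static void Main()
    {
        Console.WriteLine(MatchesRaw("<a href=\"/url?q="));     // False
        Console.WriteLine(MatchesRaw("<a href=\"/urlq="));      // True
        Console.WriteLine(MatchesRaw("<a href=\"/urq="));       // True
        Console.WriteLine(MatchesEscaped("<a href=\"/url?q=")); // True
    }
}
```

So this diagnosis applies when the prefix is typed directly into the pattern; a prefix that has gone through `Regex.Escape` already matches the literal ?.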

answered Mar 13 '17 at 08:25

AdrianHHH

score 0 · Answer 4 · answered Mar 13 '17 at 07:49

When parsing HTML, you should consider using an HTML parser, like HtmlAgilityPack, and apply the regex to the plain text only after getting the necessary node.

If you want to debug your own code, here is a fix:

using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var s = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
        var pattern = @"<a href=""/url\?q=(.*?)&amp;";
        var result = Regex.Match(s, pattern);
        if (result.Success)
            Console.WriteLine(result.Groups[1].Value);
    }
}

See a DotNetFiddle demo.

Here is an example of how you may extract all <a> href attribute values that start with /url?q= with HtmlAgilityPack. Install it via Solution > Manage NuGet Packages for Solution... and use

// Requires: using System; using System.Collections.Generic; and the HtmlAgilityPack NuGet package
public List<string> HapGetHrefs(string html)
{
    var hrefs = new List<string>();
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download the page and parse it
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string of markup: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // Select every <a> element whose href starts with /url?q=
    var nodes = hap.DocumentNode.SelectNodes("//a[starts-with(@href, '/url?q=')]");
    if (nodes != null) // SelectNodes returns null when nothing matches
    {
        foreach (var node in nodes)
        {
            foreach (var attribute in node.Attributes)
            {
                if (attribute.Name == "href")
                {
                    hrefs.Add(attribute.Value);
                }
            }
        }
    }
    return hrefs;
}

Then all you need is to apply a simpler regex or a couple of simple string operations.
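As one possible reading of "a couple of simple string operations", the sketch below strips the /url?q= prefix from an href value and cuts the rest at the first &. The names `HrefTarget` and `Extract` are made up for this example, and it assumes the target URL itself contains no & before the first parameter separator.

```csharp
using System;

public static class HrefTarget
{
    // Hypothetical helper: returns the part of the href between "/url?q=" and the first '&',
    // or an empty string when the href does not start with that prefix.
    public static string Extract(string href)
    {
        const string prefix = "/url?q=";
        if (!href.StartsWith(prefix, StringComparison.Ordinal))
        {
            return string.Empty;
        }
        string rest = href.Substring(prefix.Length);
        int amp = rest.IndexOf('&');
        return amp >= 0 ? rest.Substring(0, amp) : rest;
    }

    public static void Main()
    {
        string href = "/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0";
        Console.WriteLine(Extract(href)); // https://www.google.com/
    }
}
```

Cutting at '&' works whether the separator appears as a raw & or as the entity &amp;, since both begin with the same character.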

score 0 · Answer 5 · edited May 29 '20 at 08:10

You can use:

(?<=a href="\/url\?q=)[^&]+
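A minimal sketch of how this pattern might be applied in C#. Because the prefix sits inside a lookbehind, it is not part of the match, so `Match.Value` is the URL itself and no capture group is needed. The class and method names are made up for this example, and the `\/` is written as a plain `/`, which .NET treats the same way.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Hypothetical helper: returns the text following a href="/url?q= up to the first '&',
    // or an empty string when nothing matches.
    public static string ExtractUrl(string html)
    {
        Match m = Regex.Match(html, @"(?<=a href=""/url\?q=)[^&]+");
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string html = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
        Console.WriteLine(ExtractUrl(html)); // https://www.google.com/
    }
}
```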

edited May 29 '20 at 08:10

answered May 28 '20 at 13:26

Omid Kashfi

What are the benefits to this approach over the accepted answer from two years ago? – Jeremy Caney May 28 '20 at 15:39
