1

I have this regex that I'm working on:

string addressstart = Regex.Escape("<a href=\"/url?q=");
string addressend = Regex.Escape("&amp");
string regAdd = addressstart + @"(.*?)" + addressend;

I'd like it to give me the URL from this HTML:

<a href="/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw">

so it should return "https://www.google.com/"

Any ideas why it isn't working? Thanks!

Darth123
  • 27
  • 7

5 Answers

2

The following regex worked for me. Make sure that you select group 1, since group 0 is always the full match.

@"<a href=\"\/url\?q=(.*?)&amp"
Jeremy Caney
  • 7,102
  • 69
  • 48
  • 77
  • Thanks! When I try to use this though I get errors because of the quotation marks. Any Idea why this could be? – Darth123 Mar 13 '17 at 00:47
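On the quotation-mark errors: in a C# verbatim string (`@"..."`), a backslash does not escape a double quote; the quote has to be doubled instead. Below is a minimal sketch of the same pattern written that way, reading capture group 1 as suggested above. The class and method names (`Group1Demo`, `ExtractUrl`) are made up for illustration, and the input is the sample HTML from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Group1Demo
{
    // Hypothetical helper: returns capture group 1, or an empty string when nothing matches.
    public static string ExtractUrl(string html)
    {
        // Inside @"..." a literal double quote is written as "".
        // The ? after "url" is escaped so that it matches a literal question mark.
        var match = Regex.Match(html, @"<a href=""/url\?q=(.*?)&amp");
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        var html = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
        Console.WriteLine(ExtractUrl(html)); // https://www.google.com/
    }
}
```

Group 0 (`match.Value`) would include the surrounding `<a href="/url?q=` and `&amp` text, which is why group 1 is read here.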
1

It appears you are looking for the Google URL inside your string. You might find the following pattern useful, which will match it:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}

Note that this is a small tweak of the general regex found at: What is a good regular expression to match a URL?

Edit: Please see the code below, which applies this regex to find the value you are looking for:

string input = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
var regex = new Regex(@"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}");
var output = regex.Match(input).Value; // https://www.google.com
Community
  • 1
  • 1
StfBln
  • 1,137
  • 6
  • 11
  • If you are only matching the full thing then doing `https?:\/\/www\.[-a-zA-Z0-9@:%._\+~#=]{2,256}` and choosing the group 0, will work too. – Braedon Wooding Mar 13 '17 at 01:25
1

The problem is in the "<a href=\"/url?q=" part of the regular expression. The ? is not escaped, so it makes the preceding l optional. Hence that part of the regular expression matches either <a href="/urlq= or <a href="/urq=. Neither includes the ? character.
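The effect of the unescaped ? can be checked with a small sketch. It assumes a shortened version of the question's HTML, and `EscapeDemo` is an illustrative name. Note that `Regex.Escape` does escape the ?, so this behaviour only shows up when the literal is used as a pattern without escaping.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    public static void Main()
    {
        // Shortened sample input, based on the HTML in the question.
        var input = "<a href=\"/url?q=https://www.google.com/&amp;sa=U\">";

        // Unescaped: "l?" means an optional 'l', so the literal '?' in the input is never matched.
        Console.WriteLine(Regex.IsMatch(input, "<a href=\"/url?q=")); // False

        // Escaped by hand: "\?" matches a literal question mark.
        Console.WriteLine(Regex.IsMatch(input, "<a href=\"/url\\?q=")); // True

        // Regex.Escape produces an escaped form of the literal automatically.
        Console.WriteLine(Regex.IsMatch(input, Regex.Escape("<a href=\"/url?q="))); // True
    }
}
```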

AdrianHHH
  • 13,492
  • 16
  • 50
  • 87
0

When parsing HTML, you should consider using some HTML parser, like HtmlAgilityPack, and only after getting the necessary node, apply the regex on the plain text.

If you want to debug your own code, here is a fix:

using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var s = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
        var pattern = @"<a href=""/url\?q=(.*?)&amp;";
        var result = Regex.Match(s, pattern);
        if (result.Success)
            Console.WriteLine(result.Groups[1].Value);
    }
}

See a DotNetFiddle demo.

Here is an example of how you may extract all <a> href attribute values that start with /url?q= with HtmlAgilityPack. Install it via Solution > Manage NuGet Packages for Solution... and use

public List<string> HapGetHrefs(string html)
{
    var hrefs = new List<string>();
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && uriResult.Scheme == Uri.UriSchemeHttp)
    { // html is a URL 
        var doc = new HtmlAgilityPack.HtmlWeb();
        hap = doc.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    var nodes = hap.DocumentNode.SelectNodes("//a[starts-with(@href, '/url?q=')]");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            foreach (var attribute in node.Attributes)
            {
                if (attribute.Name == "href")
                {
                    hrefs.Add(attribute.Value);
                }
            }
        }
    }
    return hrefs;
}

Then, all you need is to apply a simpler regex or a couple of simple string operations.
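As a rough sketch of the "couple of string operations" route, the helper below strips the /url?q= prefix and cuts the rest at the first &. The names `HrefHelper` and `ExtractTarget` are hypothetical, and it assumes the target URL itself contains no unencoded & character.

```csharp
using System;

public static class HrefHelper
{
    // Hypothetical helper: extracts the target URL from an href such as
    // "/url?q=https://www.google.com/&amp;sa=U".
    public static string ExtractTarget(string href)
    {
        const string prefix = "/url?q=";
        if (!href.StartsWith(prefix, StringComparison.Ordinal))
            return href; // not a redirect-style link, return it unchanged

        var rest = href.Substring(prefix.Length);

        // Cut at the first '&', which covers both "&amp;" and a bare "&".
        var end = rest.IndexOf('&');
        return end >= 0 ? rest.Substring(0, end) : rest;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractTarget("/url?q=https://www.google.com/&amp;sa=U&amp;ved=0")); // https://www.google.com/
    }
}
```

Cutting at the first & means it does not matter whether the attribute value arrives with the entity still encoded (&amp;) or already decoded (&).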

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
0

You can use:

(?<=a href="\/url\?q=)[^&]+
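A minimal sketch of using this pattern from C#, assuming the sample HTML from the question (`LookbehindDemo` and `ExtractUrl` are illustrative names). Because the prefix sits in a lookbehind, it is not part of the match, so the whole match can be read directly and no capture group is needed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Hypothetical helper: returns the text after a href="/url?q= up to the first '&'.
    public static string ExtractUrl(string html)
    {
        // In a verbatim string the double quote is doubled; the regex itself is unchanged.
        var match = Regex.Match(html, @"(?<=a href=""\/url\?q=)[^&]+");
        return match.Success ? match.Value : string.Empty;
    }

    public static void Main()
    {
        var html = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
        Console.WriteLine(ExtractUrl(html)); // https://www.google.com/
    }
}
```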