
How can I retrieve all URLs from all hrefs? I don't want to use HTML Agility Pack or a similar library - it must be clean code and very short.

    using System;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    class Program
    {
        HttpClient client = new HttpClient();

        static async Task Main(string[] args)
        {
            Program program = new Program();
            await program.GetTodoItems();
            Console.WriteLine("Hello World!");
        }

        private async Task GetTodoItems()
        {
            string ResponseHtml = await client.GetStringAsync("https://example.com");

            // Match anything starting with http://, https:// or www. up to the next whitespace
            var LinkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
            foreach (Match m in LinkParser.Matches(ResponseHtml))
            {
                Console.WriteLine(m.Value);
            }
        }
    }

I expect clean URLs, without duplicates and only for the website itself, not for scripts. This code shows me some links with extra tags and characters, like these:

https://example.com/libs/jquery/1.11.2/jquery.min.js">

https://www.google-analytics.com/analytics.js','ga

    You say you want to retrieve all urls, but your regex only matches strings starting with http: https: and www. This does not cover all urls you may encounter in an href – iakobski Oct 27 '19 at 08:54
  • 3
    Please don't make more work for others by vandalizing your posts. By posting on the Stack Exchange (SE) network, you've granted a non-revocable right, under a [CC BY-SA license](//creativecommons.org/licenses/by-sa/4.0), for SE to distribute the content (i.e. regardless of your future choices). By SE policy, the non-vandalized version is distributed. Thus, any vandalism will be reverted. Please see: [How does deleting work? …](//meta.stackexchange.com/q/5221). If permitted to delete, there's a "delete" button below the post, on the left, but it's only in browsers, not the mobile app. – Makyen Jan 26 '20 at 19:55
  • locked without definitive duration, since previous locks expired. Please don't play cat and mouse with moderators. – Jean-François Fabre Jun 05 '20 at 08:48

1 Answer


Put a named capturing group around the "one or more non-whitespace characters" part of your pattern, and require a closing quote right after it:

LinkParser = new Regex(@"\b(?<url>https?://\S+)['""]", RegexOptions.Compiled | RegexOptions.IgnoreCase);

Then access the captured group on each match with

m.Groups["url"].Value
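For concreteness, here is a minimal sketch of how this pattern and the named group could slot into the question's GetTodoItems method; the URL is just the placeholder from the question.

    private async Task GetTodoItems()
    {
        string ResponseHtml = await client.GetStringAsync("https://example.com");

        // The "url" group stops before the closing quote, so the quote and anything after it are dropped
        var LinkParser = new Regex(@"\b(?<url>https?://\S+)['""]", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        foreach (Match m in LinkParser.Matches(ResponseHtml))
        {
            Console.WriteLine(m.Groups["url"].Value);
        }
    }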

A simpler pattern might also work well: \b(?<url>http.*?)['"]

These patterns are very primitive and I wouldn't guarantee they work in all cases. If you have URLs that aren't quoted at all, consider adding whitespace and the closing angle bracket to the character class at the end. You'd be better off using a reliable library for this, because regular expressions are a poor fit for parsing HTML in general.
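If you do stick with the regex approach, the question also asked for no duplicates and no script URLs. A rough sketch of one way to handle that is below; the HashSet takes care of duplicates, and the extension check is only an illustrative assumption about what counts as a "script" link.

    // Sketch only: de-duplicate and skip URLs that look like script/style assets
    // (needs using System.Collections.Generic;)
    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (Match m in LinkParser.Matches(ResponseHtml))
    {
        string url = m.Groups["url"].Value;
        if (url.EndsWith(".js", StringComparison.OrdinalIgnoreCase) ||
            url.EndsWith(".css", StringComparison.OrdinalIgnoreCase))
            continue; // skip obvious asset links
        if (seen.Add(url)) // Add returns false if the URL was already seen
            Console.WriteLine(url);
    }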

  • This does not solve the OP's problem; it still captures anything following the url up to the first whitespace. – iakobski Oct 27 '19 at 09:05