I'm writing a very basic program that first goes through the HTML of a website to find all RSS links, then puts those links into an array and parses the content of each link into an existing XML file.

However, I'm still learning C# and I'm not yet familiar with all the classes. I have done all of this in PHP by writing my own class with file_get_contents() and by using cURL to do the work, and I managed to do it in Java as well. Now I'm trying to get the same result in C#, but I think I'm doing something wrong here.

TL;DR: What's the best way to write a regex to find all RSS links on a website?

So far, my code looks like this:

    private List<string> getRSSLinks(string websiteUrl)
    {
        List<string> links = new List<string>();
        MatchCollection collection = Regex.Matches(websiteUrl, @"(<link.*?>.*?</link>)", RegexOptions.Singleline);

        foreach (Match singleMatch in collection)
        {
            string text = singleMatch.Groups[1].Value;
            Match matchRSSLink = Regex.Match(text, @"type=\""(application/rss+xml)\""", RegexOptions.Singleline);
            if (matchRSSLink.Success)
            {
                links.Add(text);
            }
        }

        return links;
    }
Nikkster

1 Answer

Don't use regex to parse HTML; use an HTML parser instead. See this link for the explanation.

I prefer HtmlAgilityPack for parsing HTML:

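    // Requires: using System.Linq; using System.Net; plus the HtmlAgilityPack package.
    using (var client = new WebClient())
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(client.DownloadString("http://www.xul.fr/en-xml-rss.html"));

        // Keep only <link> elements whose type attribute marks an RSS feed.
        var rssLinks = doc.DocumentNode.Descendants("link")
            .Where(n => n.Attributes["type"] != null && n.Attributes["type"].Value == "application/rss+xml")
            // Skip links without an href so the next step cannot hit a null attribute.
            .Where(n => n.Attributes["href"] != null)
            .Select(n => n.Attributes["href"].Value)
            .ToArray();
    }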
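If you want this in a method shaped like the one in the question, a minimal sketch could look like the one below. The class name `RssLinkFinder`, the method name `GetRssLinks`, and the `pageUri` parameter are made up for illustration. It takes the already downloaded HTML rather than the URL, and it assumes you want relative `href` values resolved against the page address, which the question does not say.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RssLinkFinder
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the absolute
    // URLs of all <link type="application/rss+xml"> elements in the given HTML.
    public static List<string> GetRssLinks(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("link")
            // GetAttributeValue returns the default ("") when the attribute is missing.
            .Where(n => string.Equals(
                n.GetAttributeValue("type", ""),
                "application/rss+xml",
                StringComparison.OrdinalIgnoreCase))
            .Select(n => n.GetAttributeValue("href", ""))
            .Where(href => href.Length > 0)
            // Assumption: relative hrefs should be resolved against the page address.
            .Select(href => new Uri(pageUri, href).AbsoluteUri)
            .ToList();
    }
}
```

You would call it with the string returned by `client.DownloadString(pageUri)` and the same `pageUri`. Note that `new Uri(pageUri, href)` can throw a `UriFormatException` on a malformed `href`, so you may want to guard that step for pages you don't control.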
L.B