-3

I want to download an HTML document from a website and then process it in C# using a regular expression. My goal is to extract everything in the document between the article tags.

The HTML below is only an excerpt; the full document contains many article tags.

I have not had any success so far. Maybe you have an idea.

My attempt so far:

<article(.*)>(<("[^"]*"|'[^']*'|[^'">])*>|(\n)|)*<\/article>

The HTML document looks like this:

<div></div>

<div class="entry-content">
</div>
<div class="kt_archivecontent  "  > 

<article id="post-5555">
<div class="row">
<div class="col-md-12 ">
<div class="post-text-inner">
<div class="/" rel="category tag">Allgemein</a></div><header>
<a >Some Text</h3></a><div class="post-top-meta kt_color_gray">
<span class="postdate kt-post-date updated">
25. Juli 2022</span>
<span class="postauthortop kt-post-author author vcard">
<span class="kt-by-author">by</span><span itemprop="author">
<a href="https://some link" class="fn kt_color_gray" rel="author">
Some text</a>
</span>
</span> 
</div>
</header>
<div class="entry-content">
<p>some Text<a>Read More</a></p>
</div>
<footer>
</footer>
</div>
</div>
</div>
</article>

<div></div>
Martin Brown
TheGameZ
  • 3
    Don't use Regex in the first place. Use an HTML-parsing library like AngleSharp to parse the document and then select elements using XPath, CSS, or element IDs – Panagiotis Kanavos Sep 05 '22 at 12:35
  • 1
    Regex is meant for Regular Expressions and HTML is not regular. – jdweng Sep 05 '22 at 12:40
  • 1
    Required reading https://stackoverflow.com/a/1732454/14868997 You should use something like HtmlAgilityPack to parse it, not Regex – Charlieface Sep 05 '22 at 14:54

2 Answers

0

Assuming your HTML is actually valid XHTML, you could parse it using the XDocument class:

using System.Linq;      // FirstOrDefault
using System.Xml.Linq;  // XDocument, XElement

string html = "<div></div>" +

    "<div class=\"entry-content\">" +
    "</div>" +
    "<div class=\"kt_archivecontent  \"  > " +

    "<article id=\"post-5555\">" +
    "<div class=\"row\">" +
    "<div class=\"col-md-12 \">" +
    "<div class=\"post-text-inner\">" +
    "<div class=\"/\" rel=\"category tag\"><a>Allgemein</a></div><header>" +
    "<a ><h3>Some Text</h3></a><div class=\"post-top-meta kt_color_gray\">" +
    "<span class=\"postdate kt-post-date updated\">" +
    "25. Juli 2022</span>" +
    "<span class=\"postauthortop kt-post-author author vcard\">" +
    "<span class=\"kt-by-author\">by</span><span itemprop=\"author\">" +
    "<a href=\"https://some link\" class=\"fn kt_color_gray\" rel=\"author\">" +
    "Some text</a>" +
    "</span>" +
    "</span> " +
    "</div>" +
    "</header>" +
    "<div class=\"entry-content\">" +
    "<p>some Text<a>Read More</a></p>" +
    "</div>" +
    "<footer>" +
    "</footer>" +
    "</div>" +
    "</div>" +
    "</div>" +
    "</article></div>" +

    "<div></div>";

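// XML allows exactly one root element, so wrap the fragment before parsing.
XDocument doc = XDocument.Parse("<root>" + html + "</root>");
// Take the first <article>; iterate doc.Descendants("article") to get all of them.
XElement article = doc.Descendants("article").FirstOrDefault();
// ToString() returns the element's outer XML, or null if no <article> was found.
string s = article?.ToString();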
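
The XHTML assumption matters here: the excerpt as posted in the question has unbalanced tags (for example a closing `</a>` with no opening `<a>`), and `XDocument.Parse` throws an `XmlException` on input like that. Below is a minimal sketch that collects all articles and falls back to an empty list when parsing fails. The class name `ArticleExtractor`, the helper `TryGetArticles`, and the sample strings are my own assumptions for illustration, not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

internal static class ArticleExtractor
{
    // Hypothetical helper: returns the outer XML of every <article> element,
    // or an empty list if the input is not well-formed XML.
    public static List<string> TryGetArticles(string html)
    {
        try
        {
            // XML allows only one root element, so wrap the fragment first.
            XDocument doc = XDocument.Parse("<root>" + html + "</root>");
            return doc.Descendants("article").Select(a => a.ToString()).ToList();
        }
        catch (XmlException ex)
        {
            Console.WriteLine($"Not well-formed XML: {ex.Message}");
            return new List<string>();
        }
    }

    private static void Main()
    {
        // Made-up, well-formed input with two articles.
        string wellFormed = "<div><article id=\"post-1\"><p>one</p></article>" +
                            "<article id=\"post-2\"><p>two</p></article></div>";

        // Unbalanced like the excerpt in the question: </a> has no opening <a>.
        string malformed = "<article id=\"post-3\"><div>Allgemein</a></div></article>";

        Console.WriteLine(TryGetArticles(wellFormed).Count);
        Console.WriteLine(TryGetArticles(malformed).Count);
    }
}
```

If the real pages are not well-formed, an HTML parser such as the ones suggested in the comments is likely the more robust choice, since it tolerates tag soup that an XML parser rejects.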
mm8
0

As others have said, parsing HTML with regular expressions comes with a health warning because it can go wrong in a number of ways, particularly when the HTML contains comments or nested tags. That said, if you are just knocking up something quick and scrappy and you know the restrictions on how the HTML was generated, it can be an easy way to go.

I suspect the issue you ran into is that the dot ('.') does not match newline characters by default. To make it do so, use the RegexOptions.Singleline option. Also, to stop the pattern from treating multiple articles as one match, make the quantifiers non-greedy (lazy) by adding a '?' after each '*'.

Something like this:

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var testString = @"<div></div>

<div class=""entry-content"">
</div>
<div class=""kt_archivecontent  ""  > 

<article id=""post-5555"">
<div class=""row"">
<div class=""col-md-12 "">
<div class=""post-text-inner"">
<div class=""/"" rel=""category tag"">Allgemein</a></div><header>
<a >Some Text</h3></a><div class=""post-top-meta kt_color_gray"">
<span class=""postdate kt-post-date updated"">
25. Juli 2022</span>
<span class=""postauthortop kt-post-author author vcard"">
<span class=""kt-by-author"">by</span><span itemprop=""author"">
<a href=""https://some link"" class=""fn kt_color_gray"" rel=""author"">
Some text</a>
</span>
</span> 
</div>
</header>
<div class=""entry-content"">
<p>some Text<a>Read More</a></p>
</div>
<footer>
</footer>
</div>
</div>
</div>
</article>

<article id=""post-5556"">
<div class=""row"">
<div class=""col-md-12 "">
<div class=""post-text-inner"">
<div class=""/"" rel=""category tag"">Allgemein</a></div><header>
<a >Some Text</h3></a><div class=""post-top-meta kt_color_gray"">
<span class=""postdate kt-post-date updated"">
25. Juli 2022</span>
<span class=""postauthortop kt-post-author author vcard"">
<span class=""kt-by-author"">by</span><span itemprop=""author"">
<a href=""https://some link"" class=""fn kt_color_gray"" rel=""author"">
Some text</a>
</span>
</span> 
</div>
</header>
<div class=""entry-content"">
<p>some Text<a>Read More</a></p>
</div>
<footer>
</footer>
</div>
</div>
</div>
</article>

<div></div>";

            // Singleline lets '.' match newlines; the lazy '*?' keeps each article a separate match.
            var matches = Regex.Matches(testString, "<article(.*?)>.*?<\\/article>", RegexOptions.Singleline);

            foreach (Match match in matches)
            {
                Console.WriteLine($"{match}\n{new String('=', 80)}");
            }
        }
    }
}
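
To see the effect of the two changes in isolation (the Singleline option and the lazy quantifier), here is a small sketch that counts matches for three variants of the pattern. The two-article input, the class name `GreedyVersusLazy`, and the helper `Count` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

internal static class GreedyVersusLazy
{
    // Made-up input: two articles, each spanning several lines.
    public const string Html =
        "<article id=\"a\">\nfirst\n</article>\n<article id=\"b\">\nsecond\n</article>";

    // Hypothetical helper: number of matches of a pattern against the sample input.
    public static int Count(string pattern, RegexOptions options) =>
        Regex.Matches(Html, pattern, options).Count;

    private static void Main()
    {
        // Without Singleline, '.' stops at '\n', so a multi-line article never matches.
        int noOption = Count("<article.*?>.*?</article>", RegexOptions.None);

        // Singleline but greedy: '.*' runs on to the last </article>, merging both articles.
        int greedy = Count("<article.*>.*</article>", RegexOptions.Singleline);

        // Singleline and lazy: each article is matched on its own.
        int lazy = Count("<article.*?>.*?</article>", RegexOptions.Singleline);

        Console.WriteLine($"no option: {noOption}, greedy: {greedy}, lazy: {lazy}");
        // prints "no option: 0, greedy: 1, lazy: 2"
    }
}
```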

Another way to apply the Singleline option is to put the inline (?s) flag at the start of the pattern, like this:

var matches = Regex.Matches(testString, "(?s)<article(.*?)>.*?<\\/article>");

If you want to help future developers who maintain the code, you can tell the pattern to ignore whitespace (the x option) and add comments, like this:

var matches = Regex.Matches(
            testString,
            @"(?sx)       # Options single line (s) and ignore pattern whitespace (x)
            <             # Match the opening of a tag
            article       # Match the name of the article tag
            (.*?)         # Match the attributes of the article tag if there are any
            >             # Match the end of the opening article tag
            .*?           # Match the minimum number of characters required to create a match
            <\/article>   # Match the closing </article> tag");
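
Since the goal in the question is the content between the tags rather than the tags themselves, a capture group around the body can return just that part. This is only a sketch under the same caveats as above (no nested article elements, no `</article>` inside comments or scripts). The group names `attrs` and `body`, the class name `ArticleBodies`, and the sample input are my own choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

internal static class ArticleBodies
{
    // \b stops the pattern from matching tags such as <articles>;
    // [^>]* keeps the attribute match inside the opening tag;
    // the named group 'body' lazily captures everything up to the next closing tag.
    private static readonly Regex ArticlePattern = new Regex(
        @"<article\b(?<attrs>[^>]*)>(?<body>.*?)</article>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the trimmed inner content of every article.
    public static List<string> GetBodies(string html)
    {
        var bodies = new List<string>();
        foreach (Match match in ArticlePattern.Matches(html))
        {
            bodies.Add(match.Groups["body"].Value.Trim());
        }
        return bodies;
    }

    private static void Main()
    {
        // Made-up input with two multi-line articles.
        string html = "<div>\n<article id=\"post-1\">\n<p>one</p>\n</article>\n" +
                      "<article id=\"post-2\">\n<p>two</p>\n</article>\n</div>";

        foreach (string body in GetBodies(html))
        {
            Console.WriteLine(body);
        }
    }
}
```

Using `[^>]*` instead of `(.*?)` for the attributes is a small design choice: it cannot run past the end of the opening tag, so a failed match backtracks less.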
Martin Brown