I have a big document with the following rigid structure:

<h1>Title 1</h1>
Article text
<h1>Title 2</h1>
Article text
<h1>Title 3</h1>
Article text

My aim is to create a list of lists, each holding a title and the article text that follows it, up to the next title.

I tried:

var parts = Regex.Split(html2, @"(<h1>)").Where(l => l !=string.Empty).ToArray().Select(a => Regex.Split(a, @"(</h1>)")).ToArray();

But the result is not as expected. Any ideas how to split the separate articles and the titles? Thanks!

IDP

2 Answers

Parsing HTML with Regex is a bad idea, as described by this classic answer: https://stackoverflow.com/a/1732454/173322

If you only had one specific tag to extract, a regex could work, but since you have multiple tags with freeform article text in between, I suggest using a parsing library such as AngleSharp or HtmlAgilityPack. It will be faster and more reliable.
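As a rough illustration of the parser route, here is a minimal sketch using HtmlAgilityPack. It assumes the input is a fragment whose <h1> elements sit at the top level, as in the question; the class name HapArticleSplitter and the method name SplitArticles are made up for this example, and the package has to be installed from NuGet.

```csharp
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

// Hypothetical helper, named for illustration only.
public static class HapArticleSplitter
{
    // Returns one inner list per article: [0] = title, [1] = article text.
    // Assumes every <h1> is a top-level node of the parsed fragment.
    public static List<List<string>> SplitArticles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<List<string>>();
        string title = null;
        var body = new StringBuilder();

        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.Name == "h1")
            {
                // A new title closes the previous article, if there was one.
                if (title != null)
                {
                    result.Add(new List<string> { title, body.ToString().Trim() });
                }
                title = node.InnerText.Trim();
                body.Clear();
            }
            else if (title != null)
            {
                // Everything between two titles belongs to the current article.
                body.Append(node.InnerText);
            }
        }

        // Flush the last article.
        if (title != null)
        {
            result.Add(new List<string> { title, body.ToString().Trim() });
        }
        return result;
    }
}
```

Calling HapArticleSplitter.SplitArticles(html2) should give the list of lists asked for in the question. Text that appears before the first <h1> is ignored in this sketch, and InnerText drops any markup inside the article body.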

If you must stick with manual text parsing, I would simply loop through each line, check if it starts with an <h1> tag, classify the lines as Titles or Article Text, then loop through again to strip out the tags from the Titles and pair with the Article text.
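The line-based idea could look roughly like the sketch below. It is a single-pass variant rather than the two loops described above, and it assumes each title sits alone on its own line as <h1>...</h1>, as in the question. The names LineSplitter and SplitArticles are made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class LineSplitter
{
    // Returns one inner list per article: [0] = title, [1] = article text.
    // Assumes every title occupies a whole line of the form <h1>...</h1>.
    public static List<List<string>> SplitArticles(string text)
    {
        var result = new List<List<string>>();
        List<string> current = null;

        string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            bool isTitle = line.StartsWith("<h1>", StringComparison.Ordinal)
                        && line.EndsWith("</h1>", StringComparison.Ordinal);

            if (isTitle)
            {
                // Strip the 4-character opening tag and the 5-character closing tag.
                string title = line.Substring(4, line.Length - 9).Trim();
                current = new List<string> { title, "" };
                result.Add(current);
            }
            else if (current != null && line.Length > 0)
            {
                // Append the line to the article of the most recent title.
                current[1] = current[1].Length == 0 ? line : current[1] + "\n" + line;
            }
        }
        return result;
    }
}
```

Lines that appear before the first title are skipped, and blank lines inside an article are dropped; adjust both if your document needs them.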

Mani Gandham

As mentioned in the comment, you should use an HTML parser. But if you want to try it with plain code, you could split the string, determine whether each piece is a title or an article, and then add the result to a list.

However, keep the following caveat in mind:

NOTE: This code assumes the string (i.e. your document's content) has an equal number of titles and articles.

Here's the code I've made - hosted on dotnetfiddle.com as well:

using System;
using System.Collections.Generic;

// Variables:
string sample = "<h1>Title 1</h1>" + "Article text" + "<h1>Title 2</h1>" + "Article text" + "<h1>Title 3</h1>" + "Article text";

// string.split - by multiple character delimiter
// Credit: https://stackoverflow.com/a/1254596/12511801
string[] arr = sample.Split(new string[]{"</h1>"}, StringSplitOptions.None);

// I store the titles and articles in separate lists - their content will be combined later:
List<string> titles = new List<string>();
List<string> articles = new List<string>();

// Loop over the pieces produced by splitting on "</h1>":
foreach (string s in arr)
{
    if (s.StartsWith("<h1>"))
    {
        titles.Add(s.Split(new string[]{"<h1>"}, StringSplitOptions.None)[1]);
    }
    else
    {
        if (s.Contains("<h1>"))
        {
            // Index 0 is the article and index 1 is the title that follows it:
            string[] pieces = s.Split(new string[]{"<h1>"}, StringSplitOptions.None);
            articles.Add(pieces[0]);
            titles.Add(pieces[1]);
        }
        else
        {
            // No "<h1>" in this piece, so all of it is article text.
            articles.Add(s);
        }
    }
}

// ------------
// Create a list of lists.
// Credit: https://stackoverflow.com/a/12628275/12511801
List<List<string>> myList = new List<List<string>>();
for (int i = 0; i < titles.Count; i++)
{
    myList.Add(new List<string>{"Title: " + titles[i], "Article: " + articles[i]});
}

// Print the results: 
foreach (List<string> subList in myList)
{
    foreach (string item in subList)
    {
        Console.WriteLine(item);
    }
}

Result:

Title: Title 1
Article: Article text
Title: Title 2
Article: Article text
Title: Title 3
Article: Article text