I have a large volume of HTML text, and I want to extract all of the data between every occurrence of <p> and </p>. My code can locate and extract the first occurrence, but I can't get it to loop over the rest.

I have tried a for loop that runs once for each time <p> appears in the text.

I have also tried deleting one occurrence (the tags and the text between <p> and </p>) on each iteration, but that did not work either.

var startTag = $"<p>";
var endTag = $"</p>";
int count = 0;
string ImpureCText = "<p>hello this is the first part</p>fgbtfhsgs <p> this is the second part</p> <p> this is the third part</p>";

int index1 = ImpureCText.IndexOf(startTag);
int index2 = ImpureCText.IndexOf(endTag);
foreach (Match match in Regex.Matches(ImpureCText, startTag))
{
    count++;
}
Console.WriteLine("'{0}'" + " Found " + "{1}" + " Times", startTag, count);

for (int i = 0; i < count; i++)
{
    //Do code stuff
    string delete = ImpureCText.Remove(ImpureCText.IndexOf("<p>"), ImpureCText.IndexOf("</p>"));
    Console.WriteLine(delete);
}

Console.ReadKey();
Simson

1 Answer

Try a regular expression like <p>(.*?)</p>

That said, parsing HTML with regex could be considered bad style.

Example

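// Requires: using System; using System.Text.RegularExpressions;
string ImpureCText = "<p>hello this is the first part</p>fgbtfhsgs <p> this is the second part</p> <p> this is the third part</p>";

// The lazy quantifier (.*?) stops at the nearest </p>, so each paragraph is a separate match.
var matches = Regex.Matches(ImpureCText, "<p>(.*?)</p>");

foreach (Match m in matches)
{
    // m.Value is the whole match, including the <p> and </p> tags.
    Console.WriteLine(m.Value);
}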

prints

<p>hello this is the first part</p>
<p> this is the second part</p>
<p> this is the third part</p>
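
The output above still includes the tags. If only the text between them is wanted, the capture group in the same pattern can be read instead. The following is a minimal sketch under that assumption; the class name `ParagraphExtractor` and the method `ExtractInnerText` are made up for illustration, and `RegexOptions.Singleline` is an addition so that paragraphs spanning several lines are matched as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphExtractor
{
    // Hypothetical helper (not part of the original answer): returns the text
    // inside each <p>...</p> pair, without the tags themselves.
    public static List<string> ExtractInnerText(string html)
    {
        var results = new List<string>();

        // Singleline lets '.' match newlines, so multi-line paragraphs are captured too.
        foreach (Match m in Regex.Matches(html, "<p>(.*?)</p>", RegexOptions.Singleline))
        {
            // Groups[1] is the first parenthesised group, i.e. the text between the tags.
            results.Add(m.Groups[1].Value);
        }

        return results;
    }

    public static void Main()
    {
        string impureCText = "<p>hello this is the first part</p>fgbtfhsgs <p> this is the second part</p> <p> this is the third part</p>";

        foreach (string text in ExtractInnerText(impureCText))
        {
            Console.WriteLine(text);
        }
    }
}
```

Whitespace inside the tags is preserved as-is, so the second and third results from the sample string keep their leading space; call `Trim()` on each value if that is not wanted.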

Edit

The 'bad style' remark refers to the question "RegEx match open tags except XHTML self-contained tags" (thanks to @mjwills for finding it). Despite the funny accepted answer there, regex and HTML can work together successfully, especially when the HTML being parsed is restricted.
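
For input as restricted as the sample string in the question, a plain `IndexOf` loop is another option. The loop in the question does not advance for three reasons: every `IndexOf` call searches from the start of the string, the second argument of `String.Remove` is a character count rather than an end index, and the result of `Remove` is never assigned back. The sketch below is not from the answer; `TagScanner` and `ExtractBetween` are hypothetical names, and it assumes the tags are never nested and carry no attributes.

```csharp
using System;
using System.Collections.Generic;

public static class TagScanner
{
    // Hypothetical helper: collects the text between each startTag/endTag pair
    // by moving the search position forward after every match.
    public static List<string> ExtractBetween(string text, string startTag, string endTag)
    {
        var results = new List<string>();
        int searchFrom = 0;

        while (true)
        {
            // Look for the next opening tag at or after the current position.
            int start = text.IndexOf(startTag, searchFrom, StringComparison.Ordinal);
            if (start < 0) break;

            int contentStart = start + startTag.Length;

            // Look for the closing tag that follows this opening tag.
            int end = text.IndexOf(endTag, contentStart, StringComparison.Ordinal);
            if (end < 0) break;

            results.Add(text.Substring(contentStart, end - contentStart));

            // Continue after the closing tag so the same pair is not found again.
            searchFrom = end + endTag.Length;
        }

        return results;
    }

    public static void Main()
    {
        string impureCText = "<p>hello this is the first part</p>fgbtfhsgs <p> this is the second part</p> <p> this is the third part</p>";

        foreach (string part in ExtractBetween(impureCText, "<p>", "</p>"))
        {
            Console.WriteLine(part);
        }
    }
}
```

Because the original string is never modified, no occurrence count is needed up front; the loop simply ends when no further opening or closing tag is found.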

tymtam
    https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – mjwills Oct 19 '19 at 12:01