How can I read an HTML file a Paragraph at a time?

Question

I reckon it would be something like (pseudocode):

var pars = new List<string>();
string par;
while (not eof("Platypus.html"))
{
    par = getNextParagraph();
    pars.Add(par);
}

...where getNextParagraph() looks for the next "<p>" and continues until it finds "</p>", burning its bridges behind it ("cutting" the paragraph so that it is not found over and over again). Or some such.

Does anybody have insight on how exactly to do this / a better methodology?

UPDATE

I tried to use Aurelien Souchet's code.

I have the following usings:

using HtmlAgilityPack;
using HtmlDocument = System.Windows.Forms.HtmlDocument;

...but this code:

HtmlDocument doc = new HtmlDocument();

is unwanted ("Cannot access private constructor 'HtmlDocument' here")

Also, both "doc.LoadHtml()" and "doc.DocumentNode" give the old "Cannot resolve symbol 'Bla'" err msg

UPDATE 2

Okay, I had to prepend "HtmlAgilityPack." so that the ambiguous reference was disambiguated.
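
An alternative to prefixing every use, sketched here on the assumption that the System.Windows.Forms alias is not needed elsewhere in the file: point the alias at the HtmlAgilityPack class instead. A using alias takes precedence over types brought in by a plain using directive, which is why the original alias made HtmlDocument resolve to the Windows Forms class. AliasDemo is a hypothetical name used only for illustration.

```csharp
using System;
using HtmlAgilityPack;
// The alias makes the bare name HtmlDocument mean the HtmlAgilityPack class,
// even if System.Windows.Forms is also imported in this file.
using HtmlDocument = HtmlAgilityPack.HtmlDocument;

public static class AliasDemo
{
    // Parses a small HTML string and returns the text of the document.
    public static string TextOf(string html)
    {
        HtmlDocument doc = new HtmlDocument(); // public constructor, compiles
        doc.LoadHtml(html);
        return doc.DocumentNode.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(TextOf("<p>Hello</p>"));
    }
}
```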

Look at this [question](http://stackoverflow.com/questions/4752840/html-agility-pack-c-sharp-paragraph-parsing-problem) on StackOverflow — djluis, Feb 12 '14 at 22:47
Do you have control over the html you are trying to parse? How do you determine what a "Paragraph" is? P tags? What about web developers that aren't that consistent? Do you look for DIV tags? BR tags? Are you guaranteed consistency in what you're reading? If not, you're talking about a monumental task. The question itself is generic and boils down to "how to parse html", and questions about parsing html have been covered here many times... http://stackoverflow.com/search?q=parse+html — David, Feb 12 '14 at 22:48
Of course, previous note is void if you JUST want P tags and aren't concerned with the fact that there are plenty of developers out there who do whatever they want, not following any given standard. — David, Feb 12 '14 at 22:49
If you don't mind using Html Agility Pack I think you can get a collection of the paragraph tags that you can then iterate. http://htmlagilitypack.codeplex.com/ — jac, Feb 12 '14 at 22:54
Using the Html Agility Pack and selecting all the p tags would seem to be a good first try. "Read all the paragraphs in an HTML file" would be a better way to state the problem as well, seeing as the file doesn't have to have any. PS what if they are nested? — Tony Hopkinson, Feb 12 '14 at 22:55

score 5 · Accepted Answer · answered Feb 13 '14 at 00:06

As people suggest in the comments, I think HtmlAgilityPack is the best choice. It's easy to use, and good examples and tutorials are easy to find.

Here is what I would write:

//don't forget to add the reference to HtmlAgilityPack
using System.Collections.Generic;
using HtmlAgilityPack;

//Function that takes the html as a string parameter and returns a list
//of strings with the paragraphs' content.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
   var pars = new List<string>();

   //first create an HtmlDocument
   HtmlDocument doc = new HtmlDocument();

   //load the html (from a string)
   doc.LoadHtml(sourceHtml);

   //Select all the <p> nodes in a HtmlNodeCollection
   HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes(".//p");

   //SelectNodes returns null (not an empty collection) when nothing matches
   if (paragraphs == null)
   {
      return pars;
   }

   //Iterate over every node in the collection
   foreach (HtmlNode paragraph in paragraphs)
   {
      //Add the InnerText to the list
      pars.Add(paragraph.InnerText);
      //Or paragraph.InnerHtml, depending on what you want
   }

   return pars;
}

It's just a basic example. If your HTML contains nested paragraphs, this code may not work as expected; it all depends on the HTML you are parsing and what you want to do with it.
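
Since the question asks about a file rather than a string, here is a minimal sketch that reads from disk with HtmlDocument.Load instead of LoadHtml. ParagraphReader and ReadParagraphs are hypothetical names, the file is written to the temp folder only so the example is self-contained, and the null check is there because SelectNodes returns null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;

public static class ParagraphReader
{
    // Returns the trimmed text of every <p> element in the file at 'path'.
    public static List<string> ReadParagraphs(string path)
    {
        var pars = new List<string>();

        // Fully qualified to avoid clashing with System.Windows.Forms.HtmlDocument.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(path);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            return pars; // no <p> elements in the file
        }

        foreach (HtmlNode node in nodes)
        {
            pars.Add(node.InnerText.Trim());
        }
        return pars;
    }

    public static void Main()
    {
        // Sample file created just for this demonstration.
        string path = Path.Combine(Path.GetTempPath(), "Platypus.html");
        File.WriteAllText(path, "<html><body><p> First </p><p>Second</p></body></html>");

        foreach (string par in ReadParagraphs(path))
        {
            Console.WriteLine(par);
        }
    }
}
```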

Hope it helps!

Looks promising, thanks! What's the diff between InnerText and InnerHTML? I see it here: http://stackoverflow.com/questions/19030742/difference-between-innertext-and-innerhtml-in-javascript — B. Clay Shannon-B. Crow Raven, Feb 13 '14 at 00:08
I think InnerHtml would give everything between the <p> tags, including any other HTML inside them, whereas InnerText just gives the text with no HTML. But try both to see for yourself. Also, there is often some whitespace at the beginning or the end, so you might want to use .Trim() — Aurelien Souchet, Feb 13 '14 at 00:13
See my answer here: http://stackoverflow.com/questions/21788078/why-is-this-htmlagilitypack-operation-invalid-when-there-are-indeed-matching-e/21791169#21791169 — B. Clay Shannon-B. Crow Raven, Feb 14 '14 at 23:31
