How to parse HTML nodes

Question

My website flow:

1. An authenticated user uploads a docx file.
2. I use the OpenXmlPowerTools API to convert this docx to HTML.
3. Save the file.
4. Save each node of the HTML page into the database.

Database:

tblNodeCollection

  - NodeId
  - NodeType (expected values: <p>, <h1>, <h3>, <table>)
  - NodeContent (expected value: <p> This is p content </p>)

I have no issues up to step 3, but I am clueless about how to save the node collection into the table.

I googled and found HtmlAgilityPack, but I don't know much about it.

using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using HtmlAgilityPack;
using OpenXmlPowerTools;

namespace ExportData
{
    public class ExportHandler
    {
        public void GenerateHTML()
        {
            byte[] byteArray = File.ReadAllBytes(@"d:\test.docx");
            using (MemoryStream memoryStream = new MemoryStream())
            {
                memoryStream.Write(byteArray, 0, byteArray.Length);
                using (WordprocessingDocument doc =
                    WordprocessingDocument.Open(memoryStream, true))
                {
                    HtmlConverterSettings settings = new HtmlConverterSettings()
                    {
                        PageTitle = "My Page Title"
                    };
                    XElement html = HtmlConverter.ConvertToHtml(doc, settings);

                    File.WriteAllText(@"d:\Test.html", html.ToStringNewLineOnAttributes());
                }
            }

            //now how do I proceed from here
        }
    }
}

Any help or guidance is highly appreciated.

Can we ask *why* you're trying to save the nodes in the database? Why not just save the whole XML and parse and process it in memory when needed? — Clint, Dec 16 '16 at 12:09
@Clint No. The website has lots of other stuff to do with each node. — Kgn-web, Dec 16 '16 at 12:09
any context on what that might be? It might entirely dictate the best solution. — Clint, Dec 16 '16 at 12:27
The website is for eLearning. The trainer will upload a Word file where each part of the page (node) will have a separate reviewer and approver. — Kgn-web, Dec 16 '16 at 12:29
This may be a classic hammer & nail problem. Is there any reason you couldn't just split the document into its constituent pages, store each page as a separate document, and link them with entries in the database? That way you achieve the separation, *and* have the ability to bring it all back together in the end. The documents are compressed as well, so you're going to face serious data explosion with a large quantity of documents over time if you're storing per-node data. — Clint, Dec 16 '16 at 12:32

score 0 · Answer 1 · edited May 23 '17 at 12:13

From the discussion we've had in the comments, and the part you seem to be stuck on, I'd recommend the following:

This Question here on SO may provide some help with how to convert to html.

Of course, you still face the issue of needing to split out each page (as mentioned in the comments); you may be able to export each page to HTML individually.

As for your database structure, I'd recommend something akin to:

[Document Table]
  - Document ID
  - Document Name
  - Any other data you need per-document

[Node Table]
  - Node ID
  - Document ID (foreign key)
  - Node Content (string)

Make sure you've got sensible indexes on the node table as you're going to potentially be seeking across thousands if not millions of rows as time goes on (particularly one on the document id).

It might also be useful to have an index property against each node (e.g. a bigint position) so you can reconstitute a document by putting the nodes back together in order.
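Putting the nodes back together then comes down to sorting by that position. A minimal sketch, assuming the rows of one or more documents have already been read from the node table into memory; `StoredNode` and `DocumentAssembler` are hypothetical names, and the properties simply mirror the columns suggested above plus the position column.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory shape of one row of the node table.
public class StoredNode
{
    public long DocumentId { get; set; }
    public long Position { get; set; }      // the "index property" column
    public string NodeContent { get; set; } // markup of a single node
}

public static class DocumentAssembler
{
    // Concatenates the nodes of one document in their stored order.
    public static string Rebuild(IEnumerable<StoredNode> nodes, long documentId)
    {
        IEnumerable<string> parts = nodes
            .Where(n => n.DocumentId == documentId)
            .OrderBy(n => n.Position)
            .Select(n => n.NodeContent);
        return string.Concat(parts);
    }
}
```

In a real application the filtering and ordering would more likely happen in the SQL query (`WHERE` on the document id, `ORDER BY` on the position), which is where the index on the document id pays off.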

Overall though, my advice would be to try and make your boss see reason and really push against this silly design decision.

But how do I split my HTML page into nodes? That's my doubt. — Kgn-web, Dec 16 '16 at 13:21

score 0 · Accepted Answer · answered Dec 16 '16 at 13:26

Here is a simplified procedure for parsing the HTML and saving it to the database. I hope this helps or gives you an idea of how to solve your problem.

        // requires: using System.Data.SqlClient; using HtmlAgilityPack;
        HtmlWeb h = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = h.Load("http://stackoverflow.com/questions/41183837/how-to-store-html-nodes-into-database");
        HtmlNodeCollection tableNodes = doc.DocumentNode.SelectNodes("//table");
        HtmlNodeCollection h1Nodes = doc.DocumentNode.SelectNodes("//h1");
        HtmlNodeCollection pNodes = doc.DocumentNode.SelectNodes("//p");
        //get other nodes here

        // SelectNodes returns null (not an empty collection) when nothing matches
        if (pNodes != null)
        {
            // open one connection for the whole batch and dispose it afterwards
            using (SqlConnection conn = new SqlConnection("here goes connection string"))
            {
                conn.Open();

                foreach (HtmlNode pNode in pNodes)
                {
                    string id = pNode.Id;
                    string content = pNode.InnerText; // use pNode.OuterHtml to keep the tags
                    string tag = pNode.Name;

                    //do other stuff here and then save to database

                    //just an example...
                    using (SqlCommand cmd = new SqlCommand())
                    {
                        cmd.Connection = conn;
                        cmd.CommandText = "INSERT INTO tblNodeCollection (Tag, Id, Content) VALUES (@tag, @id, @content)";
                        cmd.Parameters.AddWithValue("@tag", tag);
                        cmd.Parameters.AddWithValue("@id", id);
                        cmd.Parameters.AddWithValue("@content", content);

                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }
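
Selecting each tag with its own XPath query loses the order in which the nodes appear on the page. A minimal sketch of an alternative, assuming the blocks to store are the direct element children of `<body>`: walk those children in document order and record the tag name, position and full markup of each. `NodeRow` and `NodeSplitter` are hypothetical names, not part of HtmlAgilityPack, and the property names only mirror the columns from the question. If the converter wraps the content in an extra container element, the walk would need to start from that element instead.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical row type mirroring the columns of tblNodeCollection.
public class NodeRow
{
    public int Position { get; set; }       // order of the node within the page
    public string NodeType { get; set; }    // tag name, e.g. "p", "h1", "table"
    public string NodeContent { get; set; } // full markup, e.g. "<p>text</p>"
}

public static class NodeSplitter
{
    // Returns one row per element that is a direct child of <body>, in document order.
    public static List<NodeRow> Split(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<NodeRow>();
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null)
        {
            return rows; // SelectSingleNode returns null when nothing matches
        }

        int position = 0;
        foreach (HtmlNode child in body.ChildNodes)
        {
            // Skip whitespace text nodes and comments between the blocks.
            if (child.NodeType != HtmlNodeType.Element)
            {
                continue;
            }

            rows.Add(new NodeRow
            {
                Position = position++,
                NodeType = child.Name,
                NodeContent = child.OuterHtml
            });
        }
        return rows;
    }
}
```

For the file written in the question this could be called as `NodeSplitter.Split(File.ReadAllText(@"d:\Test.html"))`, and each returned row inserted with a parameterized command like the one above.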

Your post seems highly relevant to my need. Let me further check this. Thanks :) — Kgn-web, Dec 16 '16 at 13:29
Here's an upvote, but it seems that the real question is how to use HtmlAgilityPack to parse HTML :) — Nino, Dec 16 '16 at 14:48
I was clueless about what I should do to get the nodes of the HTML page. But after your post I got to know how to use the API. Thanks mate — Kgn-web, Dec 16 '16 at 14:52
Can you please check this post. http://stackoverflow.com/questions/41220362/how-to-rebuilt-html-from-parse-nodes — Kgn-web, Dec 19 '16 at 10:24
