
My website flow:

  1. An authenticated user uploads a .docx file.
  2. I use the OpenXmlPowerTools API to convert the .docx to HTML.
  3. Save the HTML file.
  4. Save each node of the HTML page into the database.

Database:

tblNodeCollection
  • NodeId
  • NodeType (expected values: <p>, <h1>, <h3>, <table>)
  • NodeContent (expected value: <p> This is p content </p>)

No issues up to step 3, but I am clueless about how to save the node collection into the table.

I googled and found HtmlAgilityPack, but I don't know much about it.

using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using HtmlAgilityPack;
using OpenXmlPowerTools;

namespace ExportData
{
    public class ExportHandler
    {
        public void GenerateHTML()
        {
            byte[] byteArray = File.ReadAllBytes(@"d:\test.docx");

            // the document is opened as editable (true), so copy it into a resizable stream
            using (MemoryStream memoryStream = new MemoryStream())
            {
                memoryStream.Write(byteArray, 0, byteArray.Length);
                using (WordprocessingDocument doc =
                    WordprocessingDocument.Open(memoryStream, true))
                {
                    HtmlConverterSettings settings = new HtmlConverterSettings()
                    {
                        PageTitle = "My Page Title"
                    };
                    XElement html = HtmlConverter.ConvertToHtml(doc, settings);

                    File.WriteAllText(@"d:\Test.html", html.ToStringNewLineOnAttributes());
                }
            }

            //now how do I proceed from here
        }
    }
}

Any type of help/guidance highly appreciated.

Kgn-web
  • Can we ask *why* you're trying to save the nodes in the database? Why not just save the whole XML and parse and process it in memory when needed? – Clint Dec 16 '16 at 12:09
  • @Clint No. The website has lots of other stuff to do with each node. – Kgn-web Dec 16 '16 at 12:09
  • Any context on what that might be? It might entirely dictate the best solution. – Clint Dec 16 '16 at 12:27
  • The website is for eLearning. The trainer will upload a Word file where each part of the page (node) will have a separate reviewer & approver. – Kgn-web Dec 16 '16 at 12:29
  • This may be a classic hammer & nail problem. Is there any reason you couldn't just split the document into its constituent pages, store each page as a separate document, and link them with entries in the database? That way you achieve the separation *and* have the ability to bring it all back together in the end. The documents are compressed as well, so you're going to face serious data explosion with a large quantity of documents over time if you're storing per-node data. – Clint Dec 16 '16 at 12:32
  • @Clint, agreed but I am not the Boss of this system ;) – Kgn-web Dec 16 '16 at 12:36

2 Answers


From the discussion we've had in the comments, and the part you seem to be stuck on, I'd recommend the following:

This question here on SO may help with how to convert to HTML.

Of course, you still need a way to split out each page (as mentioned in the comments); you may be able to export each page to HTML individually.

As for your database structure, I'd recommend something akin to:

[Document Table]
  - Document ID
  - Document Name
  - Any other data you need per-document

[Node Table]
  - Node ID
  - Document ID (foreign key)
  - Node Content (string)

Make sure you've got sensible indexes on the node table as you're going to potentially be seeking across thousands if not millions of rows as time goes on (particularly one on the document id).

It might also be useful to have an index property against each node (e.g. a bigint position) so you can reconstitute a document by putting the nodes back together in order.
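To illustrate the ordering idea, here is a minimal sketch of putting a document back together from its nodes. The `NodeRow` class, the `Position` property, and `DocumentAssembler` are hypothetical names invented for this example; they are not part of any library or of the schema above, and the rows are assumed to have been loaded from the node table already.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory shape of one row of the node table.
public class NodeRow
{
    public long NodeId { get; set; }
    public long DocumentId { get; set; }
    public long Position { get; set; }      // assumed ordering column (the "index property")
    public string NodeContent { get; set; }
}

public static class DocumentAssembler
{
    // Concatenates the nodes of one document in their stored order.
    public static string Reassemble(IEnumerable<NodeRow> rows, long documentId)
    {
        return string.Concat(
            rows.Where(r => r.DocumentId == documentId)
                .OrderBy(r => r.Position)
                .Select(r => r.NodeContent));
    }
}
```

In a real system the filtering and ordering would more likely happen in the SQL query (a WHERE on the document id and an ORDER BY on the position column), which is where the index on the document id pays off.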

Overall though, my advice would be to try and make your boss see reason and really push against this silly design decision.

Clint

Here is a simplified procedure for parsing HTML and saving it to a database. I hope this helps and/or gives you an idea of how to solve your problem.

        // requires: using System.Data.SqlClient; using HtmlAgilityPack;
        HtmlWeb h = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = h.Load("http://stackoverflow.com/questions/41183837/how-to-store-html-nodes-into-database");
        HtmlNodeCollection tableNodes = doc.DocumentNode.SelectNodes("//table");
        HtmlNodeCollection h1Nodes = doc.DocumentNode.SelectNodes("//h1");
        HtmlNodeCollection pNodes = doc.DocumentNode.SelectNodes("//p");
        //get other nodes here

        // SelectNodes returns null (not an empty collection) when nothing matches
        if (pNodes == null) return;

        // open one connection for the whole loop instead of creating one per node
        using (SqlConnection conn = new SqlConnection("here goes connection string"))
        {
            conn.Open();

            foreach (var pNode in pNodes)
            {
                string id = pNode.Id;             // value of the element's id attribute
                string content = pNode.InnerText; // text only; use pNode.OuterHtml to keep the <p>...</p> markup
                string tag = pNode.Name;

                //do other stuff here and then save to database

                //just an example... adjust the column names to match your table
                using (SqlCommand cmd = new SqlCommand(
                    "INSERT INTO tblNodeCollection (Tag, Id, Content) VALUES (@tag, @id, @content)", conn))
                {
                    cmd.Parameters.AddWithValue("@tag", tag);
                    cmd.Parameters.AddWithValue("@id", id);
                    cmd.Parameters.AddWithValue("@content", content);

                    cmd.ExecuteNonQuery();
                }
            }
        }
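
If the nodes need to be stored in document order with their markup intact (as in the NodeContent example in the question), a variation is to walk the children of <body> instead of querying each tag separately. This is only a sketch under assumptions: `NodeExtractor` and `ExtractNodes` are names made up for this example, and it only looks at elements directly under <body>, so if the converter wraps the content in container elements such as <div> you would need to descend into those.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NodeExtractor
{
    // Returns (position, tag name, outer HTML) for each element directly under <body>.
    public static List<Tuple<int, string, string>> ExtractNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<Tuple<int, string, string>>();
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null) return result;    // SelectSingleNode returns null when nothing matches

        int position = 0;
        foreach (HtmlNode node in body.ChildNodes)
        {
            // skip whitespace text nodes and comments between elements
            if (node.NodeType != HtmlNodeType.Element) continue;
            result.Add(Tuple.Create(position++, node.Name, node.OuterHtml));
        }
        return result;
    }
}
```

It could be fed with the file written in step 3, e.g. `NodeExtractor.ExtractNodes(File.ReadAllText(@"d:\Test.html"))`, and each tuple then saved with a parameterized INSERT like the one above.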
Nino