1

I'm trying to learn what the proper or best way to pull data from an html table and import it into a sql table. Every week we get a html document that I must insert into a table. I usually just use sql management to import it into a blank table then merge it with the current table. I know some c# so I wanted to create an importer to automated a bit.

I was thinking of just reading each line and lopping through looking for and and insert the data like that. Is that the best way, or is there a better way to do it?

Thanks

Here is example of the html file. The 1st columns are the column names.

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>
    br
    {mso-data-placement:same-cell;}
</style>
</head>
<body>
<table border="1">
<tr><td><b>#</b></td>
<td><b>Asset Manager</b></td>
<td><b>Billing Address</b></td>
<td><b>Billing City</b></td>
<td><b>Billing State</b></td>
<td><b>Billing Zip Code</b></td>
<td><b>Contract Amount</b></td>
<td><b>DUNS Number</b></td>
<td><b>FEIN</b></td>
</tr>
<tr>
<td>1</td>
<td  style="mso-number-format:\@">Jim Bob</td>
<td  style="mso-number-format:\@">2500 N. Park Pkwy, Suite 600</td>
<td  style="mso-number-format:\@">Plano</td>
<td>Texas</td>
<td  style="mso-number-format:\@">75093</td>
<td>$0.00</td>
<td  style="mso-number-format:\@"></td>
<td  style="mso-number-format:\@"></td>
</tr>

</table>
</body>
</html>

So far I created a button that will grab the document name. Also have the SQLConnection set to the correct server.

private void buttonBrowse_Click(object sender, EventArgs e)
{
        var DB = new System.Windows.Forms.OpenFileDialog();
        if (DB.ShowDialog() == System.Windows.Forms.DialogResult.OK)
        {
            string fileToOpen = DB.FileName;
            textBoxImport.Text = fileToOpen;

        }
    }
hutchonoid
  • 32,982
  • 15
  • 99
  • 104
Zach
  • 348
  • 3
  • 7
  • 22
  • This is a link to an old question of mine which explains how to parse tables with the html agility pack: http://stackoverflow.com/questions/655603/html-agility-pack-parsing-tables. – weismat Jul 03 '13 at 14:22
  • Thanks for the link. I didn't see that one while searching. – Zach Jul 03 '13 at 14:25

5 Answers5

3

As other have mentioned you can use HtmlAgilityPack.

Here is an example:

DTO:

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Address { get; set; }
    public string City { get; set; }
    public string State { get; set; }
    public string Zip { get; set; }
    public decimal Amount { get; set; }
    // etc
}

Usage:

var data = @"<html>
<head>
    <meta http-equiv=""Content-Type"" content=""text/html; charset=UTF-8"" >
<style>
    br
    {mso-data-placement:same-cell;}
</style>
</head>
<body>
<table border=""1"">
<tr><td><b>#</b></td>
<td><b>Asset Manager</b></td>
<td><b>Billing Address</b></td>
<td><b>Billing City</b></td>
<td><b>Billing State</b></td>
<td><b>Billing Zip Code</b></td>
<td><b>Contract Amount</b></td>
<td><b>DUNS Number</b></td>
<td><b>FEIN</b></td>
</tr>
<tr>
<td>1</td>
<td  style=""mso-number-format:\@"">Jim Bob</td>
<td  style=""mso-number-format:\@"">2500 N. Park Pkwy, Suite 600</td>
<td  style=""mso-number-format:\@"">Plano</td>
<td>Texas</td>
<td  style=""mso-number-format:\@"">75093</td>
<td>$0.00</td>
<td  style=""mso-number-format:\@""></td>
<td  style=""mso-number-format:\@""></td>
</tr>

</table>
</body>
</html>";

string xmlData;

// Create the document   
var document = new HtmlDocument();
document.LoadHtml(data);
document.OptionOutputAsXml = true;

// Write it to Xml so we can use XDocument Linq
using(var stream = new StringWriter())
using(var writer = XmlWriter.Create(stream))
{
    document.Save(writer);
    xmlData = stream.ToString();
}

// Use XDocument Linq to parse the xml into Customer objects
var customers = 
    XDocument.Parse(xmlData)
            .XPathSelectElements("//table/tr")
            .Skip(1)
            .Select(x => new Customer {
                Id     =     int.Parse(x.Elements("td").First().Value),
                Name   =               x.Elements("td").Skip(1).First().Value,
                Address=               x.Elements("td").Skip(2).First().Value,
                City   =               x.Elements("td").Skip(3).First().Value,
                State  =               x.Elements("td").Skip(4).First().Value,
                Zip    =               x.Elements("td").Skip(5).First().Value,
                Amount = decimal.Parse(x.Elements("td").Skip(6).First().Value,
                                       System.Globalization.NumberStyles.Currency)
            });
Dustin Kingen
  • 20,677
  • 7
  • 52
  • 92
  • Thank you so much for the example. It clears up how HTML Agility Pack works. Does the style="mso-number-format:\@ throw off the parser? – Zach Jul 03 '13 at 14:22
  • There are no issues using HtmlAgilityPack. The `Meta` tag is not valid Xml, so `HtmlAgilityPack` can fix those common issues. – Dustin Kingen Jul 03 '13 at 14:23
  • I wanted to test this and just set a label to Id value after it parsed. But as I understand that value is a non static property. How to do I set it in a static method? This might be real simple but for some reason my mind is drawling a blank. – Zach Jul 03 '13 at 15:02
  • Could you give some context? I'm not quite understanding what you mean by label. Is that a control on a win form? – Dustin Kingen Jul 03 '13 at 15:05
  • Yes. I created a win form to allow me to select a file to import and then click import to run the import. I'm trying to see how the htmlagilitypack works so I thought to set a label text to one of the parse variables to see what value it is. I belive I was wrong in my assumption and that I can't pull values out of the customer object. – Zach Jul 03 '13 at 15:10
  • The Xml values are good inside the Linq statement. You can bind the `customers` variable to a listbox or datagrid if you want to display it. – Dustin Kingen Jul 03 '13 at 15:14
  • 1
    +1 for creating a working example! – Cᴏʀʏ Jul 03 '13 at 15:36
  • While everyone was great in recommending HtmlAgilityPack. This answer helped me learn how to use it and pull the data correctly. Thanks Romoku. Now to import this customer object into my database and I just saved me self a few hours a month. – Zach Jul 03 '13 at 15:47
  • 1
    @Cory More people should keep Linqpad open on their second/third monitor. – Dustin Kingen Jul 03 '13 at 16:04
  • What would you recommend the best way to the customer object into SQL? Should I convert the object to a datatable then use SqlBulkCopy. Is there a bulkcopy to go with linq-to-sql datacontext? I'm still reading about linq and xDocument to get a better understanding as what everything is doing in your solution. Thanks again. – Zach Jul 03 '13 at 17:24
  • @Romoku: I 100% agree. Great tool. – Cᴏʀʏ Jul 03 '13 at 17:41
  • @Zach There are several ways (SqlCommand/third party orm/sqlbulkcopy) to get the data into Sql Server. Check out some other questions on StackOverflow. – Dustin Kingen Jul 03 '13 at 18:13
  • Thanks. I wasn't sure if there was a different choice in the solution you gave. – Zach Jul 03 '13 at 18:23
1

Good for parse HTML is HtmlAgility Pack. Here

slavoo
  • 5,798
  • 64
  • 37
  • 39
1

If you must receive the file in HTML format, I would look into third party libraries that can parse HTML into a usable document structure (e.g. the HTMLAgilityPack).

Using a library, you can traverse the document and pull the values from the nodes without having to try to parse them out with ugly code.

I would start by creating a class that represents a row in the table. Give it properties that represent the table headings. Then use the HTML library to obtain the values you want and create instances of said class, set the property values, and add them to a collection.

Finally, open a connection to your database. Loop over the collection of classes and insert their data one by one into your database using prepared statements.

When using the SqlConnection class, make sure to apply proper disposal techniques and take advantage of the using statement so your connection is cleaned up nicely.

Cᴏʀʏ
  • 105,112
  • 20
  • 162
  • 194
  • That is how I receive the html document every time. But thanks for the third party libraries. I will defiantly look into using that. – Zach Jul 03 '13 at 13:56
  • when traversing each node. I have some td have "style="mso-number-format:\@" added, While some are . When selecting a singlenode do I need to look for both, or by just saying TD it will put the one with the style attribute added? – Zach Jul 03 '13 at 14:15
  • Just should be fine. If you find that you need to look for both, there should be a way (via xpath or otherwise) to filter the nodes where that attribute exists. – Cᴏʀʏ Jul 03 '13 at 15:35
0

You could use HtmlAgilityPack to parse the Html and read it into a DataTable object. The DataTable could then easily be written to the database using either regular inserts or SqlBulkCopy if you are transferring large amounts of data.

Rune Grimstad
  • 35,612
  • 10
  • 61
  • 76
  • Thank you. I will check HtmlAgilityPack. With searching before I posted, I never saw anyone suggestion that. Thanks – Zach Jul 03 '13 at 13:58
0

this is how you can load the content of a HTML table into a datatable, you can then from the data table update your real table...this may be a good start, give it a try, you can copy paste it to run as a console app

namespace ConsoleApplication4   {
    class Program
{
    static void Main(string[] args)
    {
        XmlDocument doc = new XmlDocument();

    doc.LoadXml(@"<table border='1' cellpadding='0' cellspacing='0'>
                <tr>
                <td width='50%'>cell 1a</td>
                <td width='50%'>cell 1b</td>
                </tr>
                <tr>
                <td width='50%'>cell 2a</td>
                <td width='50%'>cell 2b</td>
                </tr>
                </table>");

    DataTable dt = new DataTable();

    dt.Columns.Add("Col1");

    dt.Columns.Add("Col2");

    foreach (XmlNode ndRow in doc.DocumentElement.ChildNodes)
    {
    DataRow dr = dt.NewRow();
    for (int colIndex = 0; colIndex < ndRow.ChildNodes.Count; colIndex++)
    dr[colIndex] = ndRow.ChildNodes[colIndex].InnerText;
    dt.Rows.Add(dr);
    }

    foreach (DataRow r in dt.Rows) {
        Console.Write(r[0] + " --- " + r[1] );
        Console.WriteLine("");
    }

    Console.ReadLine();

}
    }
}
joe
  • 546
  • 1
  • 6
  • 20