4

I've got to grab some product data off an existing website to put into a database. The data is all in HTML table format, the model numbers are unique, but each product can have any number of different attributes (so the tables I need to parse all have different columns and headings).

<table>
<tr>
<td>Model No.</td>
<td>Weight</td>
<td>Colour</td>
<td>Etc..</td>
</tr>
<tr>
<td>8572</td>
<td>12 Kg</td>
<td>Red</td>
<td>Blah..</td>
</tr>
<tr>
<td>7463</td>
<td>7 Kg</td>
<td>Blue</td>
<td>Blah..</td>
</tr>
<tr>
<td>8332</td>
<td>42 Kg</td>
<td>Yellow</td>
<td>Blah..</td>
</tr>
</table>

This is the CSV output format I'm looking for:

Model-No,Attribute-Name,Attribute-Value
8572,"Weight","12 Kg"
8572,"Colour","Red"
8572,"Etc","Blah.."
7463,"Weight","7 Kg"
7463,"Colour","Blue"
7463,"Etc","Blah.."
8332,"Weight","42 Kg"
8332,"Colour","Yellow"
8332,"Etc","Blah.."

As the tables all appear to be valid xhtml I'll probably load each one into an XmlDocument, but does anyone have any suggestions for a better way of accomplishing this? Thanks.

Nick
  • 5,616
  • 10
  • 52
  • 72
  • @Ash - WebApp - I'm only after code to convert the html table string into the csv string though – Nick Jun 15 '11 at 10:48
  • Take a look here http://www.codeproject.com/Tips/142467/Convert-HTMLTable-to-Comma-Separated-Values – NoWar May 28 '15 at 01:06

5 Answers5

5

I can think of 3 ways to do this:

  • HTML Agility pack: load the HTML and loop through the elements and write your CSV. Some examples here.
  • Use Regex to parse the table.
  • if your HTML is XHTML (valid XML) you can write a XSLT template to create the CSV authomatically. This is the neatest but not the easiest one.
Aliostad
  • 80,612
  • 21
  • 160
  • 208
  • 4th option could be Linq to XML but a little search here on stack overflow shows that HTML Agility Pack now supports Linq to Object pretty much like Linq to XML more: http://stackoverflow.com/questions/542194/c-is-there-a-linq-to-html-or-some-other-good-net-html-manipulation-api/542278#542278 – Usman Masood Jun 15 '11 at 10:49
  • Any reason you would avoid using XmlDocument to loop through the HTML? – Nick Jun 15 '11 at 10:56
  • No! I could not think of that option. It is a valid option and perhaps better than the rest **provided the HTML is a valid XML** - which is not always the case. – Aliostad Jun 15 '11 at 11:00
  • annotiation to "use regex": in this case, this is possible, but generally, HTML is not a regular language and thus not parseable using regular expressions! – eFloh Jun 15 '11 at 11:58
3

You can always go with Linq to XML assuming you are at least in NET 3.5 environment.

Damian
  • 511
  • 5
  • 10
1

HtmlAgilityPack is amazing for scraping data off html web pages, use that to scrap the tables into some sort of intermediate object, then you can form a csv file from this object.

Khaled Nassar
  • 884
  • 8
  • 23
0

There is a very easy way (albeit an inelegent one) to accomplish this. If its just a one off, just open the htm/html file with the table in it with excel and then save the sheet as a .csv file (if there is any data outside the table in the file it can easily be removed from excel).

If you will be repeating this task you can use Microsoft.Office.Interop.Excel namespace in C# (or VB .net) to automate it in a few lines like so:

using Microsoft.Office.Interop.Excel;

...

Application app = new Application();
app.ScreenUpdating = false;
app.DisplayAlerts = false;
app.AskToUpdateLinks = false;
app.Visible = false;

Workbook workbook = app.Workbooks.Open(fileName + ".html", false, false,
               Type.Missing, Type.Missing, Type.Missing, Type.Missing,
               Type.Missing, Type.Missing,
               Type.Missing, Type.Missing, Type.Missing, Type.Missing,
               Type.Missing, Type.Missing);


workbook.SaveAs(fileName + ".csv", Microsoft.Office.Interop.Excel.XlFileFormat.xlCSV);

workbook.Close(false, Type.Missing, Type.Missing);
workbook = null;
app.Quit();
app = null;

... 

And for this case it should be easy to parse out the non-table in the html file using Regex on the table tags if necessary. In Visual Studio 2005 and up you just right click on references for your project and you should find Microsoft.Office.Interop.Excel under the .Net tab.

jm1102
  • 180
  • 1
  • 2
0

In addition to HtmlAgilityPack, Khaled Nassar mentioned. You can do it via jQuery apply .each('tr') and assign 1st, 2nd and 3rd child to product object which you can send via service or handler which will create cvs from it.

eugeneK
  • 10,750
  • 19
  • 66
  • 101