
With PowerShell, I can get the table with this -

$URL = "http://example.com/yyy.htm"
$OutputFile = "$env:temp\tempfile.xml"

# reading website data:
$data = Invoke-WebRequest -Uri $URL 

# get the first table found on the website and write it to disk:
@($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML | Set-Content -Path $OutputFile

Now I want this table to be converted to CSV... How do I do that?

Table example -

Datacenter | FirstDNS | SecondDNS | ThirdDNS | FourthDNS
---------------------------------------------------------
NewYork    | 1.1.1.1  | 2.2.2.2   | 3.3.3.3  | 4.4.4.4
India      | 1.2.3.4  | 3.2.6.5   | 8.2.3.7  | 8.3.66.1
– Ranjit Roy
  • NB: `Set-Content` only creates a text file with the extension .xml; it's not necessarily a real XML file. I guess you're just writing to an XML file to do the conversion; for that you can simply convert the string to XML: `[xml]$myXml = @($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML | out-string` (a minimal sketch of that idea follows after these comments). NB: This will not be robust; if you're not using XHTML there's no guarantee that the HTML will obey XML's syntax rules (e.g. HTML5 has void elements that must not have a closing tag, whilst in XML every element must either be self-closing or have an explicit closing tag). – JohnLBevan Aug 11 '15 at 19:10
  • Ref http://stackoverflow.com/questions/3558119/are-self-closing-tags-valid-in-html5 – JohnLBevan Aug 11 '15 at 19:10
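A minimal sketch of that `[xml]` idea, assuming the table really is as simple as the example above (plain `<table><tr><td>` markup with no `th`, `tbody`, `colspan`, or unclosed tags; real `ParsedHtml` output frequently violates this, as the comment warns). It reuses `$data` from the question; the CSV path is just an example:

# Sketch only: assumes well-formed, plain <table><tr><td> markup
[xml]$myXml = @($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML | Out-String
$rows       = @($myXml.table.tr)     # every row; the first row is assumed to hold the headers
$headers    = @($rows[0].td)         # header cell text
$rows | Select-Object -Skip 1 | ForEach-Object {
    $cells = @($_.td)
    $row   = [ordered]@{}
    for ($i = 0; $i -lt $headers.Count; $i++) { $row[$headers[$i]] = $cells[$i] }
    [pscustomobject]$row
} | Export-Csv "$env:temp\tempfile.csv" -NoTypeInformation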

2 Answers


Here's a solution to convert HTML tables to PSObjects, which you can then pipe to Export-CSV or do whatever else you need. Please note: this is not a clean solution; it just about does the job for simple scenarios, but has a lot of issues:

  • Can't cope with special characters other than `&nbsp;`; to get it to work you'll need to add new definitions to the DocType's entity map as required.
  • Can't cope with colspan or rowspan; assumes that every row of a table has the same number of columns as its header row (there's a tweak to prevent errors if there are more columns than headers, but you may still get misalignment in that scenario).
  • My technique for cleaning the HTML before converting to XML was to use a regex rather than a parsing library; so there could well be unexpected issues there.
function ConvertFrom-HtmlTableRow {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTableRow
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        $headers
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        [switch]$isHeader

    )
    process {
        $cols = $htmlTableRow | select -expandproperty td
        if($isHeader.IsPresent) {
            0..($cols.Count - 1) | %{$x=$cols[$_] | out-string; if(($x) -and ($x.Trim() -gt [string]::Empty)) {$x} else {("Column_{0:0000}" -f $_)}} #clean the headers to ensure each col has a name        
        } else {
            $colCount = ($cols | Measure-Object).Count - 1
            $result = new-object -TypeName PSObject
            0..$colCount | %{
            $colName = if($headers[$_]){$headers[$_]}else{("Column_{0:0000}" -f $_)} #in case we have more columns than headers
                $colValue = $cols[$_]
                $result | Add-Member NoteProperty $colName $colValue
            } 
            write-output $result
        }
    }
}

function ConvertFrom-HtmlTable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTable
    )
    process {
        #currently only very basic <table><tr><td>...</td></tr></table> structure supported
        #could be improved to better understand tbody, th, nested tables, etc

        #$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow

        #remove any tags that aren't tr or td (simplifies our parsing of the data)
        [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp '&#160;'>]><root>{0}</root>" -f ($htmlTable | select -ExpandProperty innerHTML | %{(($_ | out-string) -replace '(</?t[rdh])[^>]*(/?>)|(?:<[^>]*>)','$1$2') -replace '(</?)(?:th)([^>]*/?>)','$1td$2'})) 
        [string[]]$headers = $cleanedHtml.root.tr | select -first 1 | ConvertFrom-HtmlTableRow -isHeader
        if ($headers.Count -gt 0) {
            $cleanedHtml.root.tr | select -skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | select $headers
        }
    }
}

clear-host

[System.Uri]$url = 'https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions' 
$rqst = Invoke-WebRequest $url 
$rqst.ParsedHtml.getElementsByTagName('table') | ConvertFrom-HtmlTable 
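To land that output in a CSV file, as the question asks, the objects can be piped straight to Export-Csv; e.g. for the first table on the page (the output path below is just an example):

@($rqst.ParsedHtml.getElementsByTagName('table'))[0] | ConvertFrom-HtmlTable | Export-Csv -Path "$env:temp\table.csv" -NoTypeInformation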

FYI: I've also published an earlier version of this code on CodeReview, so check there to see if anyone suggests any good improvements.

– JohnLBevan

Use Tee-Object to output the file, and you can pipe to Export-CSV after that:

@($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML | Tee-Object -FilePath $OutputFile | Export-CSV $env:temp\tempfile.csv -notype
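For context on the comment thread below: Export-Csv serializes the properties of the objects it receives, and a raw HTML string's only property is Length, which is why the resulting CSV contains just a Length column instead of table data. A tiny illustration (the file path is only an example):

# A bare string piped to Export-Csv yields only its Length property
'<table>...</table>' | Export-Csv "$env:temp\demo.csv" -NoTypeInformation
Get-Content "$env:temp\demo.csv"
# "Length"
# "18"

The HTML has to be parsed into objects with named properties (as in the other answer) before Export-Csv will produce meaningful columns.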
– TheMadTechnician
  • Thanks for the response... but I only get a file that has a row called "Length" followed by "139670" below it... no data :( – Ranjit Roy Sep 18 '14 at 17:14
  • How can I add a sample? – Ranjit Roy Sep 18 '14 at 17:27
  • Ah, so that doesn't grab something that can be easily converted to CSV. Sorry, I don't have an HTML table to test with, so I went with an assumption. Can you post a link to a table that could be used as an example? Or maybe pastebin something? – TheMadTechnician Sep 18 '14 at 17:32
  • My heartiest apologies... it's an internal site... so I converted a small portion into something I can share as an attachment, but I can't find the place to attach it. How may I send it to you? You can contact me at luckyranjitroy@gmail.com... It is utterly important to me... – Ranjit Roy Sep 18 '14 at 17:36
  • @Moderator- can you please help? How do I attach my file for use? – Ranjit Roy Sep 18 '14 at 17:44