0

I have the below code to convert from html table to csv. But somehow if there is any "," in the table cell data.. it truncates the table cell and creates a new column.. how can i make sure if the table cell data doesnt truncates if it contains "comma"..

$r = Invoke-WebRequest 'https://htmlwithtable.htm'
$data = ( $r.ParsedHtml.getElementsByTagName("table") | Select-Object -First 1 ).rows
$table = @()
forEach($datum in $data){
    if($datum.tagName -eq "tr"){
        $thisRow = @()
        $cells = $datum.children
        forEach($cell in $cells){
           if($cell.tagName -imatch "t[dh]"){
                $thisRow += $cell.innerText
            }
        }
        $table += $thisRow -join ","
    }
}

$table | out-file c:\change\htmltocsv.csv -Encoding ascii
Enigma
  • 123
  • 1
  • 13

1 Answers1

1

You need to quote the values you are retrieving from the cells.

Something like this should work:

# load the System.Web assembly to be able to decode HTML entities
Add-Type -AssemblyName System.Web

$result = Invoke-WebRequest 'https://htmlwithtable.htm'
$data = $result.ParsedHtml.getElementsByTagName("table") | Select-Object -First 1
$table = $data.Rows | ForEach-Object {
    if ($_.tagName -eq 'tr'){
        $csvRow = foreach($cell in $_.children){
            if ($cell.tagName -match 't[dh]'){
                # decode HTML entities and double-up quotes that the value may contain 
                $value = [System.Web.HttpUtility]::HtmlDecode($cell.innerText) -replace '"', '""'
                if ($cell.innerHtml -match 'href="([^"]*)') {
                    # if the cell contains a link, add it to the value between brackets
                    $value += ' ({0})' -f $Matches[1]
                }
                '"{0}"' -f $value
            }
        }
        $csvRow -join ','
    }
}

$table | Out-File 'c:\change\htmltocsv.csv'
Theo
  • 57,719
  • 8
  • 24
  • 41
  • Thanks Theo. But some how its truncates the cell data.The html source says below for the problematic cell Migrate"Dell,Lenovo workstatio" to Windows 10 – Enigma Oct 21 '19 at 10:21
  • Also the below seems to work but its stripping the header from the csv [link] https://stackoverflow.com/questions/25918094/how-to-convert-html-table-to-csv-file-with-same-structure-with-powershell – Enigma Oct 21 '19 at 11:23
  • @Enigma Thanks for that example. I have edited the code to deal with HTML entities and possible quote characters within the cell's value – Theo Oct 21 '19 at 13:15
  • Thanks Theo and much appreciate your help with this. Since the website i pull the table is dynamic, the problematic table cell is removed and i have to wait for similar occurrence. I will update once i have further occurrence but its working fine with other cell data and also trying to test from a saved html file. Will keep you posted, – Enigma Oct 22 '19 at 04:50
  • is there anyway to save the html link from a html table to csv file. – Enigma Oct 24 '19 at 11:46
  • @Enigma You mean when a table cell contains a `displaytext` you want only the href part? – Theo Oct 24 '19 at 16:25
  • I want both to be saved in csv file.i mean href and the link name – Enigma Oct 24 '19 at 22:46
  • @Enigma Sorry for the late reply.. I have added a bit to include the link (if any) to the csv. – Theo Oct 28 '19 at 15:19