4

I have to parse CSV file with delimiter semicolon exported from database. Simply

$csv = import-csv -Path C:\Users\user\Desktop\bla\file.csv -Delimiter ';'
foreach ($line in $csv) {     
  $field = $line -split ';'  
  echo $field[3]
}

doesn't work well because in one of columns I have sample HTML code which I have to use. Field starts with ;<div> and ends with </div>; Between tags I have tags with style atribute so there is lot semicolons. Anyone have idea how to fix or parse file with text?

Few lines of CSV file

product_code;active;name;price;vat;unit;category;producer;other_price;weight;description;stock;stock_warnlevel;availability;delivery;views;rank;rank_votes;images 1;images 2;images 3;images 4;images 5;images 6
raz;1;nazwa pierwszego;19.95;23%;szt.;kategoria;producent1;;1;<div style="background-color:#fff;min-width:640px;max-width:980px;margin:0 auto;padding: 30px"><table style="width:100%;" class="mceItemTable"><tbody><tr><td style="width:50%;"><p style="text-align:;font:16px arial;color:;margin:1em 0;">sometext</p></td><td style="width:50%;"><img style="width:100%;max-width:600px;display:block;margin:0 auto;" src="http://domain.tld/image.png"></td></tr></tbody></table></div>;;1;auto;48 godzin;0;0;0;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg
dwa;1;nazwa drugiego;25.95;23%;szt.;kategoria;producent2;;1;<div style="background-color:#fff;min-width:640px;max-width:980px;margin:0 auto;padding: 30px"><table style="width:100%;" class="mceItemTable"><tbody><tr><td style="width:50%;"><p style="text-align:;font:16px arial;color:;margin:1em 0;">sometext</p></td><td style="width:50%;"><img style="width:100%;max-width:600px;display:block;margin:0 auto;" src="http://domain.tld/image.png"></td></tr></tbody></table></div>;12.0000;1;auto;48 godzin;0;0;0;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg
trzy;1;nazwa trzeciego;29.95;23%;szt.;kategoria;producent1;;1;<div style="background-color:#fff;min-width:640px;max-width:980px;margin:0 auto;padding: 30px"><table style="width:100%;" class="mceItemTable"><tbody><tr><td style="width:50%;"><p style="text-align:;font:16px arial;color:;margin:1em 0;">sometext</p></td><td style="width:50%;"><img style="width:100%;max-width:600px;display:block;margin:0 auto;" src="http://domain.tld/image.png"></td></tr></tbody></table></div>;1.0000;1;auto;48 godzin;0;0;0;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg
cztery;1;nazwa czwartego;3.95;23%;szt.;kategoria;producent2;;1;<div style="background-color:#fff;min-width:640px;max-width:980px;margin:0 auto;padding: 30px"><table style="width:100%;" class="mceItemTable"><tbody><tr><td style="width:50%;"><p style="text-align:;font:16px arial;color:;margin:1em 0;">sometext</p></td><td style="width:50%;"><img style="width:100%;max-width:600px;display:block;margin:0 auto;" src="http://domain.tld/image.png"></td></tr></tbody></table></div>;2.0000;1;auto;48 godzin;0;0;0;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg;http://domain.tld/image.jpg
Bhargav Rao
  • 50,140
  • 28
  • 121
  • 140
tomipnh
  • 193
  • 3
  • 12
  • 1
    can you add few lines from the csv? – Avshalom Jul 06 '15 at 19:27
  • As formatted, that file just isn't going to work. Without a text qualifier, you can't reliably tell the difference between a semicolon used as a delimiter between fields vs. part of the content within the fields. Use a different delimiter, or use a text qualifier (typically quotes) so they can be distinguished. – alroc Jul 06 '15 at 19:47
  • Column with description will always starts with `
    ` regex will not work?
    – tomipnh Jul 06 '15 at 19:54
  • @tomipnh Yep, regex shouldn't work as well, you might have nested divs in production - surprise surprise. Only a HTML/XML validation can help. I've got you one. – Vesper Jul 07 '15 at 08:27

3 Answers3

2

You should use a custom parser in this case. Your file is not a valid CSV because it does not have string delimiters wrapping the data (although it's hard to correctly wrap a HTML, you might first HTML-escape it, then wrap with quotes then separate with commas/semicolons). If you are creating such a file yourself, consider using [System.Web.HttpUtility]::HtmlEncode() to perform escaping of HTML characters. If not, and you need to parse this file, you will need to join the parts of the string that are mistakingly split by semicolons - but of course, raw call to Import-CSV will not work, and you'll have to simulate its functionality.

function Import-XMLCSV {
    Param($text,[char]$delimiter=',')
    $columns, $splitText=$text.split("`r`n") # we need lines, not full string
    # also this neat trick splits first line off the rest of text
    $columns= $columns.split($delimiter) 
    $splitText | foreach {
        $splits=@{}
        $splitLine=$_.split($delimiter) # split line normally
        $index=0
        $propIndex=0
        $value=""
        $tag=""
        while ($index -lt $splitLine.length) {
            if ($value -ne "") { $value+=$delimiter }
            if ($splitLine[$index] -match "^<([a-zA-Z0-9]+)") { $tag = $matches[1] }
            $value+=$splitLine[$index]
            if ($tag -eq "") {
                # no tag found, put full string in this property
                $splits[$columns[$propIndex]]=$value
                $value=""
                $propIndex+=1
            } else {
                if ($splitLine[$index] -match "/${tag}") {
                    # if there's a corresponding tag in this piece
                    # check valid XML in here, if not, continue
                    try {
                        $xml = New-Object System.Xml.XmlDocument
                        $xml.LoadXml($value)
                        # throws exception if not a valid XML, so won't save if unpaired
                        $splits[$columns[$propIndex]]=$value
                        $value=""
                        $propIndex+=1
                        $tag=""
                    }
                    catch [System.Xml.XmlException] {
                        # no action
                        write-debug "$index $propIndex $tag $value"
                        write-debug $_.exception
                    }
                } # if matches /tag
            } # if not matches /tag, continue adding to $value
            $index+=1
        } # end while
        # past this, we've got hash table populated
        New-Object PSCustomObject -Property $splits # return prepared object
    } # end foreach splittext
}

This code works with restrictions (see below).

Note though, if you don't have a valid XML or a string in either of your fields, you will result in wrong output. Primarily, the trouble with your sample data is in your <img> tags, they aren't closed as is demanded by XML standard. To resolve, change them like so: <img style="..." src="..." /> - the last slash indicates immediate tag closure. Otherwise XML validation fails and you don't get "description" populated. XML validation in this code is a test in case there would be nested starting tags, say <div>...<div>...</div>...</div> so that building of the string won't stop after encountering the first </div>.

Community
  • 1
  • 1
Vesper
  • 18,599
  • 6
  • 39
  • 61
1

Use the below script to convert comma/semi-column/pipe separated or any other symbol delimited values to different columns in Excel. Save this as a .ps1 file.

$executingPath = split-path -parent $MyInvocation.MyCommand.Definition
$inputCSV = $executingPath + "\InputFileName.txt"
$outputXLSX = $executingPath + "\Output.xlsx"
$excel = New-Object -ComObject excel.application 
$workbook = $excel.Workbooks.Add(1)
$worksheet = $workbook.worksheets.Item(1)
$TxtConnector = ("TEXT;" + $inputCSV)
$Connector = $worksheet.QueryTables.add($TxtConnector,$worksheet.Range("A1"))
$query = $worksheet.QueryTables.item($Connector.name)
$query.TextFileOtherDelimiter = $Excel.Application.International(5)
$query.TextFileParseType  = 1
$query.TextFileColumnDataTypes = ,2 * $worksheet.Cells.Columns.Count
$query.AdjustColumnWidth = 1
$query.Refresh()
$query.Delete()
$Workbook.SaveAs($outputXLSX,51)
$excel.Quit()

Place the input file where the script file is placed and run the script. The output excel file will be generated in the same location.



By default, Windows will have a default separator according to the region. For example, it may be comma as the default delimiter. If you want to change to semi-column, follow the below steps.



Go to Control Panel and click on Region and Language. A window will open. Click on Additional Settings.

enter image description here

Now another window will open. Change the symbol in the List Separator section to the desired symbol(for example semi-column) and click apply.

enter image description here

Run the script. It will create an excel files and the columns in the excel file will be generated on the basis of semi-column.

Sarath Subramanian
  • 20,027
  • 11
  • 82
  • 86
0

This may not be a solution to what I expected but it works well, but it is much easier than a solution with parsing Xml.

$strPath="C:\Users\user\Desktop\bla\file.csv"
$objExcel=New-Object -ComObject Excel.Application
$objExcel.Visible=$false
$workbook=$objExcel.Workbooks.Open($strPath)
$worksheet = $workbook.sheets.item("file")
Write-Host $worksheet.Range("K3").Text
$objexcel.quit()

To work requires Microsoft Excel.

tomipnh
  • 193
  • 3
  • 12