
I have a CSV file in UTF-8 format with Spanish diacritics. If I load it from Access and select UTF-8, all is well. I want to automate my work, so I made a PowerShell script. Because the table is already created, I can't load the data with SELECT * and have to use INSERT INTO. With SELECT * I can specify the UTF-8 character set like [text;HDR=Yes;CharacterSet=65001;]. How can I do that with "INSERT INTO"?

My code so far (working, but UTF-8 characters come out as gibberish):

$PSDefaultParameterValues['*:Encoding'] = 'utf8'
$connectstring = "Provider=Microsoft.ACE.OLEDB.12.0; Data Source=C:\Users\Nobody\Desktop\Mexico-test.accdb"
$conn = New-Object System.Data.OleDb.OleDbConnection($connectstring)

$conn.Open()

$Voters = Import-Csv -Delimiter "," -Path "C:\Users\Nobody\Desktop\mexico.csv"

foreach ($Voter in $Voters)
{
    $curp = $Voter.curp
    $age = $Voter.age
    $forename = $Voter.forename
    $middlename = $Voter.middlename
    $surname = $Voter.surname
    $fatherSurname = $Voter.fatherSurname
    $motherSurname = $Voter.motherSurname
    $cmd = $conn.CreateCommand()
    $cmd.CommandText="INSERT INTO voters(curp,age,forename,middlename,surname,fatherSurname,motherSurname) VALUES('$curp','$age','$forename','$middlename','$surname','$fatherSurname','$motherSurname')"
    $cmd.ExecuteNonQuery()
}

$conn.Close()
bsteo
  • Do any of the suggestions at https://stackoverflow.com/a/40098904/229367 help? – Pranav Negandhi Mar 20 '20 at 12:02
  • I don't think it's PowerShell related. – bsteo Mar 20 '20 at 12:24
  • Have you tried to print the values read from the file to the console? Do they appear correctly? – Pranav Negandhi Mar 20 '20 at 12:34
  • It seems it doesn't show OK in the console: `Campo L�pez` should be `Campo López` – bsteo Mar 20 '20 at 12:42
  • I did that, same result: `$Voters = Import-Csv -Delimiter "," -Path "C:\Users\Derecha\Desktop\table.csv" -Encoding utf8` – bsteo Mar 20 '20 at 12:55
  • That means that `C:\Users\Derecha\Desktop\table.csv` _isn't actually UTF-8_ - you need to determine the actual encoding and match that with `-Encoding`; e.g., try `-Encoding Default` in Windows PowerShell. – mklement0 Mar 20 '20 at 13:01
  • Your entire approach seems a bit silly, you can `INSERT INTO SomeTable SELECT Something FROM [text;HDR=Yes;CharacterSet=65001;Database=C:\SomeFolder].[SomeFile#csv]` from Powershell and not have to deal with preserving encoding from CSV to Powershell to Access (just CSV -> Access which you already figured out). – Erik A Mar 20 '20 at 13:05 (a sketch of this approach follows these comments)
  • Not silly at all; the CSV file is big, and if I do it your way Access complains and it doesn't work. – bsteo Mar 20 '20 at 13:13
  • Well.. Then ask about it, because it should – Erik A Mar 20 '20 at 13:38
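
For reference, here is a hedged sketch of the text-driver approach Erik A describes above: the INSERT INTO ... SELECT runs entirely inside Access, so PowerShell never has to decode the CSV itself. The database path, CSV folder, and column names below are assumptions carried over from the question, and the ACE OLEDB provider must be installed.

# Sketch only - paths and column names are assumed from the question.
# The Access text driver reads the CSV directly as UTF-8 (CharacterSet=65001);
# note that the dot in the file name becomes '#' in the source-table reference.
$connectstring = "Provider=Microsoft.ACE.OLEDB.12.0; Data Source=C:\Users\Nobody\Desktop\Mexico-test.accdb"
$conn = New-Object System.Data.OleDb.OleDbConnection($connectstring)
$conn.Open()

$cmd = $conn.CreateCommand()
$cmd.CommandText = @"
INSERT INTO voters(curp,age,forename,middlename,surname,fatherSurname,motherSurname)
SELECT curp,age,forename,middlename,surname,fatherSurname,motherSurname
FROM [text;HDR=Yes;CharacterSet=65001;Database=C:\Users\Nobody\Desktop].[mexico#csv]
"@
$cmd.ExecuteNonQuery()

$conn.Close()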

2 Answers


Your code correctly requests session-wide use of UTF-8 encoding with this statement:

$PSDefaultParameterValues['*:Encoding'] = 'utf8'

See the documentation for the dictionary of parameter presets implemented via preference variable $PSDefaultParameterValues.

Therefore, in your case -Encoding UTF8 is implicitly in effect in calls to any cmdlets that have an -Encoding parameter, such as Import-Csv.

Therefore:

  • Your Import-Csv call does read your CSV file as UTF-8.

  • However, your symptom (values not printing correctly in the console) suggests that your CSV input file isn't actually UTF-8-encoded.

Thus, the solution is to determine the CSV file's actual encoding[1] and pass its name to the -Encoding parameter:

Given that, as you've since confirmed, your file's actual encoding was ANSI (the fixed single-byte encoding determined by your system's active legacy code page), use the Default encoding name in Windows PowerShell:

$voters = Import-Csv -Encoding Default -Delimiter ',' -Path C:\Users\Nobody\Desktop\mexico.csv

In PowerShell [Core] 6+, you actually need to pass the specific ANSI code page used, which on US-English systems is Windows-1252, for instance (see the docs for the list of supported code pages; use either the value from the ".NET Name" column or the number from the "Identifier" column, but without a leading 0).

# Use the Windows-1252 ANSI encoding.
$voters = Import-Csv -Encoding 1252 -Delimiter ',' -Path C:\Users\Nobody\Desktop\mexico.csv

Note: As of v7.0, support for the Default encoding name to refer to the active ANSI code page has inexplicably not been implemented - see this GitHub issue; make your voice heard there, if you'd like to see that changed.
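
If you don't want to hard-code 1252, here is a minimal sketch, assuming PowerShell 6.2+ (where -Encoding also accepts a [System.Text.Encoding] instance), that looks up the system's active ANSI code page at runtime:

# Sketch: derive the active ANSI code page dynamically instead of hard-coding 1252,
# so the same call also works on systems whose legacy code page isn't Windows-1252.
$ansiEncoding = [System.Text.Encoding]::GetEncoding([System.Globalization.CultureInfo]::CurrentCulture.TextInfo.ANSICodePage)
$voters = Import-Csv -Encoding $ansiEncoding -Delimiter ',' -Path C:\Users\Nobody\Desktop\mexico.csv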

For a comprehensive overview of encoding behavior in PowerShell and how it has changed between Windows PowerShell (versions up to v5.1) and PowerShell [Core] (versions starting with v6), see this answer.


[1] Determining a text file's encoding:

Note: In PowerShell [Core] 6+, Get-Content printing a file's text correctly to the screen means that all cmdlets will interpret it correctly; sadly, due to the wildly inconsistent behavior of cmdlets in Windows PowerShell (versions up to v5.1), that isn't necessarily true there; Import-Csv is a prime example, because it defaults to ASCII(!) encoding - see this answer for background information.
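
As a quick sanity check along those lines (the path is assumed to be the example file from the question):

# In PowerShell (Core) 6+: if the accented characters in these lines render correctly,
# Import-Csv will interpret the file correctly as well.
Get-Content -Path C:\Users\Nobody\Desktop\mexico.csv -TotalCount 5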

  • Platform-specific options:

    • Windows:

      • Load the file into Notepad, which in the absence of a Unicode BOM (signature) tries to auto-detect the encoding, and can usually tell the difference between UTF-8 and ANSI: see if the text is displayed correctly, then look in the bottom right corner (status bar) for the encoding name being displayed, e.g., "ANSI"; note, however, that it won't be able to tell you what specific ANSI code page may have been used (if the file came from a different culture), because that is generally impossible to infer.
    • Unix-like platforms (macOS, Linux, including WSL):

      • Use the file utility (e.g., file mexico.csv) which tries to auto-detect the encoding.
      • Caveat: file misidentifies Windows-1252 as ISO-8859, which is not strictly correct - they overlap to a great degree, but aren't identical: see the docs.
  • Cross-platform options:

    • Pass your file to Format-Hex (e.g., Format-Hex mexico.csv) to examine the byte values; note: be sure to pass the file to the (implied) -Path parameter as an argument rather than piping its content via Get-Content to Format-Hex, because in the latter case Get-Content may already have misinterpreted the file. (See the sketch after this list.)

    • Use Visual Studio Code: while it doesn't try to auto-detect the encoding, it offers a convenient way to re-read the file with different encodings: Click on the name of the encoding near the bottom right corner (status bar; e.g., "UTF-8") and select Reopen with Encoding, then pick an encoding of interest; rinse and repeat, until the text displays correctly.
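
For example, a minimal Format-Hex check (the file name is assumed; the exact output layout differs between PowerShell versions):

# Show the first 16-byte row of the file: EF BB BF at the very start indicates a UTF-8 BOM.
# Without a BOM, an accented letter stored as one byte suggests ANSI; two bytes suggest UTF-8.
Format-Hex -Path .\mexico.csv | Select-Object -First 1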

mklement0

I implemented a bare-minimum example without DB connectivity, as I don't have MS Access. Can you specify the -Encoding parameter to Import-Csv on your computer and see if it works?

Omitting this parameter renders gibberish on my screen.

C??mo est??s

Specifying it explicitly as UTF8 renders this.

Cómo estás

# words.csv has no header row, so -Header word names the single column explicitly.
$lines = Import-Csv -Delimiter "," -Path "words.csv" -Header word -Encoding utf8

foreach ($line in $lines)
{
    # Each row is an object; its value is exposed as the .word property.
    $word = $line.word
    $query = "INSERT INTO voters(word) VALUES('$word')"
    Write-Host $query
}

words.csv

Cómo estás
Dónde estás
quién eres tú
cuando estás llegando
Pranav Negandhi
  • Check the .csv itself. Open it in a UTF-8 compatible text editor like Sublime Text. The file has to be encoded into UTF-8 format too. – Pranav Negandhi Mar 20 '20 at 12:58
  • The problem, I think, is the file: it was ANSI; with UTF-8 it shows OK: `Cómo estás` – bsteo Mar 20 '20 at 13:02
  • Note that @bsteo's `Import-Csv` call in the question _implicitly_ uses `-Encoding Utf8`, thanks to `$PSDefaultParameterValues['*:Encoding'] = 'utf8'`, so this can't be the solution, and indeed isn't, as bsteo has since confirmed. Since there is no functional difference between the code in the question and yours, this answer will confuse future readers. – mklement0 Mar 20 '20 at 15:14