
I have received a very large backup (.BAK) of a database used in Laos, which I restored on a local SQL Server 2014 Express instance. The database has many columns containing Lao text. When the people who use the database access it through the client application (some sort of Thai accounting program), they see all of the Lao data displayed correctly once they set the display font to Saysettha, a Unicode font that includes the Lao characters.

However, I am trying to export the data in the tables from SSMS (to a simple text or CSV file, as the export process needs to be simple and repeatable), and the data is never displayed correctly, not even when I run a plain table query in SSMS.

I know this is probably a problem with the encoding/collation. The text is stored in VARCHAR columns. I have tried casting them to NVARCHAR, which doesn't solve the problem. I also tried collating the columns to the Lao collation (the server collation is set to Thai_CI_AS after restoring the backup):

[Screenshot: SSMS trial]

I tried to save the output table as a CSV and TXT file with different encodings, but when I open them in e.g. Notepad++ I see the same wrong characters.

Unfortunately I don't have a sample of what the text should look like, as the client software is not running on my laptop.

Ideally, I would be able to export the columns in UTF-8 encoding.

marc_s
robberth

1 Answer


My understanding is that Lao text is generally stored in either UTF-8 encoding, UTF-16 encoding, or Code page 1133 encoding.

When the viewing application uses the same encoding that was used when the data was stored, everything looks fine. Unfortunately, sometimes the viewing application isn't told exactly which encoding was used, so it "helpfully" tries to guess, and sometimes it guesses wrong. Wrong guesses are usually obvious to a human looking at the display: the letters aren't even from the right language.
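To make the wrong-guess failure concrete, here is a small Python sketch. The Lao sample word and the choice of code page 874 (the Windows Thai code page, which matches the server's Thai_CI_AS collation) are illustrative assumptions:

```python
# "ລາວ" ("Lao") encoded as UTF-8, then wrongly decoded as Thai code page 874 --
# this is effectively what a viewer does when it guesses the wrong encoding.
lao = "ລາວ"
raw = lao.encode("utf-8")        # the actual bytes on disk
mangled = raw.decode("cp874")    # a viewer guessing Windows-874 (Thai)

# Every Lao character (U+0E80-U+0EFF) came out as Thai letters (U+0E00-U+0E7F):
# wrong language, exactly the "obvious to a human" symptom described above.
assert all(0x0E00 <= ord(c) <= 0x0E7F for c in mangled)
assert all(0x0E80 <= ord(c) <= 0x0EFF for c in lao)
```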

Even worse, when you tell an application to export data from a database, rather than simply exporting the raw bytes of text, the application may "helpfully" convert the data to a different encoding. If the application happens to know the actual encoding of the data in the database, then converting to UTF-16 or UTF-8 while exporting works great; otherwise the exported data is generally mangled and unusable.
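When an exporter applies the wrong conversion, the damage is sometimes reversible: if you can work out which wrong encoding was applied, re-encoding with it and then decoding with the right one undoes the mangling. A hedged Python sketch, again using cp874 as the illustrative wrong guess:

```python
original = "ລາວ"                 # what is really in the database

# Simulate an exporter that read UTF-8 bytes but labelled them cp874:
mangled = original.encode("utf-8").decode("cp874")
assert mangled != original

# Reverse the mistake: encode back with the wrong code page, decode correctly.
recovered = mangled.encode("cp874").decode("utf-8")
assert recovered == original
```

This only works when the wrong decoding was lossless; if cp874 had no mapping for some byte, the exporter would have dropped or substituted characters and the original bytes would be gone.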

Sometimes the most difficult problems to solve are the ones where the system is actually working correctly, but I (incorrectly) think there is a problem. Often this happens because of a flaw in the tools I'm using to inspect it. If you see mostly Lao characters in Notepad++, or you can change the encoding in Notepad++ until you see mostly Lao characters, then I suspect the data in your text or CSV file is probably fine, and the encoding that Notepad++ guessed (or that you set via Encoding -> Encode) is probably correct.
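Cycling through Notepad++'s Encoding menu amounts to trying candidate decodings until one yields mostly-Lao text, and that probe can be scripted. A sketch in Python; the candidate list is an assumption, so add whatever encodings are plausible for your export:

```python
def probe(raw: bytes, candidates=("utf-8", "utf-16-le", "cp874")):
    """Try each candidate encoding; score results by how Lao they look."""
    results = {}
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # raw is not a valid byte stream for this encoding
        lao = sum(0x0E80 <= ord(c) <= 0x0EFF for c in text)
        results[enc] = (lao / max(len(text), 1), text)
    return results

# Sample Lao bytes, assumed to be UTF-8 on disk:
raw = "ສະບາຍດີ".encode("utf-8")
scores = probe(raw)
best = max(scores, key=lambda e: scores[e][0])  # → "utf-8" for this sample
```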

Is there any way for you to see if maybe the data is actually being stored, processed, exported, etc. by the database correctly, and being displayed by the client application correctly, but a rendering glitch in Notepad or SSMS is incorrectly mangling some accent marks?

David Cary
    This was the help I needed. I have been able to export the data through the Bulk Export, where I used the character format ("-c") and explicitly mentioned Microsoft code page [874](https://en.wikipedia.org/wiki/ISO/IEC_8859-11) (by adding "-C 874") which incorporates code page 1133. Although in ASCII, I can read the exported CSV / TXT file. An export to UTF-8 will do now. – robberth Aug 10 '16 at 08:31
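If the bcp export in code page 874 ultimately needs to be UTF-8, a follow-up re-encoding step is straightforward. A minimal Python sketch; the file names are placeholders, and the sample text is Thai because Python's standard codecs include cp874 but not code page 1133:

```python
# Simulate a bcp export in Windows code page 874 (placeholder file names),
# then re-encode it to UTF-8.
sample = "ทดสอบ"  # Thai sample text; cp874 covers Thai
with open("export_cp874.txt", "wb") as f:
    f.write(sample.encode("cp874"))

# The actual conversion step: read as cp874, write back out as UTF-8.
with open("export_cp874.txt", encoding="cp874") as src:
    data = src.read()
with open("export_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(data)

with open("export_utf8.txt", encoding="utf-8") as f:
    assert f.read() == sample
```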