13

In a project I am working on my data is stored in SQL Server, with the collation Danish_Norwegian_CI_AS. The data is output'ed through FreeTDS and ODBC, to python that handles the data as UTF-8. Some of the characters, like å, ø and æ, are not being coded correctly, causing the project progress to grind to a halt.

I spent a couple of hours reading about the confusing world of encodings, collation and code-pages, and feel like I have gotten a better understanding of the entire picture.

Some of the articles I have read, makes me think that it would be possible to: Specify in the SQL select statement, that the collation data should be encoded to UTF-8 when it is output'ed.

The reason I am thinking this is possible is this article which shows an example of how to get to tables, with different collations, to play nice together.

Any pointers in the direction of converting collation to UTF-8 / UTF-16, would be greatly appreciated!

EDIT: I have read that SQL Server provides a unicode option through nchar, nvarchar and ntext, and that the other string variables char, varchar and text are coded according to set collation. I have also read that the above mentioned unicode options are coded in utf-16 variant ucs-2 (I hope I am remembering that right). So; in order to allow tables of locale collation and unicode, to play nice, there should be a conversion function, no?

Rookie
  • 1,590
  • 5
  • 20
  • 34
  • having a similar problem using mySQL and jdbc, this solved my problem. creating the tables using `create table a(.....) DEFAULT CHARSET=utf8;` and connecting to the database using ' `"jdbc:mysql://localhost/testDb?useUnicode=true&characterEncoding=UTF-8";` so i assume there must be a way with MS SQL sever – Skaros Ilias May 16 '15 at 21:58
  • @SkarosIlias Thank you for your input. Unfortunately the table is already created and populated, so that is not an option for me. – Rookie May 16 '15 at 22:02
  • give it a try with a new table and insert some test data just to make sure this works. if it does do an `alter table` – Skaros Ilias May 16 '15 at 22:05
  • You can put per field encoding collation in SqlServer. So you would need a second field with the alternate encoding. I would do this in an in memory temp table with an explicit collation on two fields. The data in the two fields may look the same but one has your different collation. If this works then you can fine tune some more. Also I would try a change of encoding type on the client side independent of Sql. From UTF-8 to binary to Danish Norwegian. – Sql Surfer May 16 '15 at 22:08
  • @SkarosIlias So there is a possibility of altering the entire table? The problem with that is that the database is used by our ERP system. So changing the encoding is not an option :S I have to get output from the collated tables, as utf-8 or utf-16.. – Rookie May 16 '15 at 22:09
  • does the erp system can read the data as it should? with all the special characters? I am no sql or encoding expert but that would seem strange. if data are saved as UTF then the erp AND you could read it in utf8. if they are not in utf8 the erp should read it as such either. – Skaros Ilias May 16 '15 at 22:13
  • maybe for your selects, you can create a temp table with the correct collation, then insert what you want into the temp table. – Jeremy May 16 '15 at 22:19
  • again on mysql yes you [can](https://dev.mysql.com/doc/refman/5.1/en/alter-table.html) i havent done it on SqlServer – Skaros Ilias May 16 '15 at 22:20

2 Answers2

20

4 months on, I finally found the answer to my problem. Turns out it had nothing to do with the FreeTDS driver, or the database collation:

It was pyodbc's connect function, which apparently requires a flag; unicode_results=True

Posted here to help other unfortunate soules doomed to wander aimlessly in the dark, looking for a clue.

Rookie
  • 1,590
  • 5
  • 20
  • 34
  • Thanks. I was wandering aimlessly in the dark and your answer helped me find the [comparable UTF-8 setting in the Microsoft SQLSRV driver](https://learn.microsoft.com/en-us/sql/connect/php/how-to-send-and-retrieve-utf-8-data-using-built-in-utf-8-support). (SQLSRV is a wrapper over the ODBC driver.) I used the connection option, but the driver also supports [setting the encoding on a statement-by-statement basis](https://learn.microsoft.com/en-us/sql/connect/php/constants-microsoft-drivers-for-php-for-sql-server). – Robert Calhoun Apr 17 '18 at 13:32
5

It seems that SQL does not support UTF-8 (see here) but you can try changing the collation in the select like:

SELECT Account COLLATE SQL_Latin1_General_CP1_CI_AS
from Data

You can also strip the accents using this solution: How to remove accents and all chars <> a..z in sql-server?

Another solution could be casting your column to nvarchar

SELECT cast (Account as nvarchar) as NewAccount 
from Data

where Account is varchar on your initial table.

If for example you try:

SELECT cast(cast(N'ţ' as varchar) as nvarchar)

the end result will be "ţ"

Community
  • 1
  • 1
sbiz
  • 321
  • 3
  • 10
  • SQL Server supports UTF-8 since version 2019: https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16#utf8 – Vasiliy Zverev Mar 22 '23 at 12:32