1

I'm trying to write a table from a .csv file with Hebrew text in it to an sql server database.
the table is valid and pandas reads the data correct (even displays the hebrew properly in pycharm),
but when i try to write it to a table in the database i get question marks ("???") where the Hebrew should be.

this is what i've tried, using pandas and sqlalchemy:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mssql+pymssql://server/test?charset=utf8')
connection = engine.connect()

df = pd.read_csv("temp.csv", low_memory=False, encoding="UTF-8")
table_name = "test"
df.to_sql(table_name, connection, index=False, if_exists="append")

this loads the table properly but fails to write the Hebrew,
any suggestions?

Dror Bogin
  • 453
  • 4
  • 13
  • What is the datatype of your column? – Thom A Nov 25 '18 at 13:58
  • its read into a string in python and written as text into sql server. – Dror Bogin Nov 25 '18 at 14:00
  • Yes, but what is the datatype of your column? "String" is not a SQL Server data type. – Thom A Nov 25 '18 at 14:02
  • as i've answered TEXT in sql server. – Dror Bogin Nov 25 '18 at 14:03
  • `text` has been deprecated since 2005, and doesn't support non-ansi characters. The unicode equivalent would be `ntext`, however, that too is deprecated. Change your column's datatype to `nvarchar(MAX)` in you need to store 4000+ character values with non-ansi characters. – Thom A Nov 25 '18 at 14:05
  • how can i define that with pandas.to_csv? i use that to create the tables as well as insert data – Dror Bogin Nov 25 '18 at 14:07
  • You need to change the columns definition in SQL Server, not in pandas. `varchar`, `char`, and the deprecated `text`, will lose the value of any non-ANSI characters and store the value `'?'`. For example `DECLARE @v varchar(10), @n nvarchar(10); SET @v = N'ありがとう'; SET @n = N'ありがとう'; SELECT @v, @n;` Notice that one of those has lost the value of every character. – Thom A Nov 25 '18 at 14:09

1 Answers1

4

You need to change the datatype of your column. text is deprecated, and varchar(MAX) should be used instead, however, neither can store unicode characters. To store unicode characters you would need to use ntext, which is also deprecated; you need to use nvarchar(MAX).

To change your column's definition, you can use this pseudo-SQL (You'll need to replace the parts in braces ({}) with the appropriate object names):

ALTER TABLE {YourTable} ALTER COLUMN {YourColumn} nvarchar(MAX);

Edit: note, this will not restore any data lost in your column. Once a non-ANSI character is inserted into a varchar (or similar) datatype the data is immediately lost and cannot be recovered apart from by changing the datatype and reentry.

Thom A
  • 88,727
  • 11
  • 45
  • 75