My issue was similar in that UTF-8 text was getting passed to the Python script.
In my case, it came from SQL via the sp_execute_external_script stored procedure in SQL Server Machine Learning Services. For whatever reason, VARCHAR data appears to be treated as UTF-8, whereas NVARCHAR data gets passed as UTF-16.
Since there's no way to specify the default encoding in Python, and no user-editable Python statement parsing the data, I had to use the SQL CONVERT() function in the SELECT query passed in the @input_data_1 parameter.
So, while this query
EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, text FROM the_error;'
WITH RESULT SETS (([id] int, [text] nvarchar(max)));
gives the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: unexpected end of data
using CONVERT(type, data) (CAST(data AS type) would also work)
EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, CONVERT(NVARCHAR(max), text) FROM the_error;'
WITH RESULT SETS (([id] INT, [text] NVARCHAR(max)));
returns
id text
1 Ç
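The error message is consistent with a single code-page byte being decoded as UTF-8. Here is a minimal sketch of that in plain Python, outside SQL Server. It assumes the VARCHAR column stores Ç as the single byte 0xC7 in a Windows-1252 style code page; the actual collation isn't stated above, and the variable names are only illustrative.

```python
# Assumption: the VARCHAR column holds U+00C7 as one byte (0xC7) in a
# Windows-1252 style code page; the real collation is not stated above.
raw = b"\xc7"

# Decoding that lone byte as UTF-8 fails: 0xC7 announces a two-byte
# sequence, and no continuation byte follows it.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# Decoding with the assumed code page recovers the character.
decoded = raw.decode("cp1252")

# The same character as UTF-16 little-endian (NVARCHAR style) and as UTF-8.
as_utf16 = decoded.encode("utf-16-le")
as_utf8 = decoded.encode("utf-8")

print(utf8_error)
print(ascii(decoded), as_utf16, as_utf8)
```

If that assumption holds, converting to NVARCHAR on the SQL side sidesteps the problem because the bytes handed to Python are then unambiguous UTF-16 rather than code-page bytes.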