I am trying to dump the content of a table column from SQL Server 2K into text files, that I want to later treat with Python and output new text files.
My problem is that I can't get python to use the correct encoding, and while the input files appear fine on my text editor the output ones have broken characters apparently once every two lines.
My python code can be reduced to:
input = open('input', 'r')
string = input.read()
# Do stuff
output = open('output', 'w+')
output.write(string)
Printing this string in the windows shell gives me the expected characters, though separated from each other by one space too many.
But when I open the output file, once every two line everything is broken (though the "added" whitespaces have disappeared)
Some context: To dump the column to files, I'm using this script: spWriteStringTofile
which I believe is using the default server encoding.
After some research, it appears that this encoding is SQL_Latin1_General_CP1_CI_AS
. I tried adding # -*- coding: latin_1 -*
at the beginning of the script, I tried converting the encoding inside SQL Server to Latin1_General_CI_AS
, I tried to string.decode('latin_1').encode('utf8')
but it didn't change a thing (except the last attempt which output only broken characters).
What can I try ?
EDIT 2: I tried the newFile.write(line.decode('utf-16-be').encode('utf-16-le'))
solution, with throws an error at the first line of my file. From the python GUI:
(Pdb) print line
ÿþ
(Pdb) print repr(line)
'\xff\xfe\n'
(Pdb) line.decode('utf-16-be').encode('utf-16-le')
*** UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 2: truncated data
Only a newline appear in Sublime Text 2 for this first line...
When I bypass it (try: ... except: pass
, quick&dirty), a newline is added between correct and incorrect lines, but the broken characters are still here.
EDIT: I went through the document line by line
newFile = open('newfile', 'a+')
with open('input') as fp:
for line in fp:
import pdb
pdb.set_trace()
newFile.write(line)
In the pdb, on a faulty line:
(Pdb) print line
a s S o l d D e b i t o r , # <-- Not actual copy paste
(Pdb) print repr(line)
'\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
However for some reason I couldn't copy/paste the print line
value: I can copy the individual alphabetic character but not when I select the "whitespace" that is betzeen them...
Input:
r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Debit,0) else isnull(DevDebit,0) end)
else 0 end)
- Sum(case when DataInr >= @BeginDate and DataInr <= @Data2
then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Credit,0) else isnull(DevCredit,0) end)
else 0 end)
else 0 end
as SoldDebitor,
Output:
r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Debit,0) else isnull(DevDebit,0) end)
ऀ 攀氀猀攀 攀渀搀⤀ ഀഀ
- Sum(case when DataInr >= @BeginDate and DataInr <= @Data2
then (case when @Deviza='' or @Deviza=@sMoneda
then isnull(Credit,0) else isnull(DevCredit,0) end)
ऀ 攀氀猀攀 攀渀搀⤀ ഀഀ
else 0 end
ऀ 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀഀ