
I am trying to dump the content of a table column from SQL Server 2K into text files, which I then want to process with Python before outputting new text files.

My problem is that I can't get Python to use the correct encoding: while the input files look fine in my text editor, the output files have broken characters, apparently once every two lines.

My python code can be reduced to:

input = open('input', 'r')     # text mode; bytes are not decoded
string = input.read()
# Do stuff
output = open('output', 'w+')  # text mode again
output.write(string)

Printing this string in the Windows shell gives me the expected characters, though separated from one another by one space too many.

But when I open the output file, every other line is broken (though the "added" whitespace has disappeared).

Some context: to dump the column to files, I'm using this script: spWriteStringTofile, which I believe uses the default server encoding.

After some research, it appears that this encoding is SQL_Latin1_General_CP1_CI_AS. I tried adding `# -*- coding: latin_1 -*-` at the beginning of the script, I tried converting the collation inside SQL Server to Latin1_General_CI_AS, and I tried `string.decode('latin_1').encode('utf8')`, but none of it changed anything (except the last attempt, which output only broken characters).

What can I try?


EDIT 2: I tried the `newFile.write(line.decode('utf-16-be').encode('utf-16-le'))` solution, which throws an error at the first line of my file. From the Python GUI:

(Pdb) print line
ÿþ

(Pdb) print repr(line)
'\xff\xfe\n'
(Pdb) line.decode('utf-16-be').encode('utf-16-le')
*** UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 2: truncated data

Only a newline appears in Sublime Text 2 for this first line...

When I bypass it (`try: ... except: pass`, quick & dirty), a newline is added between correct and incorrect lines, but the broken characters are still there.


EDIT: I went through the document line by line

newFile = open('newfile', 'a+')
with open('input') as fp:
    for line in fp:
        import pdb
        pdb.set_trace()  # break on every line to inspect it
        newFile.write(line)

In the pdb, on a faulty line:

(Pdb) print line
                           a s  S o l d D e b i t o r , # <-- Not actual copy paste
(Pdb) print repr(line)
'\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'

However, for some reason I couldn't copy/paste the `print line` value: I can copy the individual alphabetic characters but not when I select the "whitespace" that is between them...


Input:

r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda 
    then isnull(Debit,0) else isnull(DevDebit,0) end)
    else 0 end) 
     - Sum(case when DataInr >= @BeginDate and DataInr <= @Data2 
       then  (case when @Deviza='' or @Deviza=@sMoneda 
       then  isnull(Credit,0) else isnull(DevCredit,0) end)
       else 0 end) 
       else 0 end
    as SoldDebitor,

Output:

r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda 
            then  isnull(Debit,0) else isnull(DevDebit,0) end)
਍ऀ                       攀氀猀攀   攀渀搀⤀ ഀഀ
      - Sum(case when DataInr >= @BeginDate and DataInr <= @Data2 
            then  (case when @Deviza='' or @Deviza=@sMoneda
            then  isnull(Credit,0) else isnull(DevCredit,0) end)
਍ऀ                       攀氀猀攀   攀渀搀⤀ ഀഀ
        else 0 end
਍ऀ                 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀഀ
Robin
  • Note: the `# -*- coding: ...` at the beginning of the script doesn't affect the codecs used for I/O at all. It's only there to tell everyone (including the compiler and automated tools) the encoding of the source code in the module (e.g. for literals). – Ricardo Cárdenes Mar 14 '14 at 12:00
  • Can you show us the result of `print repr(string)` from right where you read it, and right where you write it, for an affected line, please? – Martijn Pieters Mar 14 '14 at 12:07
  • @MartijnPieters: Edited with the info, I read and write basically at the same place. – Robin Mar 14 '14 at 12:25

1 Answer


Your corrupted data is UTF-16, using big-endian byte order:

>>> line = '\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
>>> line.decode('utf-16-be')
u'\t                 as SoldDebitor,\r\n'

but whatever is reading your file again is interpreting the data as UTF-16 in little-endian byte order instead:

>>> print line.decode('utf-16-le')
ऀ                 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀ਀

That's most likely because you didn't include a BOM at the start of the file, or you mangled the input data.
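If in doubt, you can sniff the BOM yourself before decoding. A minimal sketch (reusing the `input` filename from the question; it just reports what the first two bytes say):

import codecs

# Read the first two bytes in binary mode so nothing gets translated.
with open('input', 'rb') as f:
    bom = f.read(2)

if bom == codecs.BOM_UTF16_LE:      # '\xff\xfe'
    print 'utf-16-le'
elif bom == codecs.BOM_UTF16_BE:    # '\xfe\xff'
    print 'utf-16-be'
else:
    print 'no BOM; the byte order has to be known some other way'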

You really should not be reading UTF-16 data in text mode without decoding it, as newlines encoded in two bytes are almost guaranteed to be mangled, leading to off-by-one byte order errors, which can leave every other line (or almost every other line) garbled.
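Here is a minimal demonstration of that off-by-one effect, using made-up sample text:

# UTF-16-LE encodes u'\n' as '\n\x00'; splitting the raw bytes on the
# single byte '\n', which is effectively what text-mode iteration does,
# strands the trailing '\x00' at the start of the next line.
raw = u'as SoldDebitor,\r\nnext line\r\n'.encode('utf-16-le')

parts = raw.split('\n')
second = parts[1] + '\n'                # what `for line in fp` would yield

print repr(second)                      # starts with '\x00': alignment is off
print repr(second.decode('utf-16-be'))  # u'next line\r\n' -- now *looks* BE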

Use io.open() to read unicode data instead:

import io

with io.open('input', 'r', encoding='utf16') as infh:
    string = infh.read()

# Do stuff

with io.open('output', 'w+', encoding='utf16') as outfh:
    outfh.write(string)

because it appears your input file already has a UTF-16 BOM.

This does mean the rest of your code needs to be adjusted to handle Unicode strings instead of byte strings as well.
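For example, a hypothetical line-by-line variant (the tab replacement is only a stand-in for whatever the "# Do stuff" part does):

import io

# io.open yields unicode objects, so the processing code should use
# u'' literals and only encode back to bytes at the boundaries.
with io.open('input', 'r', encoding='utf16') as infh, \
        io.open('output', 'w', encoding='utf16') as outfh:
    for line in infh:
        outfh.write(line.replace(u'\t', u'    '))  # stand-in transformation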

Martijn Pieters
  • Thanks for your answer. I tried the `.decode('utf-16-be').encode('utf-16-le')` solution: **1**: at the beginning of the file it throws a `UnicodeDecodeError` for truncated data, see question update in a minute; **2**: if I bypass that, the rest of the document is treated without error, but the output file is the same (when viewed with Sublime Text 2, no special config) – Robin Mar 14 '14 at 12:54
  • @Robin: so your input file contains mixed data; that'll be much harder to repair. Some of it is using little endian, some is using big endian. You could try a `try:`, `except UnicodeDecodeError:` handler here, writing the data untreated when the exception is raised, but that's a *pray it works* solution to a thorny, messy situation. – Martijn Pieters Mar 14 '14 at 12:57
  • I read here: [encoding SQL_Latin1_General_CP1_CI_AS](http://stackoverflow.com/questions/10124930/encoding-sql-latin1-general-cp1-ci-as-into-utf-8) that this charset is a mix between CP-1252 and UTF-8... The Unicode error is thrown only on the first two lines, so I'm afraid the try/except won't affect the overall situation (and I tried, and it didn't) :/ – Robin Mar 14 '14 at 13:13
  • @Robin: right, so this is a mix of Latin-1 and UTF-16 then, not UTF-8. Any `\x00` bytes in the data are a dead giveaway then, especially when you have mostly Latin-1 text anyway. – Martijn Pieters Mar 14 '14 at 13:17
  • @Robin: but if SublimeText is displaying Chinese glyphs, then it was reading the whole file as UTF-16, it doesn't ever mix encodings for different lines. Any Latin-1 (CP-1252) data would look garbled too. – Martijn Pieters Mar 14 '14 at 13:19
  • `'\xff\xfe'` is the UTF-16 BOM with a single newline character attached, btw. That's probably because you opened the input file in text mode, which could explain a lot of other problems too; I think your input file is encoded just fine but you are reading it wrong. – Martijn Pieters Mar 14 '14 at 13:22
  • @Robin: With UTF-16 LE (which your BOM suggests your input file is encoded as) line endings are encoded as `\n\x00`, but reading the file without decoding it properly will cut off the `\x00` and leave it for the next line. So the next line now has one byte too many, causing that line to *look* like UTF-16 BE data instead. If the file *also* uses `\r\n` line endings this gets doubly messy, as Python will translate `\r` to `\n`, read a null, read the newline, etc., skipping nulls and adding them onto the next lines willy-nilly. – Martijn Pieters Mar 14 '14 at 13:29