I have a string containing a non-ASCII character (an o with an umlaut, ö), and it's throwing off a DBF library I am using: Ethan Furman's Python DBF library (https://pypi.python.org/pypi/dbf). The failure is in its retrieve_character() function, where the last line raises: 'ascii' codec can't decode byte 0xf6 in position 6: ordinal not in range(128).

The code:

def retrieve_character(bytes, fielddef, memo, decoder):
    """
    Returns the string in bytes as fielddef[CLASS] or fielddef[EMPTY]
    """
    data = bytes.tostring()
    if not data.strip():
        cls = fielddef[EMPTY]
        if cls is NoneType:
            return None
        return cls(data)
    if fielddef[FLAGS] & BINARY:
        return data
    return fielddef[CLASS](decoder(data)[0]) #error on this line
user2680039
  • I think the answer you're looking for is on this page: http://stackoverflow.com/questions/6180521/unicodedecodeerror-utf8-codec-cant-decode-bytes-in-position-3-6-invalid-dat – Gary Sep 04 '13 at 15:38
  • Technically, ASCII only covers 7-bit values from 0 to 127; how to interpret high-half values has always been contentious. These days, UTF-8 (which is backwards-compatible with ASCII) has essentially supplanted it. – chrylis -cautiouslyoptimistic- Sep 04 '13 at 15:38
  • That looks incredibly more complicated than what you were asking in the original question. Also, bytes is a reserved word and has no tostring method. data.strip() returns a string so your `if not data.strip()` line probably doesn't work how you think it should...your FLAGS & BINARY line is a boolean operation, did you mean it that way? Why did you use NoneType then return None? what is `decoder`? What is the 0th element? – blakev Sep 04 '13 at 16:06

3 Answers

dbf files have a codepage attribute. It sounds like it has not been correctly set with your file. Do you know which code page was used to create the data? If so, you can override the dbf's setting when you open the file:

table = dbf.Table('dbf_file', codepage='cp437')

cp437 is just an example -- use whatever is appropriate.

To see the current codepage of a dbf file (assuming you didn't override on opening) use:

table.codepage

If you specify the wrong codepage when you open the file, then the non-ascii data could be incorrect (e.g. your o with umlaut may end up as an n with tilde).
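
To illustrate why the choice of code page matters, here is a minimal sketch that uses only Python's built-in codecs, not the dbf library. The sample bytes are made up; only the 0xf6 byte comes from the error message in the question.

```python
# Hypothetical field contents; 0xf6 is the byte from the error message.
raw = b"Schr\xf6der"


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decode the same bytes under several candidate code pages.
results = {enc: try_decode(raw, enc) for enc in ("ascii", "cp437", "cp850", "cp1252")}
for enc, text in results.items():
    print(enc, repr(text))
```

With Python's codecs, cp1252 turns the byte into ö, cp437 and cp850 turn it into ÷, and ascii rejects it. A code page that merely makes the error go away is not necessarily the right one, so it is worth comparing a few decoded values against what you know the data should say.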

Ethan Furman
  • Funnily enough, using 'cp437' suppressed the error, but I am not sure if the program now works properly. Is there a problem with using the potentially wrong codepage value? – user2680039 Sep 04 '13 at 16:29
  • @user2680039: Updated answer. – Ethan Furman Sep 04 '13 at 17:05
  • It says "ascii (plain ol' ascii)" – user2680039 Sep 04 '13 at 17:07
  • Which is what I expected since you were getting ascii errors. What program created the file, and what locale was in effect at the time? – Ethan Furman Sep 04 '13 at 17:22
  • I'm not sure what program created them specifically -- and I'm not sure what you mean by "locale". However I compared the fields and it seems like using "cp437" works. When I output "data" they look the same in both cases (umlaut included) only now it doesn't throw an error. I'll go ahead and mark as answer. – user2680039 Sep 04 '13 at 17:25

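Have you tried using unicodeData.encode('ascii', 'ignore')? This drops any character that has no ASCII equivalent, so the o with umlaut is removed from the result rather than converted to a plain o, and no encoding error is raised. Note that it works on text that has already been decoded, so it does not help with the decode step that fails in the question.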
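
A quick way to check what the 'ignore' handler actually does with ö is the sketch below. It assumes Python 3 and a made-up sample string. The second variant, which decomposes the text with unicodedata.normalize before encoding, is an addition for illustration and is not part of the suggestion above. Both variants lose information.

```python
import unicodedata

text = "Schr\u00f6der"  # hypothetical sample value containing ö

# 'ignore' silently drops characters that ASCII cannot represent.
dropped = text.encode("ascii", "ignore")

# Decomposing first (NFKD) splits ö into 'o' plus a combining diaeresis;
# 'ignore' then drops only the combining mark and keeps the base letter.
stripped = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")

print(dropped)
print(stripped)
```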

blakev

Here is my approach. The DBF code pages are listed at http://dbf-software.com/dbf-file-encoding.html; you can use re.findall to extract all the code page numbers from that list.

Windows Encodings:
874 Thai Windows
932 Japanese Windows
936 Chinese (PRC, Singapore) Windows
949 Korean Windows
950 Chinese (Hong Kong SAR, Taiwan) Windows
1250 Eastern European Windows
1251 Russian Windows
1252 Windows ANSI
1253 Greek Windows
1254 Turkish Windows
1255 Hebrew Windows
1256 Arabic Windows
MS-DOS Encodings:
437 U.S. MS-DOS
620 Mazovia (Polish) MS-DOS
737 Greek MS-DOS (437G)
850 International MS-DOS
852 Eastern European MS-DOS
857 Turkish MS-DOS
861 Icelandic MS-DOS
865 Nordic MS-DOS
866 Russian MS-DOS
895 Kamenicky (Czech) MS-DOS
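
The re.findall step mentioned above could look roughly like this. It is a sketch under assumptions: the listing string holds only a few lines copied from the list, and the filter through Python's codecs module is an addition for illustration. Whether the dbf library accepts exactly the same names is not confirmed here.

```python
import codecs
import re

# A few lines copied from the list above; paste the full list in practice.
listing = """\
1252 Windows ANSI
437 U.S. MS-DOS
850 International MS-DOS
866 Russian MS-DOS
"""

# Each entry starts with the numeric code page at the beginning of the line.
numbers = re.findall(r"^(\d+)\s", listing, flags=re.MULTILINE)


def python_knows(number):
    """Return True if Python ships a codec named cp<number>."""
    try:
        codecs.lookup("cp" + number)
        return True
    except LookupError:
        return False


# Keep only the code pages that Python itself can decode.
codepage_list = [n for n in numbers if python_knows(n)]
print(codepage_list)
```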

Pseudo-code:

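import dbf

# Candidate code pages to try, taken from the list above
codepage_list = ['936', '437']  # ... add further candidates as needed

for codepage in codepage_list:
    table = dbf.Table('mydbf.dbf', codepage='cp{}'.format(codepage))
    table.open(dbf.READ_WRITE)
    try:
        # Reading every row forces each character field to be decoded
        for row in table:
            print(row)
    except UnicodeDecodeError:
        print('wrong codepage', codepage)
    finally:
        # Close the table whether or not decoding succeeded
        table.close()

# Note: a code page that decodes without error is not necessarily the
# right one; check the printed values before trusting it.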
Suraj Rao
bijiofzxx