
I want to use https://pypi.org/project/pyclibrary/ to parse some .h files.

Some of those .h files are unfortunately not UTF-8 encoded. Notepad++ reports them as "ANSI" encoded (and as they originate on Windows, I guess that means CP-1252? Not sure ...)

Anyway, I can reduce the problem to this example:

mytest.h:

/*******************************************************
Just a test header file
© Copyright myself
*******************************************************/

#ifndef _MY_TEST_
#define _MY_TEST_
#endif

The tricky part here is the copyright character. Just to make sure, here is a hexdump of the file:

$ hexdump -C mytest.h
00000000  2f 2a 2a 2a 2a 2a 2a 2a  2a 2a 2a 2a 2a 2a 2a 2a  |/***************|
00000010  2a 2a 2a 2a 2a 2a 2a 2a  2a 2a 2a 2a 2a 2a 2a 2a  |****************|
*
00000030  2a 2a 2a 2a 2a 2a 2a 2a  0d 0a 4a 75 73 74 20 61  |********..Just a|
00000040  20 74 65 73 74 20 68 65  61 64 65 72 20 66 69 6c  | test header fil|
00000050  65 0d 0a a9 20 43 6f 70  79 72 69 67 68 74 20 6d  |e... Copyright m|
00000060  79 73 65 6c 66 0d 0a 2a  2a 2a 2a 2a 2a 2a 2a 2a  |yself..*********|
00000070  2a 2a 2a 2a 2a 2a 2a 2a  2a 2a 2a 2a 2a 2a 2a 2a  |****************|
*
00000090  2a 2a 2a 2a 2a 2a 2a 2a  2a 2a 2a 2a 2a 2a 2f 0d  |**************/.|
000000a0  0a 0d 0a 23 69 66 6e 64  65 66 20 5f 4d 59 5f 54  |...#ifndef _MY_T|
000000b0  45 53 54 5f 0d 0a 23 64  65 66 69 6e 65 20 5f 4d  |EST_..#define _M|
000000c0  59 5f 54 45 53 54 5f 0d  0a 23 65 6e 64 69 66 0d  |Y_TEST_..#endif.|
000000d0  0a                                                |.|
000000d1
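The CP-1252 guess can be checked against the dump itself. Below is a minimal sketch that decodes the row at offset 0x50 (the one containing the a9 byte) under a few candidate codecs; the helper name try_decode and the choice of candidates are mine, for illustration only:

```python
# Bytes copied from the hexdump row at offset 0x50
row = bytes.fromhex("65 0d 0a a9 20 43 6f 70 79 72 69 67 68 74 20 6d")

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Try each candidate codec on the same bytes
results = {enc: try_decode(row, enc) for enc in ("utf-8", "cp1252", "latin-1")}
for enc, text in results.items():
    print(enc, repr(text))
```

UTF-8 rejects the row, while CP-1252 and Latin-1 both turn a9 into ©. So either single-byte encoding would do for this particular file; the two only differ in the 0x80-0x9f range.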

And then I try this Python script:

mytest.py:

#!/usr/bin/env python3

import sys, os
from pyclibrary import CParser

myhfile = "mytest.h"
c_parser = CParser([myhfile])
print(c_parser)

When I run this, I get:

$ python3 mytest.py
Traceback (most recent call last):
  File "mytest.py", line 7, in <module>
    c_parser = CParser([myhfile])
  File "/usr/lib/python3.8/site-packages/pyclibrary/c_parser.py", line 443, in __init__
    self.load_file(f, replace)
  File "/usr/lib/python3.8/site-packages/pyclibrary/c_parser.py", line 678, in load_file
    self.files[path] = fd.read()
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 83: invalid start byte
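The position in that message can be cross-checked without pyclibrary. This is only a sketch: the bytes are rebuilt by hand from the start of the hexdump above, and first_invalid_utf8 is a hypothetical helper name, not part of any library:

```python
def first_invalid_utf8(data):
    """Return (offset, byte) of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None

# Start of mytest.h, rebuilt from the hexdump: "/" plus 55 "*", then two CRLF-terminated lines
head = b"/" + b"*" * 55 + b"\r\nJust a test header file\r\n\xa9 Copyright myself\r\n"

hit = first_invalid_utf8(head)
if hit is not None:
    offset, byte = hit
    print(f"offset {offset} (0x{offset:x}): byte 0x{byte:02x}")
    # prints: offset 83 (0x53): byte 0xa9
```

For a real file, the same function can be fed the result of open(path, "rb").read().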

... and I guess the "byte 0xa9 in position 83" is the copyright character: position 83 is offset 0x53 in the hexdump, which is exactly where the a9 byte sits. So, the way I see it:

  • I don't really have an option to choose the file encoding in pyclibrary, and I don't want to hack pyclibrary either
  • I don't really want to edit the .h files to make them UTF-8 compatible, either

... and so, the only thing I can think of is to change the default encoding of Python (the one used when opening files) to ANSI/CP-1252/whatever, only for the call to c_parser = CParser([myhfile]), and then restore the default UTF-8.

Is this possible to do somehow? I have seen "Changing default encoding of Python?", but most of the answers there seem to imply that you had better just change the default encoding once, at the start of the script. I cannot find any references to changing the default encoding temporarily, and then restoring the original UTF-8 default later.
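For comparison, one approach that touches neither pyclibrary nor the original .h files would be to parse a temporary UTF-8 copy instead. A minimal sketch, assuming the source really is CP-1252; utf8_copy is a made-up helper name, and the CParser usage is left commented out because I have not verified how pyclibrary reports the temporary file name:

```python
import os
import tempfile

def utf8_copy(path, source_encoding="cp1252"):
    """Write a UTF-8 re-encoded copy of `path` to a temp file and return its name."""
    # newline="" keeps the original CRLF line endings untouched
    with open(path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    fd, tmp_path = tempfile.mkstemp(suffix=".h")
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return tmp_path

# Intended usage (assumes pyclibrary is installed):
# from pyclibrary import CParser
# tmp = utf8_copy("mytest.h")
# try:
#     c_parser = CParser([tmp])
# finally:
#     os.remove(tmp)
```

The drawback is that, judging by the traceback, the parser stores the file contents under the path it was given, so results would be associated with the temporary name rather than with mytest.h.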

sdbbs

1 Answer


OK, I think I got it. I found this thread: "Windows Python: Changing encoding using the locale module", which inspired me to try the locale module. Note that I'm working in MSYS2 bash on Windows, and as such, I use the MSYS2 Python3. So, now the file is:

mytest.py:

#!/usr/bin/env python3

import locale
from pyclibrary import CParser

myhfile = "mytest.h"
print( locale.getlocale() ) # ('en_US', 'UTF-8')

# Temporarily switch to a single-byte locale, so that the file read
# inside pyclibrary no longer tries to decode the header as UTF-8
locale.setlocale( locale.LC_ALL, 'en_US.ISO8859-1' )
try:
    c_parser = CParser([myhfile])
    print(c_parser)
finally:
    # Restore the UTF-8 locale, even if parsing fails
    locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )

print( locale.getlocale() ) # ('en_US', 'UTF-8')

... and running this produces:

$ python3 mytest.py
('en_US', 'UTF-8')
============== types ==================
{}
============== variables ==================
{}
============== fnmacros ==================
{}
============== macros ==================
{'_MY_TEST_': ''}
============== structs ==================
{}
============== unions ==================
{}
============== enums ==================
{}
============== functions ==================
{}
============== values ==================
{'_MY_TEST_': None}

('en_US', 'UTF-8')

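Well - this looks OK to me: the UnicodeDecodeError is gone, and the _MY_TEST_ macro shows up in the parser output. Note that ISO8859-1 is not identical to CP-1252, but both map the byte 0xa9 to ©, so it makes no difference for this file.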
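The set/restore pair can also be wrapped in a context manager, so the previous locale comes back even if the parser raises. This is a sketch only: temporary_locale is a hypothetical name, and I restrict it to LC_CTYPE on the assumption that this is the category that determines the text encoding. The demo uses the "C" locale simply because it should exist everywhere:

```python
import contextlib
import locale

@contextlib.contextmanager
def temporary_locale(name, category=locale.LC_CTYPE):
    """Switch `category` to locale `name` inside the with-block, then restore it."""
    saved = locale.setlocale(category)   # no second argument: query only
    locale.setlocale(category, name)     # raises locale.Error if `name` is unavailable
    try:
        yield
    finally:
        locale.setlocale(category, saved)

before = locale.setlocale(locale.LC_CTYPE)
with temporary_locale("C"):
    inside = locale.setlocale(locale.LC_CTYPE)
after = locale.setlocale(locale.LC_CTYPE)
print(before, inside, after)
```

With pyclibrary this would become something like: with temporary_locale('en_US.ISO8859-1'): c_parser = CParser([myhfile]). Two caveats: setlocale() affects the whole process and is not thread-safe, and whether open() follows the locale depends on the platform and on how Python was started (in UTF-8 mode, for example, the locale is ignored for this purpose). It worked for me with the MSYS2 Python 3.8; I have not checked other setups.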

sdbbs