I want to use https://pypi.org/project/pyclibrary/ to parse some .h files.
Some of those .h files are unfortunately not UTF-8 encoded. Notepad++ tells me they are "ANSI" encoded (and as they originate on Windows, I guess that means CP-1252, but I am not sure).
Anyway, I can reduce the problem to this example:
mytest.h:
/*******************************************************
Just a test header file
© Copyright myself
*******************************************************/
#ifndef _MY_TEST_
#define _MY_TEST_
#endif
The tricky part here is the copyright character - and just to make sure, here is a hexdump of this:
$ hexdump -C mytest.h
00000000 2f 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |/***************|
00000010 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00000030 2a 2a 2a 2a 2a 2a 2a 2a 0d 0a 4a 75 73 74 20 61 |********..Just a|
00000040 20 74 65 73 74 20 68 65 61 64 65 72 20 66 69 6c | test header fil|
00000050 65 0d 0a a9 20 43 6f 70 79 72 69 67 68 74 20 6d |e... Copyright m|
00000060 79 73 65 6c 66 0d 0a 2a 2a 2a 2a 2a 2a 2a 2a 2a |yself..*********|
00000070 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00000090 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2f 0d |**************/.|
000000a0 0a 0d 0a 23 69 66 6e 64 65 66 20 5f 4d 59 5f 54 |...#ifndef _MY_T|
000000b0 45 53 54 5f 0d 0a 23 64 65 66 69 6e 65 20 5f 4d |EST_..#define _M|
000000c0 59 5f 54 45 53 54 5f 0d 0a 23 65 6e 64 69 66 0d |Y_TEST_..#endif.|
000000d0 0a |.|
000000d1
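As a quick sanity check of my CP-1252 guess, here is a small sketch that only uses bytes copied from the hexdump row at offset 0x50 (it does not read the file). It assumes CP-1252 is the right codec; Latin-1 maps 0xa9 to the same character, so this does not tell the two apart.

```python
# First 14 bytes of the hexdump row that starts at offset 0x50.
row_offset = 0x50
row = bytes.fromhex("65 0d 0a a9 20 43 6f 70 79 72 69 67 68 74")

# Absolute position of the 0xa9 byte within the file.
position = row_offset + row.index(0xA9)

# CP-1252 maps 0xa9 to the copyright sign.
as_cp1252 = row.decode("cp1252")

# UTF-8 rejects the same bytes: 0xa9 is only valid as a continuation byte.
try:
    row.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

print(position, repr(as_cp1252), utf8_error)
```

So the 0xa9 byte sits at decimal offset 83 in the file, and it decodes cleanly as CP-1252 but not as UTF-8.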
And then I try this Python script:
mytest.py:
#!/usr/bin/env python3
import sys, os
from pyclibrary import CParser
myhfile = "mytest.h"
c_parser = CParser([myhfile])
print(c_parser)
When I run this, I get:
$ python3 mytest.py
Traceback (most recent call last):
File "mytest.py", line 7, in <module>
c_parser = CParser([myhfile])
File "/usr/lib/python3.8/site-packages/pyclibrary/c_parser.py", line 443, in __init__
self.load_file(f, replace)
File "/usr/lib/python3.8/site-packages/pyclibrary/c_parser.py", line 678, in load_file
self.files[path] = fd.read()
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 83: invalid start byte
... and I guess, the "byte 0xa9 in position 83" is the copyright character. So, the way I see it:
- I don't really have an option to choose the file encoding in pyclibrary
- but I don't want to hack pyclibrary either
- I don't really want to edit the .h files either, and make them UTF-8 compatible
... and so, the only thing I can think of is to change the default encoding of Python (while opening files) to ANSI/CP-1252/whatever, only for the call to c_parser = CParser([myhfile]), and then restore the default UTF-8.
Is this possible to do somehow? I have seen "Changing default encoding of Python?", but most of the answers there seem to imply that you should just change the default encoding once, at the start of the script. I cannot find any references to changing the default encoding temporarily, and then restoring the original UTF-8 default later.
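To make "temporarily" concrete, here is a rough sketch of the kind of thing I have in mind. It assumes that pyclibrary reads the file through the built-in open() in text mode without passing an encoding, which I have not verified. The helper name temporary_open_encoding is made up for illustration, and the temp file stands in for the real CParser([myhfile]) call.

```python
import builtins
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_open_encoding(encoding):
    """Hypothetical helper: make text-mode open() default to `encoding`."""
    original_open = builtins.open

    def patched_open(file, mode="r", *args, **kwargs):
        # Positional args after mode are (buffering, encoding, ...), so
        # fewer than two of them means no encoding was passed positionally.
        if "b" not in mode and "encoding" not in kwargs and len(args) < 2:
            kwargs["encoding"] = encoding
        return original_open(file, mode, *args, **kwargs)

    builtins.open = patched_open
    try:
        yield
    finally:
        # Always restore the real open(), even if the body raised.
        builtins.open = original_open


# Stand-in for mytest.h: a header containing the raw 0xa9 byte.
fd, path = tempfile.mkstemp(suffix=".h")
with os.fdopen(fd, "wb") as f:
    f.write(b"/* \xa9 Copyright myself */\r\n")

try:
    with temporary_open_encoding("cp1252"):
        # Assumption: the library does something like this internally.
        with open(path) as f:
            text = f.read()
finally:
    os.remove(path)

print(repr(text))
```

This swaps a process-wide name, so it is not thread-safe, and it would have no effect if the library opened files some other way (for example via io.open or pathlib). Is there a cleaner or more official way to get the same effect?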