While importing data from a flat file, I noticed some embedded hex values in the string (<0x00>, <0x01>).

I want to replace them with specific characters, but I have been unable to do so. Removing them does not work either. What it looks like in the exported flat file: https://i.stack.imgur.com/qxiEl.png Another example: https://i.stack.imgur.com/NJR8G.png


This is what I've tried (note that <0x01> stands for a non-editable entity that cannot be reproduced here):

import io
with io.open('1.txt', 'r+', encoding="utf-8") as p:
    s=p.read()
# included in case it bears any significance
import re
import binascii

s = "Some string with hex: <0x01>"

s = s.encode('latin1').decode('utf-8')
# throws e.g.: >>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 114: invalid start byte

s = re.sub(r'<0x01>', r'.', s)
s = re.sub(r'\\0x01', r'.', s)
s = re.sub(r'\\\\0x01', r'.', s)
s = s.replace('\0x01', '.')
s = s.replace('<0x01>', '.')
s = s.replace('0x01', '.')

Or something along these lines, in the hope of getting a handle on it while iterating through the whole string:

for x in s:
    try:
        base64.encodebytes(x)
        base64.decodebytes(x)
        s.strip(binascii.unhexlify(x))
        s.decode('utf-8')
        s.encode('latin1').decode('utf-8')
    except:
        pass

Nothing seems to get the job done.

I'd expect the characters to be replaceable with the methods I've dug up, but they are not. What am I missing? NB: I have to preserve umlauts (äöüÄÖÜ).

-- edit:

Could I be introducing the hex values myself when exporting? If so, is there a way to avoid that?

with io.open('out.txt', 'w', encoding="utf-8") as temp:
    temp.write(s)
P. A. Monsaille

1 Answer


Judging from the images, these are actually control characters. Your editor displays them in this greyed-out way showing you the value of the bytes using hex notation. You don't have the characters "0x01" in your data, but really a single byte with the value 1, so unhexlify and friends won't help.
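To make the distinction concrete, here is a minimal sketch. Both sample strings are made up for illustration and are not taken from the original file. It compares a string holding the actual control character with one holding the text an editor might display for it:

```python
# Hypothetical samples: one with the real control character (code point 1),
# one with the literal text "<0x01>" as an editor might render it.
with_control_char = "z\x01 B."
with_literal_text = "z<0x01> B."

# The control character is a single character with ordinal 1.
print(len(with_control_char))      # 5
print(ord(with_control_char[1]))   # 1

# The literal text is six ordinary characters: '<', '0', 'x', '0', '1', '>'.
print(len(with_literal_text))      # 10

# Searching for the text "0x01" therefore finds nothing in the real data.
print("0x01" in with_control_char)  # False
```

`repr(s)` is a handy way to check which of the two cases you have, since it prints control characters as escape sequences.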

In Python, these characters can be produced in string literals with escape sequences using the notation \xHH, with two hexadecimal digits. The fragment from the first image is probably equal to the following string:

"sich z\x01 B. irgendeine"

Your attempts to remove them were close. s = s.replace('\x01', '.') should work.
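A small sketch of why the earlier attempt missed and how the fix behaves. The sample text and the helper names `replace_soh` and `strip_control_chars` are assumptions made for this example. Note that `'\0x01'` is a NUL character followed by the text `x01`, not the character with value 1. The second helper is one possible way to drop all ASCII control characters at once while leaving umlauts alone:

```python
import re

# Hypothetical sample containing two control characters and some umlauts.
sample = "sich z\x01 B. irgendeine Prüfung\x00 für Äpfel"

# '\0x01' parses as '\0' (NUL) plus the three characters "x01".
assert "\0x01" == "\x00" + "x01"


def replace_soh(text, replacement="."):
    """Replace the single character with code point 1 (hypothetical helper)."""
    return text.replace("\x01", replacement)


# ASCII control characters except tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text):
    """Remove ASCII control characters; non-ASCII letters are untouched."""
    return CONTROL_CHARS.sub("", text)


print(repr(replace_soh(sample)))
print(repr(strip_control_chars(sample)))
```

Because both helpers operate on decoded `str` objects, characters such as ä, ö and ü are never touched; no `encode`/`decode` round trip is needed.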

lenz
  • Yep, that did it … thank you. Fyi, I figured out that I introduced the characters myself during re.sub replacements. For example, `re.sub('(?<=\w)([,.!?;])(?=\w)', u'\1 ', s)` backreferenced the replaced character and thus introduced the "single byte with the value 1". The regex-module apparently does a better job at this: `'(?<=\w)\p{;,\.!\?}(?=\w)'`. (via [reference](https://stackoverflow.com/questions/4324790/removing-control-characters-from-a-string-in-python)) – P. A. Monsaille Mar 27 '19 at 15:18
  • I don't think `re.sub` with `\n` backreferences will introduce control characters. That is, unless you misspell the backreferences as `\x01`, of course. Btw, if this answer solved the problem you described, consider accepting it through the tick on the left. – lenz Mar 27 '19 at 18:57
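
A note on the backreference question raised in the comments: in a normal (non-raw) Python string literal, `'\1'` is an octal escape and is the same character as `'\x01'`. So a replacement written as `u'\1 '` hands `re.sub` a control character rather than a backreference, which seems to be what happened in the snippet quoted in the first comment. The sketch below uses the pattern from that comment with a made-up input string; a raw string keeps the backslash so that `re` sees a real backreference:

```python
import re

# Pattern taken from the comment above; the input string is hypothetical.
pattern = r"(?<=\w)([,.!?;])(?=\w)"
text = "Hallo,Welt"

# In a non-raw literal, "\1" is the single character with code point 1.
assert "\1" == "\x01"

# Non-raw replacement: inserts the control character, drops the comma.
broken = re.sub(pattern, "\1 ", text)

# Raw replacement: \1 is a backreference to the captured punctuation.
fixed = re.sub(pattern, r"\1 ", text)

print(repr(broken))  # 'Hallo\x01 Welt'
print(repr(fixed))   # 'Hallo, Welt'
```

Writing both the pattern and the replacement as raw strings (`r"..."`) avoids this class of problem.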