In cybersecurity we often use the strings binary to extract plaintext string data from a memory dump. I'm trying to do the equivalent in Python.
from struct import unpack

find_str = "King"
strings = []
# `data` is the parsed dump object, defined earlier (not shown).
for stream in data.streams:
    if type(stream.data) == bytes:
        # Not a particularly readable solution, but orders of magnitude faster
        # than alternatives: https://stackoverflow.com/a/57543519/9400421
        unpacked = list(unpack(f'{len(stream.data)}c', stream.data))
        string = ''
        null = b'\x00'
        for byte in unpacked:
            try:
                # Ultimately I'll need to track a separate strings list for each
                # popular encoding scheme to catch diverse string encodings.
                decoded = byte.decode('ASCII')
                print(byte, '=>', decoded)
                if byte == null:
                    print(byte, '=>', 'null')
                    if string != '':
                        strings.append(string)
                        string = ''
                else:
                    string += decoded
            except UnicodeDecodeError:
                print("couldn't decode:", byte)
        if string != '':
            strings.append(string)
            string = ''
print(strings)
output: ..., '*', '\x7f', '\x10', '\x10', '\x04', '\x01', '\x12+', '\x7f', '*', '\x7f', '@', '\x10', '\x02', '\x01', '\x10\x13+', '\x7f', '\x0c', '\x01', ...
My problem is that this outputs a lot of decoded values which are obviously not normal characters: they show up as hex escape sequences such as '\x7f' and '\x10'.
My first question is: why do these bytes decode to hex escape sequences rather than normal characters, yet not trigger my except clause? I thought anything that didn't decode "cleanly" as a character would be filtered out by my code.
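
To illustrate, a minimal REPL check (Python 3) reproduces what I'm seeing: bytes like 0x10 and 0x7f decode under ASCII without error, while bytes of 0x80 and above do raise:

>>> b'\x10'.decode('ASCII')
'\x10'
>>> b'\x7f'.decode('ASCII')
'\x7f'
>>> b'\x80'.decode('ASCII')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)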
My second question is: how can I discard these "trash" characters, i.e. filter them out from the "cleanly" decoded ones?
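
The best idea I have so far (not sure it's the right approach) is to treat only printable ASCII as "clean", flush the current run on anything else, and keep runs of at least 4 characters, roughly mimicking the default behavior of the strings binary. A rough sketch, with a helper name (extract_ascii_strings) of my own invention:

import string

PRINTABLE = set(string.printable)

def extract_ascii_strings(buf, min_len=4):
    # Yield runs of printable ASCII at least min_len characters long.
    # Iterating over a bytes object yields ints, hence the chr() call.
    run = ''
    for value in buf:
        char = chr(value)
        if char in PRINTABLE:
            run += char
        else:
            if len(run) >= min_len:
                yield run
            run = ''
    if len(run) >= min_len:
        yield run

Is filtering on string.printable like this reasonable, or is there a more idiomatic way?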