How to print only printable characters in a binary file (equivalent to strings under Linux)?

Question

I am converting my Python application from Python 2 to Python 3. One of its functions extracts the printable characters from a binary file. I previously used the following function in Python 2, and it worked well:

import string

def strings(filename, min=4):
    with open(filename, "rb") as f:
        result = ""
        for c in f.read():
            if c in string.printable:
                result += c
                continue
            if len(result) >= min:
                yield result
            result = ""
        if len(result) >= min:  # catch result at EOF
            yield result

The code is actually from "Python equivalent of unix "strings" utility". When I run it with Python 2, it produces output like this, which is exactly what I want:

 +s
^!1^
i*Q(
}"~ 
%lh!ghY
#dh!
!`,!
mL#H
o!<XXT0
'   < 
z !Uk
%
 wS
n`  !wl
*ty

(Q  6
!XPLO$
E#kF

However, the function fails under Python 3. It raises this error:

TypeError: 'in <string>' requires string as left operand, not int

So I converted the 'int' to 'str' by replacing this

if c in string.printable:

with this

if str(c) in string.printable:

(I also made the same change everywhere else that the same error was raised.)

Now Python 3 gives the following output:

56700
0000000000000000000000000000000000000000
1236
60000
400234
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
2340
0000
5010
5000
17889
2348
23400000000
5600

I can't see any characters when I use Python 3. Any help getting the code to work, or a pointer to a solution, is appreciated. All I need is to extract the strings from a binary file (very small, a few KB) and store them in a variable.

You have bytes in python3. Use `set(string.printable.encode())` — Padraic Cunningham, Oct 03 '16 at 16:37
I don't know who down voted this question. But I request them to show the documentation and explanation the way 'Mr Martijn Pieters' did in his answer. If shown I will remove this post/question. — sundar_ima, Oct 04 '16 at 15:41

Martijn Pieters · Accepted Answer · 2016-10-03T20:36:14.187

In Python 3, opening a file in binary mode gives you bytes results. Iterating over a bytes object gives you integers, not characters, in the range 0 to 255 (inclusive). From the bytes documentation:

While bytes literals and representations are based on ASCII text, bytes objects actually behave like immutable sequences of integers, with each value in the sequence restricted such that 0 <= x < 256

Convert string.printable to a set and test against that:

printable = {ord(c) for c in string.printable}

and

if c in printable:
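As a minimal sketch of the point above (the sample bytes and variable names are illustrative, not from the question), this shows that iterating over bytes yields integers, and that a set of code points supports the membership test:

```python
import string

data = b"AB\x00\xff"  # illustrative sample, not taken from the question

# Iterating over bytes yields integers in range(256), not 1-character strings.
values = list(data)

# Build the lookup set once; each membership test is then a fast hash lookup.
printable = {ord(c) for c in string.printable}

# Equivalent construction suggested in the comments on the question.
printable_alt = set(string.printable.encode("ascii"))

# Only the two ASCII letters count as printable here.
flags = [b in printable for b in data]
print(values, flags)
```

Both constructions produce the same set of 100 integers, so either can be used.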

Next, you want to append to a bytearray() object to keep things reasonably performant, and decode from ASCII to produce a str result:

import string

printable = {ord(c) for c in string.printable}

def strings(filename, min=4):
    with open(filename, "rb") as f:
        result = bytearray()
        for c in f.read():
            if c in printable:
                result.append(c)
                continue
            if len(result) >= min:
                yield result.decode('ASCII')
            result.clear()  # always reset, even when the run was too short
        if len(result) >= min:  # catch result at EOF
            yield result.decode('ASCII')
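The three bytearray operations used here can be tried in isolation. This is a small sketch with made-up input, only meant to show what each call does:

```python
result = bytearray()

# append() takes a single integer in range(256), which is exactly
# what iterating over a bytes object produces.
for c in b"abc":
    result.append(c)

# decode() returns a str; ASCII is enough because every printable
# character in string.printable is an ASCII character.
text = result.decode("ascii")

# clear() empties the buffer in place so it can be reused for the next run.
result.clear()
```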

Rather than iterate over the bytes one by one, you could instead split on anything that is not printable:

import re

nonprintable = re.compile(b'[^%s]+' % re.escape(string.printable.encode('ascii')))

with open(filename, "rb") as f:
    for result in nonprintable.split(f.read()):
        if result:
            yield result.decode('ASCII')
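Note that the split-based fragment yields every non-empty run, whereas the function in the question only yields runs of at least `min` characters. A sketch of how the length filter could be kept, assuming a hypothetical helper name `strings_from_bytes` and made-up sample data:

```python
import re
import string

# Matches any run of bytes that are NOT in string.printable.
nonprintable = re.compile(b"[^%s]+" % re.escape(string.printable.encode("ascii")))

def strings_from_bytes(data, min_length=4):
    """Yield printable runs of at least min_length characters.

    Hypothetical helper for illustration; it works on bytes already in memory.
    """
    for run in nonprintable.split(data):
        if len(run) >= min_length:
            yield run.decode("ascii")

sample = b"\x00\x01hello\xffab\x02world!\x00"

# The raw split keeps empty values at both ends because the sample
# starts and ends with non-printable bytes.
raw_parts = nonprintable.split(sample)

found = list(strings_from_bytes(sample))
print(raw_parts, found)
```

The length check also discards the empty values, so no separate emptiness test is needed.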

For large files, I'd read in chunks rather than loading the whole file into memory in one go:

with open(filename, "rb") as f:
    buffer = b''
    for chunk in iter(lambda: f.read(2048), b''):
        splitresult = nonprintable.split(buffer + chunk)
        buffer = splitresult.pop()
        for word in splitresult:  # named to avoid shadowing the string module
            if word:
                yield word.decode('ascii')
    if buffer:
        yield buffer.decode('ascii')

The buffer carries over any incomplete word from one chunk to the next; re.split() produces empty values at the start and end if the input started or ended with non-printable characters, respectively.
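Putting the pieces together, here is a self-contained sketch of the chunked approach wrapped in a function. The `min_length` and `chunk_size` parameters and the temporary test file are assumptions added for illustration; the deliberately tiny chunk size exercises the carry-over buffer.

```python
import os
import re
import string
import tempfile

nonprintable = re.compile(b"[^%s]+" % re.escape(string.printable.encode("ascii")))

def strings(filename, min_length=4, chunk_size=2048):
    """Yield printable runs from a binary file, reading it in chunks."""
    with open(filename, "rb") as f:
        buffer = b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            parts = nonprintable.split(buffer + chunk)
            # The last part may continue in the next chunk, so hold it back.
            buffer = parts.pop()
            for part in parts:
                if len(part) >= min_length:
                    yield part.decode("ascii")
        if len(buffer) >= min_length:  # catch a run that ends at EOF
            yield buffer.decode("ascii")

# Write a small made-up binary file to try the function on.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00\x01hello\xffab\x02world!\x00tail")
    path = tmp.name

try:
    small_chunks = list(strings(path, chunk_size=3))  # runs span several chunks
    one_chunk = list(strings(path))                   # whole file in one read
finally:
    os.remove(path)
```

Both calls should produce the same list, which is a quick check that the buffer correctly joins runs split across chunk boundaries.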

@MarkTolonen: better use a `bytearray`; you can't append integers to a `bytes` object. — Martijn Pieters, Oct 03 '16 at 16:40
True, it is one of those surprising things. Iterate over `str` and get length-1 strs, but iterate over `bytes` and get integers. `bytearray` makes more sense being mutable anyway. `result += bytes([c])` would work, but it is not very efficient. — Mark Tolonen, Oct 03 '16 at 22:36

score -1 · Answer 2 · answered Oct 03 '16 at 18:01

I am sure this will work.

As a generator:

import string, _io
def getPrintablesFromBinaryFile(path, encoding='cp1252'):
    global _io, string
    buffer = _io.BufferedReader(open(path, 'rb'))
    while True:
        byte = buffer.read(1)
        if byte == b'':
            return #EOF
        try:
            d = byte.decode(encoding)
        except:
            continue
        if d in string.printable:
            yield d

To use it as a function, just collect the output of getPrintablesFromBinaryFile() into an iterable.

Explanation:

Import the needed modules
Define the function
Load the modules
Create the buffer
Get a byte from the buffer
Check if it is EOF
If yes, stop the generator
Try to decode using the encoding (like '\xef' does not decode using UTF-8)
If impossible, it cannot be a printable
If printable, yield it

Note: cp1252 is the encoding for many text files

Why use `_io` and not `io`? And `open()` already returns a buffered reader, why wrap this again? Why decode by some arbitrary 8-bit codec? All characters in `string.printable` are ASCII characters; better to detect these before decoding and avoid that overhead. And since you read just 1 byte at a time you can't use any multi-byte codec *anyway*; it'd have been more logical to open the file in text mode. Also, don't use blanket `except` statements; catch specific exceptions instead. The OP code yields whole strings, you yield individual bytes, which isn't helpful. — Martijn Pieters, Oct 03 '16 at 23:01
