I am reading in a text file that contains lines with binary data dumped in an encoded fashion, but still as a string (at least in Emacs):

E.g.:

\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242

This is perfectly fine for me, and when I read in that file I want to keep this string and not decode or change it in any way. However, when I read the file, Python does the decoding. How can I prevent that?

with open("/path/to/file") as file:
    for line in file:
        print line

The output looks like:

'���k���G�r��#�\0320^��\021�C\035\000�\016ׁ��'

but it should look like:

\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242

Edit: This encoded data is not the only content of the file; it is part of a larger text dump.

Sim
  • Possible duplicate of [Printing a literal python string in octal](https://stackoverflow.com/questions/46900475/printing-a-literal-python-string-in-octal) – metatoaster Jan 19 '18 at 15:18
  • Alternatively, if octal isn't what you are after, [Process escape sequences in a string in Python](https://stackoverflow.com/questions/4020539/process-escape-sequences-in-a-string-in-python) – metatoaster Jan 19 '18 at 15:19
  • Are you sure this is python and not your terminal? – match Jan 19 '18 at 15:22
  • @match I am sure in the sense that the given line, when trying to insert it into encoding restricted databases, will trigger encoding exception and thus is obviously not parsed as a trivial string. Additionally, when opening the file with emacs the line is displayed as desired. – Sim Jan 19 '18 at 15:25

3 Answers

1

If you really want the octal representation, you can define a function that prints it back out.

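import string

def octal_print(s):
    # Keep printable characters; render everything else as a
    # zero-padded three-digit octal escape so that e.g. \032 followed
    # by '0' stays distinguishable from \320
    print(''.join(c if c in string.printable else '\\%03o' % ord(c) for c in s))

s = '\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207'
octal_print(s)
# prints:
# \240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207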
James
  • I am very sorry for posting a question that had multiple interpretations. With 'contains' I meant partially and not only. I have a text file containing those binary dumps as a minor part. Your solution would not allow me to parse the rest of the line that contains the octal representation. – Sim Jan 19 '18 at 15:42
  • @Sim parsing is another issue. Do the parsing first then print the pieces as you wish – Code-Apprentice Jan 19 '18 at 15:45
  • @Code-Apprentice The issue is that those octal pieces are contained in a larger string that cannot be parsed further and thus needs to be printed as a whole. – Sim Jan 19 '18 at 15:48
  • @Sim why is that an issue? – Code-Apprentice Jan 19 '18 at 15:50
  • @Code-Apprentice Because that function will also encode valid ASCII characters as octal and thus render the whole line unreadable for humans, even though large portions of the line might be perfectly readable – Sim Jan 19 '18 at 15:59
  • @sim So it sounds like you **can** parse the string more. Split the string into substrings which consist of consecutive alphanumeric characters and substrings which are not. – Code-Apprentice Jan 19 '18 at 16:17
  • @sim Or another option is to parse the string into individual characters. Then when you print the character decide how it is printed. – Code-Apprentice Jan 19 '18 at 16:20
  • I have updated my answer to account for printable characters. I am still not sure if this fits the bill – James Jan 19 '18 at 16:25
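The splitting idea from the comments above could be sketched roughly as follows. This is only an illustration, assuming Python 3 and a line already read as `bytes`; `split_runs` and `NON_PRINTABLE` are made-up names, and "printable" is taken to mean the ASCII range 0x20 to 0x7e.

```python
import re

# Assumption: anything outside 0x20-0x7e counts as "binary"
NON_PRINTABLE = re.compile(rb'[^\x20-\x7e]+')

def split_runs(raw):
    """Return (is_binary, chunk) pairs that cover raw in order."""
    runs = []
    pos = 0
    for m in NON_PRINTABLE.finditer(raw):
        if m.start() > pos:
            # Readable text between two binary runs
            runs.append((False, raw[pos:m.start()]))
        runs.append((True, m.group()))
        pos = m.end()
    if pos < len(raw):
        # Trailing readable text after the last binary run
        runs.append((False, raw[pos:]))
    return runs
```

Each readable chunk can then go to the normal parser, while the binary chunks are printed in whatever escaped form is preferred.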
1

You can open the file in binary mode with the 'rb' option; the data is then read as-is, without any decoding.

Example:

with open(PathToFile, 'rb') as file:
    raw_binary_data = file.read()

print(raw_binary_data)
Rakesh
  • I am very sorry for posting a question that had multiple interpretations. With 'contains' I meant partially and not only. I have a text file containing those binary dumps as a minor part. Reading the whole file binary would not allow me to use my parser for the file. The encoded binary dumps are a minor part of perfectly readable ascii lines. – Sim Jan 19 '18 at 15:40
  • @Sim: In that case, this is still the correct answer. You will need to modify your parser to handle a binary file. Files in Python can either be opened in text mode or binary mode, and if you need to handle binary data, it *must* be opened in binary mode. You simply don't have a choice here. You can always convert binary to text afterwards. – Dietrich Epp Jan 19 '18 at 15:48
0

Based on James's answer, I adapted the octal_print function to distinguish between actual octal escapes and innocent characters.

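def octal_print(s):
    # Python 2: s is a byte string, so each character is a single byte
    charlist = []
    for character in s:
        try:
            # ASCII bytes (including control characters) are kept unchanged
            character.decode('ascii')
            charlist.append(character)
        except UnicodeDecodeError:
            # Non-ASCII bytes become zero-padded three-digit octal escapes
            charlist.append('\\%03o' % ord(character))
    return ''.join(charlist)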
Sim