Splitting binary file content in two parts using single byte separator in python

Question

I have a file consisting of three parts:

Xml header (unicode);
ASCII character 29 (group separator);
A numeric stream to the end of file

I want to get one xml string from the first part, and the numeric stream (to be parsed with struct.unpack or array.fromfile).

Should I create an empty string and append to it while reading the file byte by byte until I find the separator, as shown here?

Or is there a way to read everything and use something like xmlstring = open('file.dat', 'rb').read().split(chr(29))[0] (which by the way doesn't work) ?

EDIT: this is what I see using a hex editor: the separator is there (selected byte)

[Image: hex editor screenshot with the separator byte selected]

In what way does `.split(29)` not work? Does it produce an error message? Please provide a short, complete program that demonstrates the error you are having. — Robᵩ, Apr 07 '15 at 18:11
Can you show a sample input and the expected output for your file? — Mazdak, Apr 07 '15 at 18:11
It would be a bit difficult for me to create code right now (I am already receiving the file generated elsewhere). — heltonbiker, Apr 07 '15 at 18:13
The code you have pasted works fine for me. In what way does it not work for you? — Robᵩ, Apr 07 '15 at 18:15
@Robᵩ it returns the whole file, not just the part before `chr(29)` . — heltonbiker, Apr 07 '15 at 18:16
You should be using `with open` anyway; it simplifies exception handling with some encapsulation. http://stackoverflow.com/a/3012921 — Tui Popenoe, Apr 07 '15 at 18:19
@IgnacioVazquez-Abrams I think this is my problem. It appears as a `)/` in the editor and when I just print the string to the console. What have I lost here? — heltonbiker, Apr 07 '15 at 18:26
It would make more sense if it was an actual 29 (i.e. 0x1d) instead, since 0x29 is a ")". But that is a source error. — Ignacio Vazquez-Abrams, Apr 07 '15 at 18:27
@IgnacioVazquez-Abrams I talked to a coworker, and indeed the C# code generating the file was erroneously using `0x29` as separator. We are fixing that, thanks!! — heltonbiker, Apr 07 '15 at 18:41

Tui Popenoe · Answer 1 · 2015-04-07T18:26:46.413

1

Make sure you are reading the file in before trying to split it. In your code, you don't have a .read().

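with open('file.dat', 'rb') as f:
    data = f.read()  # avoid shadowing the built-in name `file`

# ASCII 29 (group separator) as a single byte; the same value as chr(29) in Python 2.
# hex(29) is the four-character text '0x1d', not this byte, so it is not searched for.
sep = b'\x1d'
if sep in data:
    xmlstring = data.split(sep)[0]
else:
    xmlstring = '\x1d not found!'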

Ensure that an ASCII 29 character (\x1d) actually exists in your file.

edited Apr 07 '15 at 18:26

answered Apr 07 '15 at 18:13

Tui Popenoe


Thanks, there was a typo in my sample code. I was already doing this, but it didn't work as expected. – heltonbiker Apr 07 '15 at 18:14
Regarding your last phrase, the group separator byte can be seen in a HexEditor. – heltonbiker Apr 07 '15 at 18:21

Lukas Graf · Accepted Answer · 2015-04-07T18:34:25.507

1

Your attempt at searching for the value chr(29) didn't work because in that expression 29 is a value in decimal notation. The value you got from your hex editor however is displayed in hex, so it's 0x29 (or 41 in decimal).

You can simply do the conversion in Python - 0xnn is just another notation for entering an integer literal:

>>> 0x29
41
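
Going the other way, from the two digits a hex editor displays to the byte to search for, might look like the sketch below. It uses Python 3 syntax, and the variable names are only illustrative.

```python
# Digits exactly as the hex editor displays them for the selected byte.
hex_digits = "29"

# Interpret the digits as base 16 to get the integer value of the byte.
value = int(hex_digits, 16)

# Build a one-byte separator that can be used with data read in 'rb' mode.
sep = bytes([value])

# ASCII 29, the real group separator, is a different byte altogether.
group_separator = bytes([29])

print(value, sep, group_separator)
```

This prints 41 b')' b'\x1d', which shows that the byte displayed as 29 in the hex editor is the ")" character and not the group separator.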

You can then use str.partition to split the data into your respective parts:

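SEP = chr(0x29)  # the byte displayed as 29 in the hex editor (41 in decimal)

with open('file.dat', 'rb') as infile:
    data = infile.read()

# partition splits at the first occurrence of SEP only
xml, sep, binary_data = data.partition(SEP)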

Demonstration:

import random

SEP = chr(0x29)


with open('file.dat', 'wb') as outfile:
    outfile.write("<doc></doc>")
    outfile.write(SEP)
    data = ''.join(chr(random.randint(0, 255)) for i in range(1024))
    outfile.write(data)


with open('file.dat', 'rb') as infile:
    data = infile.read()

xml, sep, binary_data = data.partition(SEP)

print xml
print len(binary_data)

Output:

<doc></doc>
1024
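
The demonstration above is written for Python 2. A rough Python 3 equivalent, where data read in 'rb' mode is bytes and the separator has to be bytes too, could look like this. The UTF-8 header encoding and the "<3d" stream layout (three little-endian doubles) are assumptions made only so the sketch runs; the real file's encoding and numeric format are not given in the question.

```python
import os
import struct
import tempfile

# Separator byte as the file actually contains it (0x29);
# use b"\x1d" instead for a real ASCII 29 group separator.
SEP = b"\x29"

# Write a small sample file: XML header, separator, numeric stream.
# The "<3d" layout is hypothetical and only serves the illustration.
values = (1.0, 2.5, -3.0)
path = os.path.join(tempfile.mkdtemp(), "file.dat")
with open(path, "wb") as outfile:
    outfile.write("<doc></doc>".encode("utf-8"))
    outfile.write(SEP)
    outfile.write(struct.pack("<3d", *values))

with open(path, "rb") as infile:
    data = infile.read()

# bytes.partition splits at the first occurrence of SEP only, so bytes
# in the numeric stream that happen to equal SEP are left untouched.
xml_bytes, sep, binary_data = data.partition(SEP)
if not sep:
    raise ValueError("separator byte not found")

xml = xml_bytes.decode("utf-8")  # assumes the header is UTF-8
numbers = struct.unpack("<3d", binary_data)

print(xml)
print(numbers)
```

Checking that the middle element returned by partition is non-empty distinguishes "separator missing" from "separator present but nothing after it", which split()[0] alone cannot do.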

edited Apr 07 '15 at 18:34

answered Apr 07 '15 at 18:14

Lukas Graf


This gives me the whole file as the first element, and two additional empty strings – heltonbiker Apr 07 '15 at 18:18
Then your file simply does not contain the ASCII character 29 - might it be that 29 is in hex notation instead of decimal? Try `chr(0x29)` as the separator instead like in my updated answer. – Lukas Graf Apr 07 '15 at 18:21
The byte is there, see my attached screencapture. – heltonbiker Apr 07 '15 at 18:24
@Tui Popenoe you don't say... My suggestion was that `29` was already the hex representation, and therefore the correct value to search for would be decimal `41`. If you just substitute decimal `29` with `0x1d` you don't change a thing. – Lukas Graf Apr 07 '15 at 18:28
@heltonbiker yes, the hex editor displays the values in hex, so `chr(0x29)` or `chr(41)` is the correct value to search for. – Lukas Graf Apr 07 '15 at 18:29
@LukasGraf it worked with `file.read().partition(chr(41))`. Now how the hell should I have made the conversion? – heltonbiker Apr 07 '15 at 18:30
it also worked as you said: `...partition(chr(0x29))` – heltonbiker Apr 07 '15 at 18:31

score 1 · Answer 3 · answered Apr 07 '15 at 18:18

mmap the file, search for the 29, create a buffer or memoryview from the first part to feed to the parser, and pass the rest through struct.
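
A minimal sketch of that approach, in Python 3, might look like the following. The sample file, the UTF-8 header and the "<2H" layout (two little-endian unsigned 16-bit integers) are assumptions for illustration only, and the header is obtained by slicing the mmap rather than through a memoryview to keep the example short.

```python
import mmap
import os
import struct
import tempfile

SEP = b"\x1d"  # ASCII 29, the group separator

# Sample file for illustration: XML header, separator, numeric stream.
# The "<2H" layout is hypothetical; the real stream format is not given.
path = os.path.join(tempfile.mkdtemp(), "file.dat")
with open(path, "wb") as outfile:
    outfile.write(b"<doc></doc>" + SEP + struct.pack("<2H", 7, 513))

with open(path, "rb") as infile:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    mm = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        pos = mm.find(SEP)  # offset of the first separator byte, or -1
        if pos == -1:
            raise ValueError("separator byte not found")
        # Slicing copies only the header bytes out of the mapping.
        xml = mm[:pos].decode("utf-8")
        # unpack_from reads the numbers in place, starting after the separator.
        numbers = struct.unpack_from("<2H", mm, pos + 1)
    finally:
        mm.close()

print(xml)
print(numbers)
```

Only the header is copied into a Python object here; the numeric part is decoded directly from the mapped file.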

answered Apr 07 '15 at 18:18

Ignacio Vazquez-Abrams


Would it be better than simply reading one byte at a time until finding the separator byte, or else loading the whole file to a `StringIO` and performing the same search in memory? – heltonbiker Apr 07 '15 at 18:20
A mmapped file exists as a byte array in the file cache; either of those options will be both slower and less flexible. – Ignacio Vazquez-Abrams Apr 07 '15 at 18:21
