0

I have a file consisting of three parts:

  1. XML header (Unicode);
  2. ASCII character 29 (group separator);
  3. A numeric stream to the end of the file.

I want to get one xml string from the first part, and the numeric stream (to be parsed with struct.unpack or array.fromfile).

Should I create an empty string and append to it, reading the file byte by byte until I find the separator, as shown here?

Or is there a way to read everything and use something like xmlstring = open('file.dat', 'rb').read().split(chr(29))[0] (which, by the way, doesn't work)?

EDIT: this is what I see using a hex editor: the separator is there (selected byte)

[Image: hex editor screenshot with the separator byte selected]

heltonbiker

3 Answers

1

Make sure you are reading the file in before trying to split it. In your code, you don't have a .read() call.

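with open('file.dat', 'rb') as f:
    data = f.read()  # read the whole file into memory (a str in Python 2)
    sep = chr(29)    # ASCII 29 (group separator), i.e. '\x1d'
    if sep in data:
        xmlstring = data.split(sep)[0]
    else:
        # hex(29) is the four-character text '0x1d', not a single byte,
        # so it is not a useful fallback separator
        xmlstring = '\x1d not found!'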

Ensure that an ASCII 29 character (\x1d) actually exists in your file.

Tui Popenoe
1

Your attempt at searching for the value chr(29) didn't work because in that expression 29 is a value in decimal notation. The value you got from your hex editor, however, is displayed in hex, so it's 0x29 (or 41 in decimal).

You can simply do the conversion in Python - 0xnn is just another notation for entering an integer literal:

>>> 0x29
41
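
To double-check which byte a hex editor is showing, the built-ins `int`, `hex` and `bytes` cover the conversions in both directions. A small sketch, assuming Python 3 (the variable names are only illustrative):

```python
# A hex editor shows base-16 digits; int(..., 16) converts them to an integer.
from_hex_editor = int('29', 16)
print(from_hex_editor)          # 41

# hex() goes the other way, from an integer to its hex notation.
print(hex(41))                  # 0x29
print(hex(29))                  # 0x1d

# The one-byte separators for each interpretation of "29".
sep_hex_29 = bytes([0x29])      # b')'
sep_dec_29 = bytes([29])        # b'\x1d' (ASCII group separator)
print(sep_hex_29, sep_dec_29)
```

So a byte displayed as 29 in a hex editor is `b')'`, while decimal 29 would be displayed there as 1D.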

You can then use str.partition to split the data into your respective parts:

SEP = chr(0x29)  # separator byte as displayed in the hex editor

with open('file.dat', 'rb') as infile:
    data = infile.read()

xml, sep, binary_data = data.partition(SEP)

Demonstration:

import random

SEP = chr(0x29)


with open('file.dat', 'wb') as outfile:
    outfile.write("<doc></doc>")
    outfile.write(SEP)
    data = ''.join(chr(random.randint(0, 255)) for i in range(1024))
    outfile.write(data)


with open('file.dat', 'rb') as infile:
    data = infile.read()

xml, sep, binary_data = data.partition(SEP)

print xml
print len(binary_data)

Output:

<doc></doc>
1024
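
For reference, the same idea can be sketched for Python 3, where a file opened in `'rb'` mode yields `bytes`, so the separator has to be a `bytes` object as well. The function name `split_file`, the UTF-8 encoding of the header and the layout of the numeric stream (little-endian 32-bit floats) are assumptions made for the example, not something stated in the question:

```python
import os
import struct
import tempfile

SEP = b'\x29'  # the byte shown in the hex editor (decimal 41)


def split_file(path, sep=SEP):
    """Return (xml_text, numeric_bytes), split at the first separator byte."""
    with open(path, 'rb') as infile:
        data = infile.read()
    xml, found, binary_data = data.partition(sep)
    if not found:
        raise ValueError('separator %r not found' % sep)
    # Assumption: the XML header is UTF-8 encoded.
    return xml.decode('utf-8'), binary_data


# Build a small sample file so the sketch can be run on its own.
values = (1.0, 2.5, -3.0)
path = os.path.join(tempfile.mkdtemp(), 'file.dat')
with open(path, 'wb') as outfile:
    outfile.write('<doc></doc>'.encode('utf-8'))
    outfile.write(SEP)
    # Assumption: the numeric stream is little-endian 32-bit floats.
    outfile.write(struct.pack('<3f', *values))

xml_text, numeric = split_file(path)
count = len(numeric) // 4  # 4 bytes per 32-bit float
numbers = struct.unpack('<%df' % count, numeric)
print(xml_text)
print(numbers)
```

Because `partition` splits at the first occurrence only, the same byte value appearing later in the numeric stream is harmless, provided the XML header itself never contains the separator byte.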
Lukas Graf
  • This gives me the whole file as first element, and two additional empty strings – heltonbiker Apr 07 '15 at 18:18
  • Then your file simply does not contain the ASCII character 29 - might it be that 29 is in hex notation instead of decimal? Try `chr(0x29)` as the separator instead like in my updated answer. – Lukas Graf Apr 07 '15 at 18:21
  • The byte is there, see my attached screencapture. – heltonbiker Apr 07 '15 at 18:24
  • 1
    @Tui Popenoe you don't say... My suggestion was that `29` was already the hex representation, and therefore the correct value to search for would be decimal `41`. If you just substitute decimal `29` with `0x1d` you don't change a thing. – Lukas Graf Apr 07 '15 at 18:28
  • @heltonbiker yes, the hex editor displays the values in hex, so `chr(0x29)` or `chr(41)` is the correct value to search for. – Lukas Graf Apr 07 '15 at 18:29
  • @LukasGraf it worked with `file.read().partition(chr(41))`. Now how the hell should I have made the conversion? – heltonbiker Apr 07 '15 at 18:30
  • it also worked as you said: `...partition(chr(0x29))` – heltonbiker Apr 07 '15 at 18:31
1

mmap the file, search for the 29, create a buffer or memoryview from the first part to feed to the parser, and pass the rest through struct.
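
A minimal sketch of that approach, assuming Python 3. The separator value, the UTF-8 header and the three little-endian 32-bit integers used as sample data are assumptions for illustration only:

```python
import mmap
import os
import struct
import tempfile

SEP = b'\x29'  # assumption: the separator byte as displayed in the hex editor

# Build a small sample file so the sketch can be run on its own.
path = os.path.join(tempfile.mkdtemp(), 'file.dat')
with open(path, 'wb') as outfile:
    outfile.write(b'<doc></doc>' + SEP + struct.pack('<3i', 10, 20, 30))

with open(path, 'rb') as infile:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    mm = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        pos = mm.find(SEP)
        if pos == -1:
            raise ValueError('separator not found')
        # Slicing the mapping copies only the header bytes.
        xml_text = mm[:pos].decode('utf-8')
        # unpack_from reads straight from the mapping at the given offset.
        numbers = struct.unpack_from('<3i', mm, pos + 1)
    finally:
        mm.close()

print(xml_text)
print(numbers)
```

Here the header is copied out with a slice for simplicity; wrapping the mapping in a `memoryview` would avoid that copy, at the cost of having to release the view before closing the mapping.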

Ignacio Vazquez-Abrams
  • Would it be better than simply reading one byte at a time until finding the separator byte, or else loading the whole file to a `StringIO` and performing the same search in memory? – heltonbiker Apr 07 '15 at 18:20
  • 1
    A mmapped file exists as a byte array in the file cache; either of those options will be both slower and less flexible. – Ignacio Vazquez-Abrams Apr 07 '15 at 18:21