
I'm working with a couple of binary files and I want to parse the UTF-8 strings they contain.

I currently have a function that takes a starting location in the file and returns the string found there:

def str_extract(file, start, size, delimiter = None, index = None):
    file.seek(start)
    if (delimiter != None and index != None):
        return file.read(size).explode('0x00000000')[index] #incorrect
    else:
        return file.read(size)

Some strings in the file are separated by 0x00 00 00 00. Is it possible to split on these like PHP's explode? I'm new to Python, so any pointers on code improvements are welcome.

Sample file:

48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 | 00 00 00 00 | 31 00 32 00 33 00

This is Hello World followed by 123; I've marked the 00 00 00 00 separator by enclosing it in | bars.

So:

str_extract(file, 0x00, 0x20, 0x00000000, 0) => 'Hello World'

Similarly:

str_extract(file, 0x00, 0x20, 0x00000000, 1) => '123'
– Helen Che

2 Answers


I'm going to assume you are using Python 2 here, but write the code to work on both Python 2 and Python 3.

You have UTF-16 data, not UTF-8. You can read that as binary data and split on the four NUL bytes with the str.split() method:

file.read(size).split(b'\x00' * 4)[index]

The resulting data is encoded as UTF-16 little-endian (you may or may not have omitted the UTF-16 BOM at the start); you can decode the data with:

result.decode('utf-16-le')

This will, however, fail: Python splits on the first four NULs it finds, and that match swallows the trailing NUL byte that belongs to the last character ('d' is encoded as 64 00), leaving an odd number of bytes that can no longer be decoded as UTF-16.
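
To make the failure concrete, here is roughly what that looks like with the question's sample data (Python 2 shown, matching the demo below; the exact wording of the error can differ slightly between Python versions):

>>> data = b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00\x00\x00\x00\x001\x002\x003\x00'
>>> data.split(b'\x00' * 4)[0]
'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d'
>>> len(data.split(b'\x00' * 4)[0])  # 21 bytes; the trailing NUL of 'd' was eaten by the delimiter match
21
>>> data.split(b'\x00' * 4)[0].decode('utf-16-le')  # odd length, so strict decoding raises
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf16' codec can't decode byte 0x64 in position 20: truncated data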

The better idea is to decode to Unicode first, then split on two NUL codepoints:

file.read(size).decode('utf-16-le').split(u'\x00' * 2)[index]

Putting this together as a function would be:

def str_extract(file, start, size, delimiter = None, index = None):
    file.seek(start)
    if (delimiter is not None and index is not None):
        delimiter = delimiter.decode('utf-16-le')  # or pass in Unicode
        return file.read(size).decode('utf-16-le').split(delimiter)[index]
    else:
        return file.read(size).decode('utf-16-le')

with open('filename', 'rb') as fobj:
    result = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)

If the file has a BOM at the start, consider opening the file as UTF-16 instead:

import io

with io.open('filename', 'r', encoding='utf16') as fobj:
    # ....

and remove the explicit decoding.
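
For reference, a minimal sketch of what that might look like (not from the original answer): it assumes the interesting data starts at the beginning of the file and that a BOM is present. Keep in mind that in text mode read() counts characters rather than bytes and seek() only accepts 0 or values previously returned by tell(), so the byte-based start/size arguments don't carry over directly:

import io

# Sketch only: with a BOM present, the 'utf16' codec consumes it and picks
# the byte order from it, so the decoded text starts straight at 'H'.
with io.open('filename', 'r', encoding='utf16') as fobj:
    chunk = fobj.read(0x20)           # up to 32 characters, not bytes
    parts = chunk.split(u'\x00' * 2)  # split on two NUL codepoints
    result = parts[0]                 # u'Hello World'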

Python 2 demo:

>>> from io import BytesIO
>>> data = b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00\x00\x00\x00\x001\x002\x003\x00'
>>> fobj = BytesIO(data)
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 0)
u'Hello World'
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 1)
u'123'
– Martijn Pieters
  • Reading up on BOM right now, how would I detect this? I prefer this way as it is cleaner than doing explicit decoding. – Helen Che May 08 '15 at 08:55
  • @VeraWang: your file would start with the bytes FF FE (encoding U+FEFF ZERO WIDTH NO-BREAK SPACE to UTF-16 little-endian). – Martijn Pieters May 08 '15 at 08:57
  • Sorry another question here: these files contain characters in English, German, French, Japanese and can also contain stuff not in the string. Does Python have a predetermined hex range for these set of characters? I only want "readable" characters if that makes sense. – Helen Che May 08 '15 at 08:59
  • @VeraWang: no, that doesn't make much sense. :-) Do you mean you want to filter on *printable* characters? Tabs, newlines, non-breaking spaces, etc, all carry meaning, you'll have to be more specific. – Martijn Pieters May 08 '15 at 09:00
  • Yes, sorry it's been a long day :-). Including null bytes as well. – Helen Che May 08 '15 at 09:03
  • @VeraWang: you can use `str.translate()` to efficiently remove certain codepoints; `result.translate({0: None})` tells the method to map the `0` codepoint (a NUL) to `None`, which means *delete it*. – Martijn Pieters May 08 '15 at 09:08
  • @VeraWang: alternatively, have the dictionary (whose keys must be integers, denoting Unicode codepoints) map to other codepoints (so again an integer, but a single Unicode character is fine too) to replace codepoints with other codepoints. So `result.translate({0: u'?'})` would replace NUL characters with a question mark. – Martijn Pieters May 08 '15 at 09:10 (See the sketch after this comment thread.)
  • Thanks, I've learnt quite a bit about Python from this post alone! – Helen Che May 08 '15 at 09:22
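
For reference, a minimal sketch of the two str.translate() mappings described in the comments above (not part of the original answer; works on Python 2 unicode strings and Python 3 str alike):

# Example text containing stray NULs (made-up data for illustration).
# The mapping keys are integer codepoints; None deletes the codepoint,
# a string replaces it.
result = u'Hello\x00World'
print(result.translate({0: None}))  # deletes NULs   -> HelloWorld
print(result.translate({0: u'?'}))  # replaces NULs  -> Hello?World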

First you need to open the file in binary mode.

Then you split the str (or bytes, depending on the version of Python) with a delimiter of four zero bytes, b'\0\0\0\0':

def str_extract(file, start, size, delimiter = None, index = None):
    file.seek(start)
    if (delimiter is not None and index is not None):
        return file.read(size).split(delimiter)[index]
    else:
        return file.read(size)

Furthermore, you need to handle the encoding, since str_extract only returns the binary data and your test data is UTF-16 little-endian, as Martijn Pieters noted:

>>> str_extract(file, 0x00, 0x20, b'\0\0\0\0', 0).decode('utf-16-le')
u'Hello World'

Besides: use is not None to test that a variable is not None.

– tynn
  • Not quite `'Hello World'`; more like `'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d'`, which is UTF-16 little endian data. – Martijn Pieters May 08 '15 at 08:40