Python: Searching a binary file (.PLM) for unicode string

Question

I'm trying to extract a directory name from a .PLM file using Python 2.7 on Windows 10. A .PLM file is a proprietary file format used for Panasonic voice recorders, which stores the name of the directory for the voice recordings.

(example: say I have a voice recording, which I'd like to save in the folder "HelloÆØÅ", then this voice recorder creates a folder called "SV_VC001" and a file called "SD_VOICE.PLM" which, among a bunch of other data, stores the string "HelloÆØÅ")

Now, I'm a Dane, and so use the characters Æ, Ø and Å, which aren't supported by ASCII, so I have to convert this binary data into Unicode.

So far I know that the name of the directory is stored from byte 56 and onward, and terminates with a byte of all 0's. For example, one recording is stored in a directory called "2-3-15 Årstids kredsløbet michael", which has the hex-values:

322d 332d 3135 20c5 7274 6964 7320 6b72 
6564 736c f862 6574 206d 6963 6861 656c

This is the code I'm using thus far:

# Finds the filename in the .PLM-file
def  FindFileName(File):
    # Opens the file and points to byte 56, where the file name starts
    f = open(File,'rb')
    f.seek(56)
    Name = ""


    byte = f.read(1)        # Reads the first byte after byte 56
    while byte != "\x00":   # Runs the loop, until a NUL-character is found (00 is NUL in hex)
        Name += str(byte)   # Appends the current byte to the string Name
        byte = f.read(1)    # reads the next byte

    f.close()

    return Name

And this works - provided the directory name only uses ASCII characters (so no 'æ', 'ø' or 'å').

However, if there are non-ASCII characters in the string, they are converted to other characters. With the directory "2-3-15 Årstids kredsløbet michael", this program outputs "2-3-15 ┼rtids kredsl°bet michael"

Do you have any suggestions?
Thank you very much in advance.

EDIT

Adding the suggestions from Mark Ransom, the code is as follows. I also tried clumsily to handle the 3 edge cases found: question marks are changed to spaces, and \xc5 and \xd8 (Å and Ø in hex) are changed to å and ø respectively.

def  FindFileName(File):
    # Opens the file and points to byte 56, where the file name starts
    f = open(File,'rb')
    f.seek(56)
    Name = ""


    byte = f.read(1)        # Reads the first byte after byte 56
    while byte and (byte != "\x00"):    # Runs the loop, until a NUL-character is found (00 is NUL in hex)

        # Since there are problems with "?" in directory names, we change those to spaces
        if byte == "?": 
            Name += " "
        elif byte == "\xc5":
            Name += "å"
        elif byte == "\xd8":
            Name += "ø"
        else:
            Name += byte

        byte = f.read(1)    # reads the next byte

    f.close()

    return Name.decode('mbcs')

Which produces the following error for uppercase Æ, Ø and Å:

WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: u'C:\\Users\\UserName\\Desktop\\TestDir\\Mapper\\13-10*14 ESSOTERISK \xc5NDSSTR\xd8MNIN'

The string should be "13-10*14 ESSOTERISK ÅNDSSTRØMNIN", but Å and Ø (hex c5 and d8) are throwing errors.

Don't get the problem yet. So **Name** does not contain 'æ', 'ø' or 'å' once you read it? — HelloWorld, Jul 19 '16 at 19:49
Is this Windows and what is the encoding? I think that string is a windows code page encoding and not unicode at all. — tdelaney, Jul 19 '16 at 20:06
@HelloWorld - Precisely, I'll edit the question to make it more clear. Thank you for the comment. — Musai, Jul 19 '16 at 20:13
@tdelaney It's on windows 10, and when æ, ø or å is encountered, another character is printed, for å it's '┼' and for ø it's '°'. I've also edited the question to make this more clear. — Musai, Jul 19 '16 at 20:20
I don't think your remaining problem is with `Å` and `Ø`, I think it's `*`. That's an invalid character in a filename, same as `?`. See http://stackoverflow.com/a/31976060/5987 — Mark Ransom, Jul 19 '16 at 21:21
@MarkRansom That did it! It now works on all 240 folders. Thank you so, so much! — Musai, Jul 19 '16 at 21:30
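
A minimal sketch of the fix that resolved this, assuming the goal is simply to replace every character Windows rejects in file and directory names (`< > : " / \ | ? *` and control characters) with a space, as the question already does for `?`. The helper name `sanitize_windows_name` is hypothetical, and the sketch does not cover other Windows naming rules such as reserved device names or trailing dots and spaces.

```python
import re

# Characters that Windows does not allow in file or directory names:
# < > : " / \ | ? * and the control characters 0x00-0x1F.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_windows_name(name, replacement=u" "):
    """Replace characters that are invalid in Windows names.

    Hypothetical helper for illustration; `name` is expected to be the
    already-decoded Unicode string.
    """
    return _INVALID_CHARS.sub(replacement, name)
```

It would be applied to the decoded name before the directory is created, for example `sanitize_windows_name(FindFileName(path))`.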

score 2 · Accepted Answer · answered Jul 19 '16 at 20:26

In Python 2, reading from a binary file returns a string, so there's no need to call str on it. Also, if the file is ill-formed and contains no zero byte, read will return an empty string once the end of the file is reached, and your loop will never terminate. You can check for both the zero byte and the end of the file with a small modification to your test.

while byte and (byte != "\x00"):   # Runs the loop, until a NUL-character is found (00 is NUL in hex)
    Name += byte        # Appends the current byte to the string Name
    byte = f.read(1)    # reads the next byte

Once you have the full byte sequence, you must turn it into a Unicode string. For that you need decode:

Name = Name.decode("utf-8")

As mentioned in the comments, it doesn't appear that your string is actually UTF-8 but rather one of Microsoft's code pages. You can decode from the code page that Windows is currently using:

Name = Name.decode("mbcs")

You can explicitly give a code page to use instead; see the documentation.
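
As a rough sketch of the explicit-code-page variant, the byte-by-byte loop can be replaced by reading the remainder of the file and cutting at the first zero byte. The function name `read_plm_name`, its parameters, and the choice of `cp1252` (Windows-1252, the usual Western European code page) are assumptions for illustration; the question only establishes that the name starts at byte 56 and ends at a zero byte.

```python
def read_plm_name(path, offset=56, encoding="cp1252"):
    """Return the NUL-terminated name stored at `offset`, decoded to Unicode.

    `encoding` is an assumption: cp1252 is only a guess at the code page
    the recorder used when it wrote the file.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read()              # everything from the offset to end of file
    # Keep only the bytes before the first zero byte (or all of them if
    # the file contains no zero byte after the offset).
    name_bytes = raw.split(b"\x00", 1)[0]
    return name_bytes.decode(encoding)
```

Unlike `mbcs`, a named code page gives the same result regardless of the locale settings of the machine the script runs on.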

You may run into trouble when trying to print the string on the console, since the Windows console does not use the same code page as the rest of the system; it might not have the characters you need to print.
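
To illustrate the mismatch, the bytes quoted in the question can be decoded with two different code pages. This is only a sketch: `cp1252` for the file and `cp850` for the console are assumptions, chosen because they are common on Western European Windows installations and are consistent with the output reported in the question. The helper `non_ascii_code_points` is hypothetical and exists only to show the result without printing non-ASCII text.

```python
import binascii

# Hex dump copied from the question (the name bytes, without the final NUL).
raw = binascii.unhexlify(
    "322d332d313520c57274696473206b72"
    "6564736cf8626574206d69636861656c"
)

as_ansi = raw.decode("cp1252")     # assumed "ANSI" code page of the system
as_console = raw.decode("cp850")   # assumed OEM code page of the console


def non_ascii_code_points(text):
    """List the code points above 127 so the output stays ASCII-safe."""
    return [hex(ord(ch)) for ch in text if ord(ch) > 127]


print(non_ascii_code_points(as_ansi))      # ['0xc5', '0xf8']
print(non_ascii_code_points(as_console))   # ['0x253c', '0xb0']
```

Under these assumptions, the same two bytes come out as U+00C5 and U+00F8 (Å and ø) in one code page and as U+253C and U+00B0 (┼ and °) in the other, which matches the garbled characters seen in the question.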

The 'mbcs' decoding is working for the most part, but gets stuck on uppercase æ, ø and å. The string "13-10*14 ESSOTERISK ÅNDSSTRØMNIN" gives an error for the letters Å and Ø (hex c5 and d8 respectively). I'll update the question with the full error code — Musai, Jul 19 '16 at 20:45
Also, I accidentally marked the question as answered, since I was very happy it worked for the original string "2-3-15 Årstids kredsløbet michael", but didn't check for all the 240 directories. I'm really sorry about that — Musai, Jul 19 '16 at 20:56
