
I need to read a UTF-16 encoded string that is stored in memory, from a Python script for LLDB. According to the documentation I can use ReadMemory(address, length, error), but that requires knowing the length in advance. Without it, Python's decode function fails when it hits a character it cannot decode (even with the 'ignore' option) and the process stops:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u018e' in position 12: ordinal not in range(128)

Can anyone suggest a way of achieving this (using either plain Python or the LLDB Python API)? I don't have the original string's length.

Thanks.

Anubis
  • Can you show your code? It's great that you show the error, but please show the full traceback and the sample code that raises it. – David Zemens Feb 19 '16 at 02:00
  • There are many ways to represent strings in memory. Does their doc tell you how they do it? – tdelaney Feb 19 '16 at 02:20
  • Here is a memory dump example of what I need to parse: `(lldb) memory read 0x10142c838 0x10142c838: 61 00 62 00 63 00 64 00 65 00 00 00 00 00 00 00 a.b.c.d.e....... 0x10142c848: 00 00 00 00 00 00 00 00 8e 01 00 00 00 00 00 00 ................` It seems to be a UTF-16-LE encoded string, but I'm not sure it's always null-terminated. I hope this gives a bit more insight. – Anubis Feb 19 '16 at 15:38

1 Answer


Is the string 0-terminated? If so, you could read 2 bytes at a time, until you encounter 0x0000, and then you'd know you have a complete string.

If you do this, you'd want to give yourself a limit (e.g. "I will give up after reading, say, 1 MB of data") in case you run into corrupted memory.
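
As a rough sketch of that idea in Python 3, assuming the string really is UTF-16-LE and 0-terminated: the function below reads one 2-byte code unit at a time and stops at 0x0000 or at a size cap. The names `read_utf16le_cstring`, `read_memory` and `max_bytes` are made up for illustration; `read_memory(address, size)` stands for any callable that returns the requested bytes (or `None` on failure), so that the loop can be tried outside of LLDB.

```python
def read_utf16le_cstring(read_memory, address, max_bytes=1024 * 1024):
    """Read a 0-terminated UTF-16-LE string, two bytes at a time.

    read_memory(address, size) is a hypothetical callable that returns
    `size` bytes, or None / fewer bytes if the memory is unreadable.
    max_bytes is the safety cap mentioned above (1 MB by default).
    """
    raw = bytearray()
    while len(raw) < max_bytes:
        unit = read_memory(address + len(raw), 2)
        if not unit or len(unit) < 2:
            break  # unreadable memory: stop with what we have so far
        unit = bytes(unit)
        if unit == b"\x00\x00":
            break  # found the terminator
        raw += unit
    # "replace" keeps going on unpaired surrogates instead of raising
    return bytes(raw).decode("utf-16-le", "replace")


# Inside LLDB, the reader could wrap SBProcess.ReadMemory, for example
# (assumes `process` is a valid lldb.SBProcess):
#
#   import lldb
#
#   def lldb_reader(process):
#       def read(address, size):
#           error = lldb.SBError()
#           data = process.ReadMemory(address, size, error)
#           return data if error.Success() else None
#       return read
#
#   text = read_utf16le_cstring(lldb_reader(process), 0x10142c838)
```

Reading two bytes per call is slow for long strings; reading larger chunks and searching each chunk for an aligned 0x0000 pair would be a natural refinement. Note also that the result is a Unicode string, so printing it to an ASCII-only output can still raise a UnicodeEncodeError for characters such as U+018E.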

Enrico Granata
  • I thought so too, but apparently there is no defined null [termination](http://stackoverflow.com/questions/5923948/utf-16-string-terminator). Is there any function that evaluates this? – Anubis Feb 19 '16 at 11:06
  • So if I understand, your task is to read a string whose length you don't know and with no known termination? That is a very badly defined problem. What if your valid string is followed by garbage that looks like characters? Are you OK with overly-aggressive printing? Are all characters in the string even going to be printable? – Enrico Granata Feb 19 '16 at 18:43