How to detect UTF16 strings in PE files

Question

I need to extract Unicode strings from a PE file, which means I first have to detect them. For UTF-8 I followed the approach in "How to easily detect utf8 encoding in the string?". Is there a similar way to detect UTF-16 characters? I have tried the following code. Is it right? Any help or suggestions would be appreciated. Thanks in advance!

BYTE temp1 = buf[offset];

BYTE temp2 = buf[offset+1];

while (!(temp1 == 0x00 && temp2 == 0x00) && offset <= bufSize)
{
    if ((temp1 >= 0x00 && temp1 <= 0xFF) && (temp2 >= 0x00 && temp2 <= 0xFF)) 
    {
        tmp += 2;
    }
    else
    {
        break;
    }

    offset += 2;
    temp1 = buf[offset];
    temp2 = buf[offset+1];

    if (temp1 == 0x00 && temp2 == 0x00)
    {
        break;
    }
}

The first if statement in your loop is always true, the second if statement is already covered by your loop condition. What does "not working" mean? What output do you get? What output did you expect? — Botje, Nov 05 '20 at 08:01
@Botje How can it be always true? If I encounter Null character it exits. Random characters are getting detected — YESHU, Nov 05 '20 at 08:14
Finally, your loop stops as soon as both bytes are zero. That is quite common in binary files — Botje, Nov 05 '20 at 08:18
@Botje Yeah its as expected. When I encounter null bytes, I skip those — YESHU, Nov 05 '20 at 08:24
A byte is an unsigned character so it is always between 0 and 0xFF inclusive. As a first step you should probably only count something as a string if you see that two-byte pattern more than, say, four times. — Botje, Nov 05 '20 at 08:36
You can detect runs of Latin-1 characters easily (just search for alternating zero-nonzero bytes) but in general I'm afraid it is not feasible. — n. m. could be an AI, Nov 05 '20 at 10:03
@YESHU I have just written a very fast function for decoding/checking UTF-16, [see my answer](https://stackoverflow.com/a/64694092/941531). Please take a look! — Arty, Nov 05 '20 at 10:42
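
The heuristic suggested in the comments above (alternating nonzero/zero bytes, accepted only when the pattern repeats more than four times) could be sketched roughly as follows. This is only an illustration: the name FindWideAsciiStrings, the restriction to printable ASCII rather than all of Latin-1, and the little-endian byte order are assumptions, not something stated in the discussion.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collects runs of at least
// min_chars printable ASCII characters stored as UTF-16LE, i.e. a printable
// byte followed by a 0x00 byte. The default of 5 mirrors "more than four".
static std::vector<std::string> FindWideAsciiStrings(
    const uint8_t* buf, std::size_t size, std::size_t min_chars = 5) {
    std::vector<std::string> found;
    std::string cur;
    std::size_t i = 0;
    while (i + 1 < size) {
        const bool printable =
            buf[i] >= 0x20 && buf[i] <= 0x7E && buf[i + 1] == 0x00;
        if (printable) {
            cur.push_back(char(buf[i]));
            i += 2;                       // consume one 16-bit code unit
        } else {
            if (cur.size() >= min_chars)  // keep only sufficiently long runs
                found.push_back(cur);
            cur.clear();
            ++i;                          // strings may start at any byte
        }
    }
    if (cur.size() >= min_chars)
        found.push_back(cur);
    return found;
}
```

A scan like this only finds text in the ASCII range; anything else (Cyrillic, CJK, surrogate pairs) needs a real UTF-16 check.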

Arty · Answer 1 · 2021-01-10T06:53:52.223

I have just implemented a function for you, DecodeUtf16Char(). It can do two things: either only check whether the input is valid UTF-16 (when check_only = true), or check it and return the decoded 32-bit Unicode code point. It supports both big-endian (the default, big_endian = true) and little-endian (big_endian = false) byte order within each two-byte UTF-16 word. bad_skip is the number of bytes to skip when a character fails to decode (invalid UTF-16), and bad_value is the value returned to signal that failure; by default it is -1.

Usage examples and tests follow the function definition. You pass a start pointer (ptr) and an end pointer to the function and check the return value: if it is -1, the bytes at ptr were not a valid UTF-16 sequence; otherwise it is the decoded 32-bit Unicode code point (or simply 0 when check_only = true). The function also advances ptr, by the number of decoded bytes for valid UTF-16 or by bad_skip bytes for invalid input.

The function should be very fast because it contains only a few ifs (plus a little arithmetic when you ask it to actually decode characters). Put it in a header so that it can be inlined into the calling function, and pass check_only and big_endian as compile-time constants so that the optimizer can remove the unused decoding code.

If, for example, you just want to detect long runs of UTF-16 bytes, call this function in a loop. The first position where it returns something other than -1 is a possible beginning of text; keep iterating and remember the last position where the result was not -1, which marks the end of the text. When searching this way it is important to pass bad_skip = 1, because a valid character may start at any byte.
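
A minimal sketch of that scanning loop, assuming little-endian input. To stay self-contained it uses a small stand-in validity check (Utf16LeCharLen) instead of DecodeUtf16Char(); both helper names, the min_chars threshold and the choice to treat U+0000 as a terminator are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the validity check: returns the number of bytes
// (2 or 4) used by one valid little-endian UTF-16 character at p, or 0 if the
// data is invalid, truncated, or U+0000 (treated as a terminator here).
static std::size_t Utf16LeCharLen(const uint8_t* p, const uint8_t* end) {
    if (end - p < 2) return 0;
    const uint16_t w1 = uint16_t(p[0] | (p[1] << 8));
    if (w1 == 0) return 0;                       // assumed string terminator
    if (w1 < 0xD800 || w1 > 0xDFFF) return 2;    // single-word character
    if (w1 >= 0xDC00) return 0;                  // lone low surrogate
    if (end - p < 4) return 0;                   // truncated surrogate pair
    const uint16_t w2 = uint16_t(p[2] | (p[3] << 8));
    return (w2 >= 0xDC00 && w2 <= 0xDFFF) ? 4 : 0;
}

// Returns [begin, end) byte offsets of runs of at least min_chars valid
// characters. On failure the scan moves forward by one byte (the bad_skip = 1
// idea), because a valid character may start at any byte.
static std::vector<std::pair<std::size_t, std::size_t>> FindUtf16Runs(
    const uint8_t* buf, std::size_t size, std::size_t min_chars) {
    std::vector<std::pair<std::size_t, std::size_t>> runs;
    const uint8_t* end = buf + size;
    std::size_t pos = 0;
    while (pos < size) {
        std::size_t cur = pos, chars = 0;
        while (std::size_t n = Utf16LeCharLen(buf + cur, end)) {
            cur += n;
            ++chars;
        }
        if (chars >= min_chars) {
            runs.emplace_back(pos, cur);
            pos = cur;                           // continue after the run
        } else {
            ++pos;                               // retry at the next byte
        }
    }
    return runs;
}
```

Because almost any two bytes form a valid UTF-16 code unit, a scan like this will still match a lot of binary data; in practice an extra filter on the decoded characters is usually needed.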

For testing I used a mix of characters: English ASCII, Russian characters (two-byte UTF-16) plus two 4-byte characters (two UTF-16 words each). My tests append the converted line to test.txt, which is UTF-8 encoded so that it can be viewed easily, e.g. in Notepad. None of the code after the decoding function is needed for it to work; the rest is just test code.

The decoder consists of two functions: _DecodeUtf16Char_ReadWord() (a helper) and DecodeUtf16Char() (the main decoder). The only standard header I include is <cstdint>; if you are not allowed to include anything, just define uint8_t, uint16_t and uint32_t yourself, since those type definitions are all I use from it.

For reference, see also my other post, which implements all conversions between UTF-8, UTF-16 and UTF-32, both from scratch and using the standard C++ library.

#include <cstdint>

static inline bool _DecodeUtf16Char_ReadWord(
    uint8_t const * & ptrc, uint8_t const * end,
    uint16_t & r, bool const big_endian
) {
    if (ptrc + 1 >= end) {
        // No data left.
        if (ptrc < end)
            ++ptrc;
        return false;
    }
    if (big_endian) {
        r  = uint16_t(*ptrc) << 8; ++ptrc;
        r |= uint16_t(*ptrc)     ; ++ptrc;
    } else {
        r  = uint16_t(*ptrc)     ; ++ptrc;
        r |= uint16_t(*ptrc) << 8; ++ptrc;
    }
    return true;
}

static inline uint32_t DecodeUtf16Char(
    uint8_t const * & ptr, uint8_t const * end,
    bool const check_only = true, bool const big_endian = true,
    uint32_t const bad_skip = 1, uint32_t const bad_value = -1
) {
    auto ptrs = ptr, ptrc = ptr;
    uint32_t c = 0;
    uint16_t v = 0;
    if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
        // No data left.
        c = bad_value;
    } else if (v < 0xD800 || v > 0xDFFF) {
        // Correct single-word symbol.
        if (!check_only)
            c = v;
    } else if (v >= 0xDC00) {
        // Unallowed UTF-16 sequence!
        c = bad_value;
    } else { // Possibly double-word sequence.
        if (!check_only)
            c = (v & 0x3FF) << 10;
        if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
            // No data left.
            c = bad_value;
        } else if ((v < 0xDC00) || (v > 0xDFFF)) {
            // Unallowed UTF-16 sequence!
            c = bad_value;
        } else {
            // Correct double-word symbol
            if (!check_only) {
                c |= v & 0x3FF;
                c += 0x10000;
            }
        }
    }
    if (c == bad_value)
        ptr = ptrs + bad_skip; // Skip bytes.
    else
        ptr = ptrc; // Skip all eaten bytes.
    return c;
}

// --------- The following code is for testing only and is not needed for decoding ------------

#include <iostream>
#include <string>
#include <codecvt>
#include <fstream>
#include <locale>

static std::u32string DecodeUtf16Bytes(uint8_t const * ptr, uint8_t const * end) {
    std::u32string res;
    while (true) {
        if (ptr >= end)
            break;
        uint32_t c = DecodeUtf16Char(ptr, end, false, false, 2);
        if (c != uint32_t(-1)) // Compare against bad_value without mixing signed and unsigned.
            res.append(1, c);
    }
    return res;
}

#if (!_DLL) && (_MSC_VER >= 1900 /* VS 2015*/) && (_MSC_VER <= 1914 /* VS 2017 */)
std::locale::id std::codecvt<char16_t, char, _Mbstatet>::id;
std::locale::id std::codecvt<char32_t, char, _Mbstatet>::id;
#endif

template <typename CharT = char>
static std::basic_string<CharT> U32ToU8(std::u32string const & s) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv;
    auto res = utf_8_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
    return res;
}

template <typename WCharT = wchar_t>
static std::basic_string<WCharT> U32ToU16(std::u32string const & s) {
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffffUL, std::little_endian>, char32_t> utf_16_32_conv;
    auto res = utf_16_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
    return std::basic_string<WCharT>((WCharT*)(res.c_str()), (WCharT*)(res.c_str() + res.length()));
}

template <typename StrT>
void OutputString(StrT const & s) {
    std::ofstream f("test.txt", std::ios::binary | std::ios::app);
    f.write((char*)s.c_str(), size_t((uint8_t*)(s.c_str() + s.length()) - (uint8_t*)s.c_str()));
    f.write("\n\x00", sizeof(s.c_str()[0]));
}

int main() {
    std::u16string a = u"привет|мир|hello||world||again|русский|english";
    *((uint8_t*)(a.data() + 12) + 1) = 0xDD; // Introduce bad utf-16 byte.
    // Also truncate by 1 byte ("... - 1" in next line).
    OutputString(U32ToU8(DecodeUtf16Bytes((uint8_t*)a.c_str(), (uint8_t*)(a.c_str() + a.length()) - 1)));
    return 0;
}

Output:

привет|мир|hllo||world||again|русский|englis
