
I must admit that my experience with C++ is very limited, only a few hundred lines so far.

I solved the problem, but I'm sure there is a better solution. I'm writing this up because I found no solutions via Google or Stack Overflow, and other users may run into a similar problem.

This is the portion of code in C99 to be ported to C++11:

```c
// src/levtest.c

char utf_str2[] = "Chſerſplzon";
uint32_t utf_len2 = strlen(utf_str2);

// convert UTF-8 to UCS-4; u8_toucs writes one code point per uint32_t
uint32_t b_ucs[(utf_len2+1)*4]; // plenty of space (a C99 VLA)
int b_chars = u8_toucs(b_ucs, (utf_len2+1)*4, utf_str2, utf_len2);

int distance = dist_uni(a_ucs, a_chars, b_ucs, b_chars);
printf("[dist_uni]      distance: %d expect: 4\n", distance);
```

Background: the code in src/levbv.c should work via Perl XS, C, and C++ (and maybe other language bindings such as Python). It is highly optimised and should stick to C types. `vector<wchar_t>` is needed because one C++ codebase (the training tools of Tesseract-OCR) uses `vector<wchar_t>` for the relevant portions.
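
Both snippets call the same C entry point from src/levbv.c. For reference, here is a sketch of what its prototype presumably looks like, inferred purely from the call sites in this question (the authoritative declaration is in the repository and may differ):

```cpp
#include <cstdint>

// Sketch only: prototype inferred from the call sites in this
// question; the real declaration in the repository may differ.
extern "C" int dist_uni(uint32_t *a, int a_chars,
                        uint32_t *b, int b_chars);
```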

Here is my corresponding code in C++:

```cpp
// src/levbvcpp.cpp

char utf_str2[] = u8"Chſerſplzon";
uint32_t utf_len2 = strlen(utf_str2);

// convert UTF-8 to wstring
std::wstring b_string = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(utf_str2);

// convert wstring to vector<wchar_t>
std::vector<wchar_t> b_uv(b_string.begin(), b_string.end());
int b_chars = b_uv.size();

// NOTE: a VLA, which is not standard C++ (see the comments below)
uint32_t b_ucs[(utf_len2+1)*4];

// copy vector<wchar_t> into the uint32_t array
// (this only widens each wchar_t as-is; see the comments below)
unsigned int index = 0;
for (uint32_t b_char : b_uv) {
    b_ucs[index] = b_char;
    index++;
}

int distance = dist_uni(a_ucs, a_chars, b_ucs, b_chars);
printf("[dist_uni]      distance: %d expect: 4\n", distance);
```

The full source is at https://github.com/wollmers/Text-Levenshtein-BVXS.

Questions:

  1. Is there a better way to convert?
  2. What data type and encoding do `L"こんにちは世界"` and `u"こんにちは世界"` have? The reference documentation is rather vague on this.
  3. The code is compiled with `-std=c++11 -Wall -g -finput-charset=utf-8 -O3` using clang on macOS. Is there anything to consider on other platforms/compilers when the source is encoded in UTF-8? I did not find a clear answer on Stack Overflow.
  • `uint32_t b_ucs[(utf_len2+1)*4];` is [not legal in standard C++](https://stackoverflow.com/questions/1887097/) since `utf_len2` is not a compile-time constant. Use `std::vector` instead. In any case, your conversion from `wchar_t[]` to `uint32_t[]` is wrong, you are just upscaling each `wchar_t` as-is to 32bit, not actually converting from UTF-16 encoding to UTF-32 encoding. Which BTW, `std::codecvt_utf8` only supports UCS-2, you need to use `std::codecvt_utf8_utf16` to handle UTF-16. Or, you could just skip UTF-16 and use `std::wstring_convert` to convert UTF-8 straight to UTF-32. – Remy Lebeau Mar 08 '22 at 01:21
  • `L"..."` is `const wchar_t[]`, and will be encoded in either UTF-16 (Windows) or UTF-32 (Posix), depending on platform. `u"..."` is `const char16_t[]` and will be encoded in UTF-16. See [String literal](https://en.cppreference.com/w/cpp/language/string_literal) on cppreference.com – Remy Lebeau Mar 08 '22 at 01:28
