I must admit that my experience with C++ is very limited: a few hundred lines.
I solved the problem, but I'm sure there is a better solution. I'm posting this because I found no solution via Google or Stack Overflow, and other users may have a similar problem.
This is the portion of code in C99 to be ported to C++11:
// src/levtest.c
char utf_str2[] = "Chſerſplzon";
uint32_t utf_len2 = strlen(utf_str2);
// convert to ucs
uint32_t b_ucs[(utf_len2+1)*4]; // plenty of space
int b_chars;
b_chars = u8_toucs(b_ucs, (utf_len2+1)*4, utf_str2, utf_len2);
int distance;
distance = dist_uni(a_ucs, a_chars, b_ucs, b_chars);
printf("[dist_uni] distance: %d expect: 4\n", distance);
The background is that the code in src/levbv.c should work via Perl XS, C, and C++ (and maybe other language bindings such as Python). It is highly optimised and should use C types. vector<wchar_t> is needed because one C++ distribution (the training part of Tesseract-OCR) uses vector<wchar_t> for the relevant portions.
Here is my corresponding code in C++:
// src/levbvcpp.cpp
char utf_str2[] = u8"Chſerſplzon";
uint32_t utf_len2 = strlen(utf_str2);
// convert u8 to wstring
std::wstring b_string = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(utf_str2);
// convert wstring to vector<wchar_t>
std::vector<wchar_t> b_uv(b_string.begin(), b_string.end());
int b_chars = b_uv.size();
uint32_t b_ucs[(utf_len2+1)*4];
// convert vector<wchar_t> to uint32_t array[]
unsigned int index = 0;
for (uint32_t b_char : b_uv) {
b_ucs[index] = b_char;
index++;
}
int distance;
distance = dist_uni(a_ucs, a_chars, b_ucs, b_chars);
printf("[dist_uni] distance: %d expect: 4\n", distance);
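One variation on the copy step, as a minimal sketch rather than a definitive answer: variable-length arrays such as `uint32_t b_ucs[(utf_len2+1)*4]` are not part of standard C++ (clang and gcc accept them as an extension), so the code points could go into a `std::vector<uint32_t>` instead. The helper name `to_ucs` is hypothetical, and the commented call assumes `dist_uni` accepts a pointer to `uint32_t` plus an `int` count, as the calls above suggest.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: copy each wchar_t into a contiguous uint32_t buffer.
// The range constructor converts element by element, so no manual index
// variable and no variable-length array are needed.
std::vector<uint32_t> to_ucs(const std::vector<wchar_t>& chars) {
    return std::vector<uint32_t>(chars.begin(), chars.end());
}

// Assumed usage with the C function from the question:
//   std::vector<uint32_t> b_ucs = to_ucs(b_uv);
//   int distance = dist_uni(a_ucs, a_chars,
//                           b_ucs.data(), static_cast<int>(b_ucs.size()));
```

This only yields real code points where `wchar_t` is 32 bits wide (as on macOS and Linux). Where `wchar_t` is 16 bits, characters outside the Basic Multilingual Plane would need separate handling.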
The source is on https://github.com/wollmers/Text-Levenshtein-BVXS.
Questions:
- Is there a better way to convert?
- What datatype and encoding would L"こんにちは世界" or u"こんにちは世界" have? The reference manual is somewhat untechnical.
- The code is compiled with -std=c++11 -Wall -g -finput-charset=utf-8 -O3 and clang on macOS. Is there something to consider on other platforms/compilers when the source is encoded in UTF-8? I did not find a clear answer on Stack Overflow.
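To make the second question concrete, here is a small sketch of how the literal types could be checked at compile time. As far as I understand the standard, `u"..."` is an array of `const char16_t` encoded as UTF-16, `U"..."` is an array of `const char32_t` encoded as UTF-32, and `L"..."` is an array of `const wchar_t` whose width and encoding are implementation-defined. The helper `code_units` is a hypothetical name, and universal character names are used so the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to an array.
static_assert(std::is_same<decltype(L"x"), const wchar_t (&)[2]>::value,
              "L\"...\" is an array of const wchar_t");
static_assert(std::is_same<decltype(u"x"), const char16_t (&)[2]>::value,
              "u\"...\" is an array of const char16_t");
static_assert(std::is_same<decltype(U"x"), const char32_t (&)[2]>::value,
              "U\"...\" is an array of const char32_t");

// Hypothetical helper: number of code units in a literal,
// not counting the terminating null.
template <typename T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) {
    return N - 1;
}

// U+3053 (the first character of the example string) fits in one UTF-16 unit.
static_assert(code_units(u"\u3053") == 1, "BMP character: one UTF-16 unit");
// U+1F600 lies outside the BMP: a surrogate pair in UTF-16, one unit in UTF-32.
static_assert(code_units(u"\U0001F600") == 2, "surrogate pair in UTF-16");
static_assert(code_units(U"\U0001F600") == 1, "single unit in UTF-32");
```

The `L"..."` case is deliberately left out of the length checks, because the count depends on whether `wchar_t` is 16 or 32 bits on the target platform.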