I don't know how to solve this. Imagine we have four websites:
- A: UTF-8
- B: ISO-8859-1
- C: ASCII
- D: UTF-16
My program, written in C++, downloads a website and parses it, but it has to understand the content. My problem is not the parsing itself, which is done with ASCII characters like `>` or `<`.
The problem is that the program should extract all words from the website's text. A word is any sequence of alphanumeric characters. I then send these words to a server. The database and the web frontend use UTF-8. So my questions are:
- How can I convert "any" (or at least the most common) character encodings to UTF-8?
- How can I work with UTF-8 strings in C++? I think `wchar_t` does not work because it is only 2 bytes long on some platforms (e.g. Windows), while a code point encoded in UTF-8 takes up to 4 bytes.
- Are there functions like `isspace()`, `isalnum()`, `strlen()` and `tolower()` for such UTF-8 strings?
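For the conversion question, one common portable approach is the POSIX `iconv` API (on Windows the same interface is available through the libiconv library). Below is a minimal sketch, not a hardened implementation; `to_utf8` is a name chosen here, and error handling and stateful-encoding flushing are simplified:

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Convert `input` from the encoding named `from` (e.g. "ISO-8859-1")
// to UTF-8 using POSIX iconv. Sketch only: a production version would
// also flush shift state for stateful encodings.
std::string to_utf8(const std::string& input, const char* from) {
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported encoding");

    std::string out;
    char buf[1024];
    char* inptr = const_cast<char*>(input.data());
    size_t inleft = input.size();
    while (inleft > 0) {
        char* outptr = buf;
        size_t outleft = sizeof(buf);
        // E2BIG just means the output buffer is full; loop and continue.
        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1
            && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("conversion failed");
        }
        out.append(buf, sizeof(buf) - outleft);
    }
    iconv_close(cd);
    return out;
}
```

For example, `to_utf8("Gr\xFC\xDF", "ISO-8859-1")` would yield the UTF-8 bytes for "Grüß". The encoding name itself you would still have to detect, e.g. from the HTTP `Content-Type` header or the HTML `<meta charset>` tag.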
Please note: I do not do any output (like `std::cout`) in C++; I just filter out the words and send them to the server.
I know about UTF8-CPP, but it has no `is*()` functions. And as I read, it does not convert from other character encodings to UTF-8, only from UTF-* to UTF-8.
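Since libraries like UTF8-CPP leave classification to you, one option is to iterate over the code points of a UTF-8 string yourself and classify each one. The sketch below (`decode_utf8` is a name chosen here) assumes well-formed input and skips validation of continuation bytes. Once you have code points, ICU's `u_isalnum()`/`u_isspace()` from `<unicode/uchar.h>` give Unicode-aware classification; the standard `isalnum()` only covers the ASCII subset:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 string into Unicode code points.
// Sketch only: assumes well-formed UTF-8 and does not
// validate continuation bytes or reject overlong forms.
std::vector<uint32_t> decode_utf8(const std::string& s) {
    std::vector<uint32_t> cps;
    for (size_t i = 0; i < s.size(); ) {
        unsigned char b = s[i];
        uint32_t cp;
        int len;
        if      (b < 0x80) { cp = b;        len = 1; }  // 1-byte (ASCII)
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte lead
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte lead
        else               { cp = b & 0x07; len = 4; }  // 4-byte lead
        // Fold in the low 6 bits of each continuation byte.
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (s[i + k] & 0x3F);
        cps.push_back(cp);
        i += len;
    }
    return cps;
}
```

Word extraction then becomes grouping runs of consecutive alphanumeric code points, with the original UTF-8 bytes of each run sent to the server unchanged.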
Edit: I forgot to say that the program has to be portable: Windows, Linux, ...