2

What would be the algorithm/implementation of the C++ code C++functionX in the following flow chart:

(JavaString) --getBytes--> (bytes) --C++functionX--> (C++String)

JavaString contents should match C++String contents as far as possible (preferably 100% for all possible values of JavaString)

[EDIT] The endianness of bytes can be ignored as there are ways to handle that.

Jus12

4 Answers

3

Java:

String original = "BANANAS";
byte[] utf8Bytes = original.getBytes("UTF-8");
// save the length as a 32-bit integer, then the UTF-8 bytes, to a file

C++:

int32_t tlength;
std::string utf8Bytes;
//load the tlength as a 32 bit integer, then the utf8 bytes from the file
//well, that's easy for UTF8

//to turn that into a UTF-16 string on Windows
int wlength = MultiByteToWideChar(CP_UTF8, 0, utf8Bytes.c_str(), static_cast<int>(utf8Bytes.size()), nullptr, 0);
std::wstring result(wlength, L'\0');
MultiByteToWideChar(CP_UTF8, 0, utf8Bytes.c_str(), static_cast<int>(utf8Bytes.size()), &result[0], wlength);
//so that's not hard either

To do this on Linux, one uses the iconv library, which is incredibly powerful but more difficult to use. Here's a function that converts a UTF-8 std::string to a UTF-32 std::wstring: http://coliru.stacked-crooked.com/view?id=986a4a07e391213559d4e65acaf231d5-e54ee7a04e4b807da0930236d4cc94dc
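In case the link goes stale, here is a minimal sketch of the same idea. The name `utf8_to_wstring_iconv` is made up for illustration, and the sketch assumes an iconv implementation that accepts the encoding names "UTF-8" and "WCHAR_T" (glibc and GNU libiconv do) on a platform where wchar_t is 32 bits wide, as on Linux.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into a std::wstring via iconv.
// Assumption: the iconv in use knows the names "UTF-8" and "WCHAR_T".
std::wstring utf8_to_wstring_iconv(const std::string& utf8) {
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");  // (tocode, fromcode)
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    // n bytes of UTF-8 never decode to more than n wide characters.
    std::wstring out(utf8.size(), L'\0');
    char* in_ptr = const_cast<char*>(utf8.data());
    std::size_t in_left = utf8.size();
    char* out_ptr = reinterpret_cast<char*>(&out[0]);
    std::size_t out_left = out.size() * sizeof(wchar_t);

    std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) throw std::runtime_error("invalid or truncated UTF-8");

    // Drop the part of the buffer that iconv did not fill.
    out.resize(out.size() - out_left / sizeof(wchar_t));
    return out;
}
```

With the bytes loaded from the file into `utf8Bytes`, the call would be `std::wstring result = utf8_to_wstring_iconv(utf8Bytes);`.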

Mooing Duck
  • I have to [object](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability) to that. `mbtowc` (I think you mean `mbstowcs`) has no notion of an explicit encoding. Its purpose is **not** to convert from UTF8. You really want to use a library that deals with explicit encodings, such as `iconv()`. The only purpose of the `mb/wc` functions in ANSI-C is to convert between `char` and `wchar_t`, *in a platform-dependent fashion*. – Kerrek SB Sep 15 '11 at 23:58
  • @Kerrek: I originally had MultiByteToWideChar, but then found http://www.cplusplus.com/reference/clibrary/cstdlib/mbstowcs/ when looking for something cross-platform – Mooing Duck Sep 16 '11 at 00:00
  • `mbstowcs` doesn't do what you claim. Its purpose is a different one. It may *work* in practice, but not because it's guaranteed to. – Kerrek SB Sep 16 '11 at 00:06
  • @Kerrek: I misread the docs. I switched to OS-dependent code, but I've never worked with Linux before, so that's just a guess. – Mooing Duck Sep 16 '11 at 00:10
  • @Mooing: OK - I wouldn't call `iconv()` that terribly OS dependent -- it's widely available, at *least* POSIX and Windows. The crux is, one simply has to acknowledge that C++ doesn't have any built-in definite encoding handling in the standard, and so *any* such issue will require an encoding handling library. `iconv` is a very good choice. – Kerrek SB Sep 16 '11 at 00:12
  • @Kerrek: http://en.wikipedia.org/wiki/Iconv `Under Windows, the iconv binary (and thus, likely also the API) is provided by the Cygwin and GnuWin32 environments.` I tend to think of things that don't support Linux/Windows as "OS dependent" and things that do both as "OS independent" even though I know it's not accurate :( – Mooing Duck Sep 16 '11 at 00:18
  • Mooing: I'm fairly sure I've seen standalone versions of `iconv` -- let me check, though, I guess this would actually be quite important to settle. *Edit.* Ah, [here we go](http://gnuwin32.sourceforge.net/packages/libiconv.htm). – Kerrek SB Sep 16 '11 at 00:25
1

There's no such thing as the One True C++ String class. The STL alone has std::string and std::wstring. That said, most string classes have a constructor that takes a raw byte pointer as a parameter. The bytes come in the const char * form. So, a good example of your C++functionX is the constructor std::string::string(const char*, size_t), which takes the pointer and the number of bytes.
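As a rough sketch of that constructor in use (the wrapper name `bytes_to_string` is only for illustration), note that passing the length explicitly matters: the bytes of a Java string may legitimately contain a zero byte, and the single-argument const char* constructor would stop there.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for "C++functionX": copy a raw byte buffer into a
// std::string. The explicit length keeps embedded '\0' bytes intact.
std::string bytes_to_string(const char* bytes, std::size_t length) {
    return std::string(bytes, length);
}
```

The result holds exactly the bytes Java produced, in whatever encoding was passed to getBytes(); no decoding takes place.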

Note the encoding issues. getBytes() optionally takes an encoding as a parameter (without one, it uses the platform default); you had better match that on the C++ side, or you'll get garbage. If not sure, use UTF-8.

Depending on what kinds of Java strings you have, you might want to choose either regular or wide strings (e.g. std::wstring). The latter is a slightly better representation of what Java String offers.
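If you want to keep Java's 16-bit code units as they are, one option is a string of char16_t (std::u16string in C++11). The sketch below is an illustration only: the function name is made up, and it assumes the Java side called getBytes("UTF-16BE"), which writes big-endian code units without a byte-order mark.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: assemble big-endian UTF-16 bytes into UTF-16 code units.
// Assumption: the bytes come from getBytes("UTF-16BE"), so there is no BOM.
std::u16string utf16be_to_u16string(const unsigned char* bytes, std::size_t length) {
    std::u16string out;
    out.reserve(length / 2);
    // Each code unit is two bytes, high byte first; a trailing odd byte is ignored.
    for (std::size_t i = 0; i + 1 < length; i += 2) {
        out.push_back(static_cast<char16_t>((bytes[i] << 8) | bytes[i + 1]));
    }
    return out;
}
```

Surrogate pairs are copied through unchanged, so the result has the same length() as the Java String.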

Seva Alekseyev
  • According to http://download.oracle.com/javase/6/docs/api/java/lang/String.html, `A String represents a string in the UTF-16 format`, which is like `std::wstring` _sometimes_. `std::wstring` is 16 bytes on Windows, but not Linux. – Mooing Duck Sep 15 '11 at 23:02
  • Depending on what the OP wants to do to his strings, and what's the nature of the content, a single-byte string in UTF-8 might do just as well. You'd be surprised how many text processing tasks are quite ASCII-friendly. – Seva Alekseyev Sep 15 '11 at 23:04
  • I don't know about the STL, but the C++ standard library has 4 string classes (all specializations of `std::basic_string`): `std::string`, `std::wstring`, `std::u16string`, and `std::u32string`. In this case, I think `std::u16string` would fit very nicely :) – R. Martinho Fernandes Sep 15 '11 at 23:10
  • 1
    @Mooing: 16 **bytes**? That's one wide character. – Ben Voigt Sep 15 '11 at 23:14
  • depending on what he's doing he may run into endian issues with that, but it might be the best fit. – Mooing Duck Sep 15 '11 at 23:19
  • @Mooing duck: you can assume that I am handling endianness of bytes correctly. – Jus12 Sep 16 '11 at 17:05
1

C++, as far as the standard goes, doesn't know about encodings. Java does. So, to interface the two, make Java emit some well-defined encoding, such as UTF8:

byte[] utf8str = str.getBytes("UTF8");

In C++, use a library such as iconv() to transform the UTF-8 string in one of two ways. Either convert it into another string of a well-defined encoding (e.g. std::u32string with UTF-32 if you have C++11, or std::basic_string<uint32_t> or std::vector<uint32_t> otherwise), or convert it to the WCHAR_T encoding, store it in a std::wstring, and, if you wish to interface with your environment, convert that further to a multi-byte string via the standard function wcstombs().

The choice depends on what you need to do with the string. For serialization or text processing, go with the definite encoding (e.g. UTF-32). For writing to the standard output using the system's locale, use the multibyte conversion. (Here is a slightly longer discussion of encodings in C++.)
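To make the definite-encoding route concrete, here is a minimal sketch of what a UTF-8 to UTF-32 conversion does. It is a hand-written decoder rather than a call into iconv(), and the name `decode_utf8` is made up for illustration; for production use a library remains the safer choice.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points by hand.
// Rejects truncated sequences, bad continuation bytes, overlong forms,
// surrogates and values above U+10FFFF.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;         // code point being assembled
        std::size_t extra;   // number of continuation bytes expected
        char32_t min_cp;     // smallest value allowed for this sequence length
        if (b0 < 0x80)                { cp = b0;        extra = 0; min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min_cp = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra) throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
            cp = (cp << 6) | (b & 0x3F);  // append the low six payload bits
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("overlong or out-of-range code point");
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each element of the result is one Unicode code point, so a character outside the Basic Multilingual Plane occupies one char32_t here but two char values in the Java String.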

Kerrek SB
0

The C++ string should probably be a std::wstring instance, and you would also need to keep track of the encoding you use to transform from JavaString to bytes.

This article will probably help you more:

std::wstring VS std::string

Mihai Toader