I was writing some unit-tests when I stumbled upon a scenario that managed to bug me a couple of times already.

I need to generate some strings for testing a JSON writer object. Since the writer supports both UTF16 and UTF8 inputs, I want to test it with both.

Consider the following test:

#include <map>
#include <string>

// Tag types used only to select the source encoding
class UTF8;
class UTF16;

template < typename String, typename SourceEncoding >
void writeJson(std::map<String, String> & data)
{
    // Write to file
}

void generateStringData(std::map<std::string, std::string> & data)
{
    data.emplace("Lorem", "Lorem Ipsum is simply dummy text of the printing and typesetting industry.");
    data.emplace("Ipsum", "Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book");
    data.emplace("Contrary", "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old");
}

void generateStringData(std::map<std::wstring, std::wstring> & data)
{
    data.emplace(L"Lorem", L"Lorem Ipsum is simply dummy text of the printing and typesetting industry.");
    data.emplace(L"Ipsum", L"Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book");
    data.emplace(L"Contrary", L"Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old");
}

template < typename String, typename SourceEncoding >
void testWriter() {
    std::map<String, String> data;
    generateStringData(data);
    writeJson<String, SourceEncoding>(data);
}

int main() {
    testWriter<std::string, UTF8>();
    testWriter<std::wstring, UTF16>();
}

I managed to wrap everything nicely except for the duplicated generateStringData() method, and I was wondering whether it's possible to combine both generateStringData() methods into a single one.

I know I could use a single method to generate strings in UTF8 and then use an additional method to convert the strings to UTF16, but I'm trying to find out if there's another way.

What have I considered/tried?

  • Using _T() or TCHAR or #ifdef UNICODE won't help, since I need both flavors on the same platform that supports Unicode (e.g. Win >= 7)
  • Initializing std::wstring from something that is not L"" won't work since it expects a wchar_t
  • Initializing char by char won't work since it also requires L''
  • Using ""s won't work since the return type depends on type charT
Daniel Trugman
  • UTF-8 and UTF-16 strings do not contain the same bytes for a given non-ASCII text. Your test cases do only contain 7-bit ASCII, they are useless. –  Oct 10 '17 at 11:49
  • @manni66, these are some dummy values, this is not the essence of this question... – Daniel Trugman Oct 10 '17 at 11:51
  • It's recommended to use UTF-8 to read/write web related data. If you must use UTF-16 then write the file in UTF-8 line by line, and run a final UTF-16 conversion on the whole file (that fails however if the file is larger than 2 gig). – Barmak Shemirani Oct 10 '17 at 15:38
  • @BarmakShemirani, this is not web related data. This is information acquired from Windows, which uses wchar_t by default for some APIs. – Daniel Trugman Oct 10 '17 at 15:53
  • The pre-processor is the only way to go, I'm afraid. [I had a similar issue once](https://stackoverflow.com/questions/13275015/c-preprocessor-literal-construction). – StoryTeller - Unslander Monica Oct 10 '17 at 18:29
  • @StoryTeller, thanks for the tip, at least that way I can make sure my strings are the same for both encodings. – Daniel Trugman Oct 10 '17 at 18:34

2 Answers

The short answer is no, you cannot merge the two generateStringData() implementations together.

One is required to output char data, and the other is required to output wchar_t data. You could use #define macros to reduce the duplication of common string literals in code, but you still need the L prefix in the wchar_t implementation, and preferably the u8 prefix in the char implementation (to ensure the data is actually UTF-8 and not compiler-defined; note that since C++20 a u8 literal has type const char8_t[], so it no longer initializes a std::string directly). Either way, you will still end up with separate strings in memory at runtime.
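As a rough sketch of the macro approach described above, an X-macro can list every key/value pair once and expand it with or without the L prefix. The macro names (TEST_ENTRIES, EMPLACE_NARROW, EMPLACE_WIDE), the checkGeneratedData() helper and the shortened sample text are made up for illustration. The narrow literals are left unprefixed here, which assumes the compiler's narrow execution charset is UTF-8 (or that the text is plain ASCII).

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical X-macro: each key/value pair is written exactly once.
#define TEST_ENTRIES(X) \
    X("Lorem", "Lorem Ipsum is simply dummy text.") \
    X("Ipsum", "Ipsum has been the industry's standard dummy text.")

// Narrow variant: the literals are used as written.
#define EMPLACE_NARROW(key, value) data.emplace(key, value);
// Wide variant: token pasting turns "text" into L"text".
#define EMPLACE_WIDE(key, value) data.emplace(L##key, L##value);

void generateStringData(std::map<std::string, std::string>& data) {
    TEST_ENTRIES(EMPLACE_NARROW)
}

void generateStringData(std::map<std::wstring, std::wstring>& data) {
    TEST_ENTRIES(EMPLACE_WIDE)
}

// Sanity check: both overloads are fed from the same list of entries.
void checkGeneratedData() {
    std::map<std::string, std::string> narrow;
    std::map<std::wstring, std::wstring> wide;
    generateStringData(narrow);
    generateStringData(wide);
    assert(narrow.size() == wide.size());
}
```

The two overloads remain, as stated above, but the text itself lives in one place, so the narrow and wide test data cannot drift apart.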

Even if you were to use a template to try to merge the two implementations, you would end up needing to use template specialization to separate the two output types.

You are best off just using the overloads you already have (possibly with #defines to reduce duplicates in code), or else performing a UTF conversion at runtime (which you wanted to avoid). In the latter case, you could reduce the overhead of your test runs by performing those conversions one time at app startup and caching the results for reuse.
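The convert-once-and-cache idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: utf8ToUtf16() is a hand-written decoder that trusts its input to be well-formed UTF-8 (it performs no validation), sampleUtf8Data() stands in for the real generator, and std::u16string is used instead of std::wstring so that the code units are 16 bits wide on every platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 into UTF-16 code units.
// Malformed input is NOT detected; this is a sketch, not a validator.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 1;
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Fold in the six payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Stand-in for the real UTF-8 test data generator.
std::map<std::string, std::string> sampleUtf8Data() {
    std::map<std::string, std::string> data;
    data.emplace("Lorem", "Lorem Ipsum");
    data.emplace("Cafe", "Caf\xC3\xA9");  // "Café" spelled out as UTF-8 bytes
    return data;
}

// Converted once on first use, then reused by every test.
const std::map<std::u16string, std::u16string>& cachedUtf16Data() {
    static const std::map<std::u16string, std::u16string> cache = [] {
        std::map<std::u16string, std::u16string> converted;
        for (const auto& entry : sampleUtf8Data()) {
            converted.emplace(utf8ToUtf16(entry.first),
                              utf8ToUtf16(entry.second));
        }
        return converted;
    }();
    return cache;
}
```

The function-local static is initialized the first time cachedUtf16Data() is called, so the conversion cost is paid once per test run rather than once per test.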

Remy Lebeau

If you need only plain ASCII encoded as chars and wchar_ts, then you can do it with a function template (without specialization):

#include <cstring>   // std::strlen
#include <iostream>
#include <map>
#include <string>
#include <utility>

template <typename StringType>
void generateStringData(std::map<StringType, StringType> &data) {
  static const std::pair<const char *, const char *> entries[] = {
    { "Lorem", "Lorem Ipsum is simply dummy text ..."},
    { "Ipsum", "Ipsum has been the industry's standard ..."}
  };
  for (const auto &entry : entries) {
    data.emplace(StringType(entry.first, entry.first + std::strlen(entry.first)),
                 StringType(entry.second, entry.second + std::strlen(entry.second)));
  }
}

int main() {
  std::map<std::string, std::string> ansi;
  generateStringData(ansi);
  std::map<std::wstring, std::wstring> wide;
  generateStringData(wide);

  std::cout << ansi["Lorem"] << std::endl;
  std::wcout << wide[L"Lorem"] << std::endl;
  return 0;
}

This works only because widening an ASCII character to wchar_t preserves its value (wchar_t is 16 bits on Windows and typically 32 bits elsewhere). If you have "interesting" (non-ASCII) characters in the source strings, this will not convert them to proper UTF-16.
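To make that limitation concrete, here is a small sketch; widenBytes(), widensCorrectly() and demonstrateWidening() are hypothetical helpers, not part of the code above. It widens a string byte by byte, the same way the range constructor does, and compares the result with the expected wide text.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the range construction used above:
// each char is implicitly converted to the destination character type.
template <typename StringType>
StringType widenBytes(const std::string& source) {
    return StringType(source.begin(), source.end());
}

// Returns true when the byte-wise widened copy equals the expected wide text.
bool widensCorrectly(const std::string& source, const std::wstring& expected) {
    return widenBytes<std::wstring>(source) == expected;
}

void demonstrateWidening() {
    // Plain ASCII survives the per-character conversion.
    assert(widensCorrectly("Lorem", L"Lorem"));
    // U+00E9 is two bytes in UTF-8 (C3 A9). Widening yields two wchar_t
    // values instead of the single code unit the wide literal contains.
    assert(!widensCorrectly("\xC3\xA9", L"\u00E9"));
}
```

The second check fails to match regardless of whether char is signed, because the widened string always has two elements while the expected one has a single element.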

Also note that you'll almost certainly end up with four copies of the strings in memory: two copies of the ASCII source strings in your executable (from the two instantiations of the function template), and the char and wchar_t copies in the heap.

But this might not be any worse than the preprocessor version. Using the preprocessor, you'll likely end up with both char and wchar_t versions in the executable as well as the char and wchar_t copies in the heap.

What the preprocessor approach can do is help you get around that big if at the top of this answer; with the preprocessor, you can use non-ASCII characters.

[Implementation note: Originally those assignments used std::begin(entry.first) and std::end(entry.first), but that included the string terminators as part of the string itself.]

Adrian McCarthy
  • This surprised me a lot... I thought that one could never initialize a `wstring` using a `const char*` literal until I read this... – Saddle Point Jan 13 '18 at 10:29
  • @Edityouprofile: It's not actually initializing the `wstring` from `const char *`. It's copying individual character values from a range of `char`s and relying on implicit conversion to turn them into `wchar_t`s (or whatever the destination character type is). For true 7-bit ASCII values, the implicit conversion does the right thing. – Adrian McCarthy Jan 14 '18 at 01:30