Optimizing several million char* to string conversions

Question

I have an application that needs to take in several million char*'s as an input parameter (typically strings less than 512 characters (in unicode)), and convert and store them as .net strings.

It turning out to be a real bottleneck in the performance of my application. I'm wondering if there's some design pattern or ideas to make it more effecient.

There is a key part that makes me feel like it can be improved: There are a LOT of duplicates. Say 1 million objects are coming in, there might only be like 50 unique char* patterns.

For the record, here is the algorithm i'm using to convert char* to string (this algorithm is in C++, but the rest of the project is in C#)

String ^StringTools::MbCharToStr ( const char *Source ) 
{
   String ^str;

   if( (Source == NULL) || (Source[0] == '\0') )
   {
      str = gcnew String("");
   }
   else
   {
      // Find the number of UTF-16 characters needed to hold the
      // converted UTF-8 string, and allocate a buffer for them.
      const size_t max_strsize = 2048;

      int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, -1, NULL, 0);
      if (wstr_size < max_strsize)
      {
         // Save the malloc/free overhead if it's a reasonable size.
         // Plus, KJN was having fits with exceptions within exception logging due
         // to a corrupted heap.

         wchar_t wstr[max_strsize];

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
         str = gcnew String (wstr);
      }
      else
      {
         wchar_t *wstr = (wchar_t *)calloc (wstr_size, sizeof(wchar_t));
         if (wstr == NULL) 
            throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);

         // Convert the UTF-8 string into the UTF-16 buffer, construct the
         // result String from the UTF-16 buffer, and then free the buffer.

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
         str = gcnew String ( wstr );
         free (wstr);
      }
   }
   return str;
}

That looks like C++/CLI or C++/CX rather than C++. I'm not changing the tag only because I don't know which. — bames53, Jan 14 '13 at 19:54
So you want to end up with only your 50 or so C# strings and a million references to them? — John Carter, Jan 14 '13 at 20:00
Its C++/CLI, and yes, I might have 1 million references, its a collection of tests over time. — greggorob64, Jan 14 '13 at 20:11

score 5 · Accepted Answer · edited May 23 '17 at 12:24

5

You could use each character from the input string to feed a trie structure. At the leaves, have a single .NET string object. Then, when a char* comes in that you've seen previously, you can quickly find the existing .NET version without allocating any memory.

Pseudo-code:

start with an empty trie,
process a char* by searching the trie until you can go no further
add nodes until your entire char* has been encoded as nodes
at the leaf, attach an actual .NET string

The answer to this other SO question should get you started: How to create a trie in c#

edited May 23 '17 at 12:24

Community

1
1

answered Jan 14 '13 at 19:54

jimbo

11,004
6
29
46

I think this will be a solid implementation that should work well. – greggorob64 Jan 14 '13 at 20:06

score 3 · Answer 2 · answered Jan 14 '13 at 19:53

3

There is a key part that makes me feel like it can be improved: There are a LOT of duplicates. Say 1 million objects are coming in, there might only be like 50 unique char* patterns.

If this is the case, you may want to consider storing the "found" patterns within a map (such as using a std::map<const char*, gcroot<String^>> [though you'll need a comparer for the const char*), and use that to return the previously converted value.

There is an overhead to storing the map, doing the comparison, etc. However, this may be mitigated by the dramatically reduced memory usage (you can reuse the managed string instances), as well as saving the memory allocations (calloc/free). Also, using malloc instead of calloc would likely be a (very small) improvement, as you don't need to zero out the memory prior to calling MultiByteToWideChar.

answered Jan 14 '13 at 19:53

Reed Copsey

554,122
78
1,158
1,373

I'll definitely switch over from malloc to calloc. The mapping sounds pretty similar to the tree implementation, but since I have access to .net datatypes (the C++ i meantioned was C++.net, not standard C++), I might be able to use their map types. – greggorob64 Jan 14 '13 at 20:04
@greggorob64 You won't be able to easily work with the .net collections with the native type as the key. Using `std::map` with the value being a `gcroot` will work without a custom built type, and give you the same `log(n)` access time as a trie. ;) – Reed Copsey Jan 14 '13 at 20:17
@Reed: tries are `O(1)` with respect to the number of strings, not `O(lg n)`. – Billy ONeal Jan 14 '13 at 20:24
@BillyONeal Oh, true. Probably doesn't matter with 50 inputs, but yeah, my mistake ;) – Reed Copsey Jan 14 '13 at 20:46

score 2 · Answer 3 · answered Jan 14 '13 at 20:00

I think the first optimization you could make here would be to make your first try calling MultiByteToWideChar start with a buffer instead of a null pointer. Because you specified CP_UTF8, MultiByteToWideChar must walk over the whole string to determine the expected length. If there is some length which is longer than the vast majority of your strings, you might consider optimistically allocating a buffer of that size on the stack; and if that fails, then going to dynamic allocation. That is, move the first branch if your if/else block outside of the if/else.

You might also save some time by calculating the length of the source string once and passing it in explicitly -- that way MultiByteToWideChar doesn't have to do a strlen every time you call it.

That said, it sounds like if the rest of your project is C#, you should use the .NET BCL class libraries designed to do this rather than having a side by side assembly in C++/CLI for the sole purpose of converting strings. That's what System.Text.Encoding is for.

I doubt any kind of caching data structure you could use here is going to make any significant difference.

Oh, and don't ignore the result of MultiByteToWideChar -- not only should you never cast anything to void, you've got undefined behavior in the event MultiByteToWideChar fails.

I'll look into the system.text.encoding namespaces. When we first started using .net, we just used the standard string contstuctor: new string(char* input). This crapped out rather quickly with wide characters, which is why found the implementation mentioned above and used that. The correct solution definately is using the libraries givent hough. — greggorob64, Jan 14 '13 at 20:08

score 1 · Answer 4 · answered Jan 14 '13 at 19:52

1

I would probably use a cache based on a ternary tree structure, or similar, and look up the input string to see if it's already converted before even converting a single character to .NET representation.

answered Jan 14 '13 at 19:52

500 - Internal Server Error

28,327
8
59
66

Optimizing several million char* to string conversions

4 Answers4