5

I need to search through a chunk of memory for a string of characters, but several of these strings have every character null separated, like this:
"I. .a.m. .a. .s.t.r.i.n.g"
with all of the '.'s being null characters. My problem comes from actually getting this into memory. I've tried several ways, for instance:

 char* str2; 
 str2 = (char*)malloc(sizeof(char)*40);   
 memcpy((void*)str2, "123\0567\09abc", 12);    

Will put the following into the memory that str2 points to: 123.7.9abc..
Something like
str2 = "123456789\0abcde\054321";
Will have str2 pointing to a block of memory that looks like 123456789.abcde,321 , wherein the '.' is a null character, and the ',' is an actual comma.

So clearly inserting null characters into cstrings doesn't work as easily as I thought it did, like inserting a newline character. I encountered similar difficulties trying this with the string library as well. I could do separate assignments, something like:

 char* str;    
 str = (char*)malloc(sizeof(char)*40);  
 strcpy(str, "123");  
 strcpy(str+4, "abc");  
 strcpy(str+8, "ABC");  

But that is certainly not preferable, and I believe the problem lies in my understanding of how c-style strings are stored in memory. Clearly "abc\0123" doesn't actually go into memory as 61 62 63 00 31 32 33 (in hex). How is it stored, and how can I store what I need to?

(I also apologize for not having set the code in blocks, this is my first time posting a question, and somehow "four spaced" is more difficult than I can handle apparently. Thank you, Luchian. I see more newlines were needed.)

David Heffernan
  • 601,492
  • 42
  • 1,072
  • 1,490
Fulluphigh
  • 506
  • 1
  • 8
  • 21

4 Answers

6

If every other char contains a null, then almost certainly you actually have UTF-16 encoded strings. Process them accordingly and your problems will disappear.

Assuming you are on Windows, where UTF-16 is common, you would use wchar_t* rather than char* to hold such strings. And you would use wide char string processing functions to operate on such data. For example, use wcscpy rather than strcpy and so on.
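As a rough sketch of what the search could look like when the memory holds UTF-16 text, the function below scans a raw byte buffer for the little-endian UTF-16 form of an ASCII needle. The name `find_utf16le_ascii` and its signature are hypothetical and not part of any library. It assumes little-endian data (as on Windows) and a needle made only of ASCII characters, and it works on bytes so that it does not depend on the size of `wchar_t`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original post): find the UTF-16LE
 * encoding of an ASCII needle inside a raw byte buffer.
 * Returns the byte offset of the first match, or -1 if there is none. */
long find_utf16le_ascii(const unsigned char *hay, size_t hay_len,
                        const char *needle)
{
    size_t n = strlen(needle);      /* characters in the needle */
    size_t need_bytes = n * 2;      /* two bytes per UTF-16 code unit */

    if (n == 0 || need_bytes > hay_len)
        return -1;

    for (size_t i = 0; i + need_bytes <= hay_len; i++) {
        size_t k = 0;
        /* Low byte is the ASCII code, high byte is zero (little-endian). */
        while (k < n &&
               hay[i + 2 * k] == (unsigned char)needle[k] &&
               hay[i + 2 * k + 1] == 0)
            k++;
        if (k == n)
            return (long)i;
    }
    return -1;
}
```

Calling it with a buffer, its length in bytes, and a needle such as "string" gives the byte offset where the wide form of the needle starts. If the memory is known to be a properly aligned, null-terminated wide string, the standard `wcsstr` is the simpler choice.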

David Heffernan
  • This definitely solved my practical problem, and lets me implement the search easily. This is the first time I've programmed on Windows, and I hadn't encountered wide characters before, so when looking at the binary dumps that never occurred to me. And I kept thinking how weird it was to null-terminate every character... =P Refp below also helped me figure out what was causing the problem I was having doing it the old way. – Fulluphigh Jun 13 '12 at 20:34
3

\0 is the start of an octal escape sequence, not just a "null character" (although used on its own it does produce one). An octal escape consumes up to three octal digits, so any digits 0-7 that follow \0 become part of the escape.


The easiest way to define a string containing a null character followed by something that could also be read as part of an octal escape (such as "\012", see note 1) is to split it up, relying on the fact that C concatenates adjacent string literals:

char const * p = "123456789" "\0" "abcde" "\0" "54321";

1. "\012" results in a single character with the value 0x0A, not three characters (0x00, '1' and '2').
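To make the difference visible, here is a small sketch that formats the bytes of both spellings as hex. The array names `greedy` and `split` and the helper `hex_bytes` are made up for illustration; the expected byte sequences are given in the comments.

```c
#include <stdio.h>
#include <stddef.h>

/* "\012" is ONE octal escape (value 0x0A), so this array holds
 * 'a' 'b' 'c' 0x0A '3' '\0': six bytes. */
static const char greedy[] = "abc\0123";

/* Splitting the literal ends the escape after "\0". Adjacent literals
 * are joined after escapes are processed, so this array holds
 * 'a' 'b' 'c' '\0' '1' '2' '3' '\0': eight bytes. */
static const char split[] = "abc\0" "123";

/* Hypothetical helper: write the n bytes at p into out as two-digit
 * hex values separated by spaces. out must hold at least 3*n + 1 bytes. */
void hex_bytes(const char *p, size_t n, char *out)
{
    out[0] = '\0';
    for (size_t i = 0; i < n; i++)
        out += sprintf(out, i ? " %02x" : "%02x", (unsigned char)p[i]);
}

/* hex_bytes(greedy, sizeof greedy, buf) gives "61 62 63 0a 33 00"       */
/* hex_bytes(split,  sizeof split,  buf) gives "61 62 63 00 31 32 33 00" */
```

Using `sizeof` on the array rather than `strlen` matters here, because `strlen` would stop at the first embedded null.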

Filip Roséen - refp
  • 62,493
  • 20
  • 150
  • 196
  • Excellent answer. David's answer above works perfectly for what I need to do, but I still didn't know what was causing this behavior. I was unaware of octets being something you could even escape in this manner. Thank you, excellent answer. I don't have enough rep to upvote, but yeah. – Fulluphigh Jun 13 '12 at 20:31
  • @Joshua It's a good idea to familiarize yourself with the basic syntax of your programming language. Here's a useful reference for C: http://ieng9.ucsd.edu/~cs30x/Std.C/syntax.html – Jim Balter Jun 13 '12 at 22:07
2

First off, every second byte being a null is a clear hallmark of a wide string: a string composed of two-byte characters, really an array of unsigned shorts. Depending on your compiler and settings, you might be better off using the wchar_t datatype instead of char, and the wcsxxx() family of functions instead of strxxx(). (wchar_t is two bytes wide on Windows; on most Unix-like platforms it is four.)

On Windows, two-byte wide strings (UTF-16, technically) are the native string format of the OS, so they're all over the place.

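That said, strxxx() functions all assume that the string ends at the first null character, so they stop early on data with embedded nulls. Plan accordingly. Sometimes the memxxx() functions, which take an explicit length, will come to the rescue.

"abc\0123" does not go into memory the way you expect because \012 is interpreted by the compiler as a single octal escape sequence: the character with octal code 12 (that's 0a hex). To avoid this, use one of the following literals:

"abc\000123"
"abc\x00" "123"
"abc\0""123"

An octal escape takes at most three digits, so \000 ends before the 1. A hex escape has no such limit: "abc\x00123" would be read as the single out-of-range escape \x00123, so the literal has to be split after \x00.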

The snippet where you generate a string from chunks is mostly correct. It's just that I'd rather use

strcpy(str+strlen(str)+1, "123");

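that guarantees that the next chunk will be written just past the null character of the previous chunk. Note that strlen(str) stops at the first null, so this form only places the second chunk correctly; for later chunks, measure from the start of the previous chunk (or keep a pointer to where it ended) instead of from str.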
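One way to sketch the chunk-by-chunk approach without hard-coded offsets is a small helper that returns where the next chunk should go. The name `append_chunk` and its signature are hypothetical; it assumes the caller tracks how much room is left in the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy chunk, including its terminating null, to
 * dst and return a pointer just past that null. Returns NULL if the
 * chunk does not fit in the cap bytes available at dst. */
char *append_chunk(char *dst, size_t cap, const char *chunk)
{
    size_t n = strlen(chunk) + 1;   /* include the terminator */

    if (n > cap)
        return NULL;
    memcpy(dst, chunk, n);
    return dst + n;
}
```

Starting with `p = str` and calling `p = append_chunk(p, remaining, "123")`, then the same for "abc" and "ABC", leaves the buffer holding 123, a null, abc, a null, ABC, and a null, with `p` pointing just past the last one. Each call needs a NULL check before `p` is reused.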

Seva Alekseyev
  • 59,826
  • 25
  • 160
  • 281
  • 2
    \054 is being interpreted by the compiler as a single octal escape sequence. Octal 54 is hex 2c. To make a true string with an embedded null, use one of the following literals: "abcde\00054321", "abcde\0""54321", or "abcde\x00""54321" (a hex escape absorbs every hex digit that follows it, so it must be split off). The escape sequence parser matches greedily. Stick to string literals that don't allow for ambiguous interpretation. – Seva Alekseyev Jun 13 '12 at 20:29
0

I am a bit confused by your question, but let me guess what is going on. You are looking at a 16-bit wchar_t string and not a normal C string. A wide string holding ASCII characters looks as if there is a null between every letter, but this is normal.

Simply cast with (wchar_t *)XXX, where XXX is a pointer to that region of memory, and look up the wchar_t operations such as wcscpy. As for the nulls between strings, this may be a known method of passing several strings in one block. You can iterate, reading each string in turn, until you encounter two consecutive nulls.
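The iteration described above could look roughly like the sketch below, which walks a block of wide strings that ends with an empty string (two consecutive nulls). The function name `count_multi_strings` is hypothetical. The sketch assumes the block is correctly aligned for wchar_t and really is terminated by a double null; without that guarantee the loop would run off the end of the memory.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: count the strings in a block laid out as
 * L"one\0two\0three\0" followed by the literal's own terminator,
 * i.e. each string is null-terminated and an empty string marks
 * the end of the list. */
size_t count_multi_strings(const wchar_t *block)
{
    size_t count = 0;

    /* Skip over each string and its terminator until an empty one. */
    for (const wchar_t *s = block; *s != L'\0'; s += wcslen(s) + 1)
        count++;
    return count;
}
```

Inside the loop, `s` points at the start of each individual string, so a search with `wcsstr(s, ...)` or a comparison with `wcscmp` can be done there instead of just counting.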

Hope I have answered your question. Good luck!

Tzahi Fadida
  • 198
  • 1
  • 7