Here's some code I found in a very old C library that's trying to eat whitespace from a file...

  while(
    (line_buf[++line_idx] != ' ')  &&
    (line_buf[  line_idx] != '  ') &&
    (line_buf[  line_idx] != ',')  &&
    (line_buf[  line_idx] != '\0') )
  {

This great thread explains what the problem is, but most of the answers are "just ignore it" or "you should never do this". What I don't see, however, is the canonical solution. Can anyone offer a way to code this test using the "proper way"?

UPDATE: to clarify, the question is "what is the proper way to test for the presence of a string of one or more characters at a given index in another string". Forgive me if I am using the wrong terminology.

Maury Markowitz
  • Are you sure this "eats" whitespace? Looks like it eats non-space. – chux - Reinstate Monica Feb 28 '18 at 00:07
  • If the code is accessing a single byte, I don't see how the second condition, which uses a multi-char literal, could ever be true. Is line_buf a pointer to char? – Jonathon Reinhart Feb 28 '18 at 00:08
  • "Eating" _whitespace_ is more like `while (isspace(line_buf[line_idx])) line_idx++;` – chux - Reinstate Monica Feb 28 '18 at 00:10
  • The canonical solution to what? You can *probably* just delete the second condition, since it's not clear what it was ever intended to do, and it's reasonably clear that it will never, ever fire. – Steve Summit Feb 28 '18 at 00:11
  • Use caution when cutting out magic. That code exists because at some time in the past (and possibly in the present) that value existed in `line_buf[]`. Figure out what the original problem was and fix _that_. Don’t just remove code you think is misplaced. – Dúthomhas Feb 28 '18 at 00:46
  • @Dúthomhas: Not necessarily. The person who wrote that line of code might not have had a valid reason for it. It could have been a copy-and-paste error, for example. – Keith Thompson Feb 28 '18 at 01:06
  • @JonathonReinhart: It *could* be true because the value of a multi-character constant is implementation defined. If I use tcc, the value of `' '` (two spaces) is the same as the value of `' '` (one space). (Those aren't being rendered correctly in this comment, but you get the idea.) – Keith Thompson Feb 28 '18 at 01:06
  • @TianjiaoHuang: Actually, the `&&` and `||` operators are different and do enforce order. In your link look at Rules #2. – Zan Lynx Feb 28 '18 at 02:23
  • @KeithThompson It is a very dangerous philosophy to look at old code and assume that any one piece of it is a mistake. – Dúthomhas Feb 28 '18 at 02:43
  • @Dúthomhas: In this case, I disagree. If `line_buf` is an array of `char`, there is no way that comparing an element to `'  '` (2 spaces) could make sense. It's equally dangerous to assume that old code *isn't* a mistake. It could be a typo, or a copy-and-paste error, or the author might have wrongly thought that the code would check for two spaces (which it absolutely does not do). – Keith Thompson Feb 28 '18 at 03:21
  • @JonathonReinhart, probably the second case is not a space, but a literal tab character, not encoded as `\t`, but as a literal tab between two quote chars. – Luis Colorado Feb 28 '18 at 08:51
  • @LuisColorado - I believe you are correct. It is two literal spaces in the code I have, but that might have been something that happened anywhere along the line. Assuming that is the case, is replacing it with `\t` the "best solution"? – Maury Markowitz Feb 28 '18 at 12:22

2 Answers

Original question

There is no canonical or correct way. Multi-character constants have always been implementation defined. Look up the documentation for the compiler used when the code was written and figure out what was meant.
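If, as suggested in the comments, the two-space literal started life as a literal tab between quotes, the test can be written with `'\t'`. The following is only a sketch under that assumption. The helper name `next_separator` is hypothetical, and unlike the original loop it tests the current character before advancing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the first separator at or
 * after line_idx, or the index of the terminating '\0' if none is found.
 * ASSUMPTION: the multi-character constant in the original was meant
 * to be a tab, so it is spelled '\t' here. */
static size_t next_separator(const char *line_buf, size_t line_idx)
{
    assert(line_buf != NULL);
    while (line_buf[line_idx] != ' '  &&
           line_buf[line_idx] != '\t' &&
           line_buf[line_idx] != ','  &&
           line_buf[line_idx] != '\0')
    {
        line_idx++;
    }
    return line_idx;
}
```

If the original compiler gave the constant some other value, this rewrite changes behavior, so it is worth checking against real input files first.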

Updated question

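You can test a character against a set of characters using strchr(). To keep the sense of the original loop (keep going until a separator is found), the result has to be negated. Because strchr() also matches the terminating '\0', the negated test stops at the end of the string as well.

while (!strchr( " ,", line_buf[++line_idx] ))
{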

Again, this does not account for that multi-char constant. You should figure out why that was there before simply removing it.
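As a self-contained illustration of the strchr() approach, here is a minimal sketch. The function names and the `seps` parameter are hypothetical and not part of the original library. It relies on the fact that strchr() considers the terminating null character part of the string, so searching for `'\0'` always succeeds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if c is one of the characters in seps, or the terminating '\0'.
 * strchr(seps, '\0') returns a pointer to the terminator of seps,
 * so the end-of-string case needs no separate test. */
static int is_sep_or_end(char c, const char *seps)
{
    return strchr(seps, c) != NULL;
}

/* Hypothetical helper: advance line_idx to the next separator or to
 * the end of the string, whichever comes first. */
static size_t skip_to_separator(const char *line_buf, size_t line_idx,
                                const char *seps)
{
    assert(line_buf != NULL && seps != NULL);
    while (!is_sep_or_end(line_buf[line_idx], seps))
        line_idx++;
    return line_idx;
}
```

Passing the separator set as a string keeps the list of characters in one place, so adding a tab later only means changing `" ,"` to `" \t,"`.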

Also, strchr() does not handle Unicode. If you are dealing with a UTF-8 stream, for example, you will need a function capable of handling it.

Finally, if you are concerned about speed, profile. The compiler might get you better results using the three (or four) individual test expressions in the ‘while’ condition.

In other words, the multiple tests might be the best solution!

Beyond that, I smell some uncouth indexing: the way that line_idx is updated depends on the surrounding code to actuate the loop properly. Make sure that you don’t create an off-by-one error when you update stuff.
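To make the indexing concern concrete, this sketch contrasts the pre-increment form from the question with a test-then-advance form. Both function names are hypothetical. The pre-increment form never examines the character at the starting index, and it must not be called when that character is already `'\0'`, or it reads past the terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Pre-increment form, as in the question: line_buf[start] is skipped
 * without being tested. The && operator sequences the increment before
 * the later reads of i. */
static size_t scan_preinc(const char *line_buf, size_t start)
{
    size_t i = start;
    assert(line_buf != NULL && line_buf[start] != '\0');
    while (line_buf[++i] != ' ' &&
           line_buf[i]   != ',' &&
           line_buf[i]   != '\0')
    {
        /* nothing to do: the condition advances i */
    }
    return i;
}

/* Test-then-advance form: line_buf[start] is examined first. */
static size_t scan_test_first(const char *line_buf, size_t start)
{
    size_t i = start;
    assert(line_buf != NULL);
    while (line_buf[i] != ' ' && line_buf[i] != ',' && line_buf[i] != '\0')
        i++;
    return i;
}
```

Starting on a separator, the first form moves on to the next one while the second stays put, so swapping one for the other changes what the surrounding code sees.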

Good luck!

Dúthomhas
  • Speed is not an issue, the largest file I've seen, which was WAY on the long tail, was around 90k. This is pure ASCII, or at least it's *supposed* to be, so strchr seems like the right solution. – Maury Markowitz Feb 28 '18 at 12:20

UPDATE: to clarify, the question is "what is the proper way to test for the presence of a string of one or more characters at a given index in another string". Forgive me if I am using the wrong terminology.

Well, there are a number of ways, but the standard way is using strspn which has the prototype:

size_t strspn(const char *s, const char *accept);

and it cleverly:

calculates the length (in bytes) of the initial segment of s 
which consists entirely of bytes in accept.

This allows you to test for "the presence of a string of one or more characters at a given index in another string" and tells you how many of the characters from that string were sequentially matched.

For example, if you had another string, say char *s = "somestring";, and wanted to know if it contained the letters r, s, t (say, in char *accept = "rst";) beginning at the 5th character, you could test:

size_t n;
if ((n = strspn (&s[4], accept)) > 0)
    printf ("matched %zu chars from '%s' at beginning of '%s'\n",
           n, accept, &s[4]);

To compare in order, you can use strncmp (&s[4], accept, strlen (accept));. You can also simply use nested loops to iterate over s with the characters in accept.
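The three standard-library approaches can be put side by side in a small sketch. The wrapper names are hypothetical. `strcspn` is the complement of `strspn` (it counts leading characters *not* in the set) and is included on the assumption that the original loop's real goal was to find the next separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the run of characters drawn from accept, starting at s[idx]. */
static size_t run_length_at(const char *s, size_t idx, const char *accept)
{
    assert(s != NULL && accept != NULL && idx <= strlen(s));
    return strspn(s + idx, accept);
}

/* Nonzero if the exact string needle occurs at s[idx], in order. */
static int matches_at(const char *s, size_t idx, const char *needle)
{
    assert(s != NULL && needle != NULL && idx <= strlen(s));
    return strncmp(s + idx, needle, strlen(needle)) == 0;
}

/* Index of the first character at or after idx that is in reject,
 * or the index of the terminating '\0' if there is none. */
static size_t next_in_set(const char *s, size_t idx, const char *reject)
{
    assert(s != NULL && reject != NULL && idx <= strlen(s));
    return idx + strcspn(s + idx, reject);
}
```

With `s = "somestring"`, `run_length_at(s, 4, "rst")` is 3 because `s`, `t`, `r` all belong to the set, while `matches_at(s, 4, "rst")` is 0 because the characters at that position are in a different order.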

All of the ways are "proper", so long as they do not invoke Undefined Behavior (and are reasonably efficient).

David C. Rankin
  • So the difference is that this will return 2 for the test "GW", whereas strchr would return (say) 11, the location where it starts. Yes, this seems like the more useful solution. – Maury Markowitz Feb 28 '18 at 13:01
  • yes, because it will not skip non-matching characters -- which is why you give it the *address within* the string you wish to test. – David C. Rankin Feb 28 '18 at 14:09