2

I'm writing a function that is supposed to read a string of numbers, separated by commas. The format of the string is as following:

"1, 2, 3"

The only "rule" is that the function will tolerate any spaces or tabs, as long as there's one comma between each number.

If the string is valid, the numbers are to be stored in a linked list.

for instance, the following strings are valid:

"1,2,14,2,80"
"  250  ,  1,  88"

but the following are not valid:

" 5, 1, 3 ,"
"51, 60, 5,,9"

I first tried my luck with strtok() (using delimiters ", \t", but from what i understand at the moment, it's impossible to check for errors. So I wrote my own function, but I'm extremely unhappy with it - I think the code is pretty bad, and although it seems to work, I'd really like to know if there is a cleaner, easier way to implement such a function.

My function is:

void sliceNumbers(char * string)
{
  /*flag which marks if we're expecting a comma or not*/
  int comma = FALSE;
  /*Are we inside a number?*/
  int nFlag = TRUE;
  /*error flag*/
  int error = FALSE;
  /*pointer to string start*/
  char * pStart = string;
  /*pointer to string end*/
  char * pEnd = pStart;

  /*if received string is null*/
  if (!string)
  {
    /*add error and exit function*/
    printf("You must specify numbers");
    return;
  }
  /*this loop checks if all characters in the string are legal*/
  while (*pStart != '\0')
  {
    if ((isdigit(*pStart)) || (*pStart == ',') || (*pStart == ' ') || (*pStart == '\t'))
    {
      pStart++;
    }
    else
    {
      char tmp[2];
      tmp[0] = *pStart;
      tmp[1] = 0;
      printf("Invalid character");
      error = TRUE;
      pStart++;
    }
  }
  if (!error)
  {
    pStart = string;
    if (*pStart == ',')
    {
    printf("Cannot start data list with a comma");
    return;
    }
    pEnd = pStart;
    while (*pEnd != '\0')
    {
      if (comma)
      {
        if (*pEnd == ',')
        {
          if (!nFlag)
          {

          }
          if (*(pEnd + 1) == '\0')
          {
            printf("Too many commas");
            return;
          }
          *pEnd = '\0';
          /*Add the number to the linked list*/
          addNumber(pStart, line, DC);
          comma = FALSE;
          nFlag = FALSE;
          pStart = pEnd;
          pStart++;
          pEnd = pStart;
        }
        else if (isdigit(*pEnd))
        {
          if (!nFlag)
          {
            printf("numbers must be seperated by commas");
            pEnd++;
          }
          else
          {
            if (*(pEnd + 1) == '\0')
            {
              pEnd++;
              /*Add the number to the linked list*/
              addNumber(pStart);
              comma = FALSE;
              nFlag = FALSE;
              pStart = pEnd;
              pStart++;
              pEnd = pStart;
            }
            else
            {
              pEnd++;
            }
          }
        }
        else if (*pEnd == '\0')
        {
          if (nFlag)
          {
            /*Add the number to the linked list*/
            addNumber(pStart, line, DC);
          }
          else
          {
            printf("Too many commas");
          }

        }
        else if (*pEnd == ' ' || *pEnd == '\t')
        {
          nFlag = FALSE;
          pEnd++;
        }
      }
      else
      {
        if (*pEnd == ',')
        {
          printf("There must be only 1 comma between numbers");
          return;

        }
        else if (isdigit(*pEnd))
        {
          if (*(pEnd + 1) == '\0')
          {
            pEnd++;
            /*Add the number to the linked list*/
            addnumber(pStart, line, DC);
            comma = FALSE;
            nFlag = FALSE;
            pStart = pEnd;
            pStart++;
            pEnd = pStart;
          }
          else
          {
            pStart = pEnd;
            pEnd++;
            nFlag = TRUE;
            comma = TRUE;
          }
        }
        else if (*pEnd == ' ' || *pEnd == '\t')
        {
          if (!nFlag)
          {
            pEnd++;
          }
          else
          {
            pEnd++;
          }
        }
      }
    }
  }
}
dbc
  • 104,963
  • 20
  • 228
  • 340
fishamit
  • 257
  • 2
  • 10
  • Take a look to [strsep](https://www.gnu.org/software/libc/manual/html_node/Finding-Tokens-in-a-String.html) – David Ranieri Mar 13 '17 at 09:51
  • You tell us what the program should do if the input is valid, but not what to do if it is not valid. I can think of two ways this problem could have been set for you: "Your program must detect invalid input and report an error" or "The input is guaranteed to be valid, so your program doesn't need to handle it". This will affect our answers. – slim Mar 13 '17 at 10:23
  • it should not be a problem to check the return value of `strtok` to determine whether a line is valid or not. if the token returned has length zero, it is invalid if not, add to array by doing `atoi()` on the token. – AndersK Mar 13 '17 at 11:19
  • Construct a state machine. You have ~3 token types and ~5 states. – joop Mar 13 '17 at 11:34

4 Answers4

3

You have defined a number of booleans (although you've declared them as ints) that keep track of the current state. You could combine these into one state variable, using #define to define possible values:

#define STATE_START 0
#define STATE_IN_NUMBER 1
#define STATE_COMMA 2
#define STATE_FINISHED 3
#define STATE_ERROR 4

int state = STATE_START;

You can draw a diagram (a bit like a flow chart) showing how each character moves us from one state to another.

enter image description here

(For my image I've kept things simple and only shown the non-error states for input with no spaces)

Or just put it in words:

current state   | input     | next state| side effect
-----------------------------------------------------------------------
START           | digit     | IN_NUMBER | start storing a number
START           | other     | ERROR     | 
IN_NUMBER       | digit     | IN_NUMBER | continue storing a number
IN_NUMBER       | comma     | COMMA     | complete storing a number
IN_NUMBER       | null      | FINISHED  | finalise output
IN_NUMBER       | other     | ERROR     | report error
COMMA           | digit     | IN_NUMBER | start storing a number
COMMA           | comma     | ERROR     |
COMMA           | other     | ERROR     |

(For my table I've added basic error states, but still haven't accounted for whitespace)

You will need to add some more states and transitions to deal with spaces and tabs, but the principle doesn't change. I'd suggest starting with an implementation that works without spaces, then adding to it.

This allows you to write a finite state machine, one implementation of which looks like this:

int state = STATE_START;
while(state != STATE_FINISHED && state != STATE_ERROR) {
    char c = input[offset++];
    switch(state) {
        case STATE_START:
            state = handleStateStart(...);
            break;
        case STATE_IN_NUMBER:
            state = handleInNumber(...);
            break;
        // etc.
        default:
            sprintf(error_message, "Reached unsupported state: %c", state);
            state = STATE_ERROR;
    }
}

The parameters to the handling functions need to pass in the data structures it will be reading and modifying. For example:

int handleStateStart(
    char c,
    int* current_number,
    char *error_message) 
{
    if( ! isDigit(c)) {
        sprintf(error_message, "Expected a digit at char %d", *offset);
        return STATE_ERROR;
    }
    *current_number = atoi(c);
    return STATE_IN_NUMBER;
}

(This is an easy-to-understand way of implementing a state machine, but there are other ways to do it: Is there a typical state machine implementation pattern? )

Your CSV parsing problem lends itself very well to a state machine, and the resulting code will be very neat and clean. State machines are used for more complex parsing tasks, and are heavily used in things like compilers. Later in your study you will encounter regular expressions -- formally, regular expressions are a compact way of expressing finite state machines that consume characters.

Community
  • 1
  • 1
slim
  • 40,215
  • 13
  • 94
  • 127
  • You probably need to handle whitespace, too. I think: `1, 2 3, 4, 5` should not be considered a valid string. – joop Mar 13 '17 at 12:17
  • @joop "You will need to add some more states and transitions to deal with spaces and tabs, but the principle doesn't change" – slim Mar 13 '17 at 13:10
  • You might simplify `handleStateStart()` by passing a struct containing all that working data -- but I decided not to assume knowledge of structs. – slim Mar 13 '17 at 13:37
  • Also note I've used `sprintf`. In real code use `snprintf`. – slim Mar 13 '17 at 14:00
1

strtok() is the right way to do that. But pass "," (comma) only as delimiter. You can check the resulting string for being zero-length (strlen(tok)==0), which means that you have two consecutive ','. After checking this you simply trim the results, i. e. as described here.

Community
  • 1
  • 1
Amin Negm-Awad
  • 16,582
  • 3
  • 35
  • 50
  • 1
    From 'man strtok' A sequence of two or more contiguous delimiter characters in the parsed string is considered to be a single delimiter. Delimiter characters at the start or end of the string are ignored. Put another way: the tokens returned by strtok() are always non-empty strings. – dmuir Mar 13 '17 at 09:58
  • I assume this is a homework question, so the aim is to write an algorithm, not use a library function. – slim Mar 13 '17 at 10:20
  • 1
    @slim Then why would the OP mention `strtok()`, then say the only problem he/she has with it is because of error checking? Don't think OP has a problem with using library functions for his *"algorithm"*. – RoadRunner Mar 13 '17 at 10:46
  • @dmuir Thanks for the hint. However, checking the first character shouldn't be a big deal. – Amin Negm-Awad Mar 13 '17 at 11:34
  • @AminNegm-Awad My point was that the OP says consecutive delimiters are an error, and that you can't detect this using strtok – dmuir Mar 13 '17 at 17:20
  • @dmuir I understood that. After finding a token he can simply check the very next character for being a delimiter. I understood him that way, that one `','` is a delimiter, therefore it has to be `",,"` as in his example. – Amin Negm-Awad Mar 13 '17 at 20:00
0

You can use regex lib

1) Validate string

[^\d, ]|,[[:blank:]]+,|,{2,}

where
[^\d, ] - found all symbols except digits, commas and white spaces
,[[:blank:]]+,|,{2,} - Validate string 2 or more commas with spaces and tabs without digits between commas

2) Process numbers

\d+

You can try it online here

Denis P.
  • 302
  • 1
  • 2
  • 8
0

A quite efficient straight forward approach:

  1. Remove all the spaces and tabs at one pass. You can avoid the space overhead by doing it in place.
  2. Read the stream of numbers and keep adding them to a linked list. If any invalid number, for example, of length 0, would be detected, simply return a NULL pointer and thus stop processing further.
  3. If pass 2 successfully completed, return the head pointer of that linked list.
Fallen
  • 4,435
  • 2
  • 26
  • 46