Tokenize a string by spaces and join several of the tokens into one string in C

Question

I have some data in a .csv file in which each line belongs to one driver. I want to read each line of the CSV file and split it into fields. The fields of a line are separated from each other by ; . The first field of each line holds the driver's first name and last name, which can take different forms, and the rule I must follow is to split that field into two parts: the last word is the last name and everything before it is the first name. For example, if there are three words, the last word should be assigned as the last name and the two others as the first name. I am using a struct to store each line's data as a driver, and below is the code I wrote to read all file lines and create one struct object for each line.

void convert_file_data_to_struct(struct driver *driversList, int *driverCounter) {

    FILE *mfile;

    int errCnt;

    char line[100];

    char *tok = NULL;

    mfile = fopen(FILE_NAME, "r");

    if(!mfile) {

        printf("File wasn't read properly!\n");

        exit_program();

    }

    while(feof(mfile) == 0) {

        errCnt = 0;

        fgets(line, 99, mfile);

        int i, count;

        for (i=0, count=0; line[i]; i++)

            count += (line[i] == ' ');

        tok = strtok(line, " ");

        if (count > 1) {

            int spaceCounter = 0;

            char name[50];

            while (tok != NULL) {   

                if(spaceCounter <= count){

                    strcat(name, tok);

                    strcat(name, " ");

                    spaceCounter++;

                }

                if (spaceCounter <= count) {

                    tok = strtok(NULL, " "); 

                }  

            }            

        }

        strcpy((driversList + *driverCounter)->firstName, tok);

        while(tok != NULL) {

            ((errCnt+1) > 10) ? exit_program(): errCnt++;

            tok = strtok(NULL, ";");

            if(errCnt == 1) {

                strcpy((driversList + *driverCounter)->lastName, tok);

            } else {

                if(errCnt == 2) {

                    if(tok[0] == 'm' || tok[0] == 'f') {

                        strcpy((driversList + *driverCounter)->gender, tok);

                    } else {

                        exit_program();;

                    }

                } else if (errCnt == 3) {

                    if(atoi(tok)) {

                        (driversList + *driverCounter)->birthYear = atoi(tok);

                    }

                    else {

                        exit_program();

                    }

                }else if (errCnt == 4) {

                    strcpy((driversList + *driverCounter)->automobil, tok);

                } else if (errCnt == 5) {

                    if(atof(tok)) {

                        (driversList + *driverCounter)->firstRecord = atof(tok);

                    } else {

                        exit_program();

                    }

                } else if (errCnt == 6) {

                    if(atof(tok)) {

                        (driversList + *driverCounter)->secondRecord = atof(tok);

                    } else {

                        exit_program();

                    }

                } else if (errCnt == 7) {

                    if(atof(tok)) {

                        (driversList + *driverCounter)->thirdRecord = atof(tok);

                    } else {

                        exit_program();

                    }

                }else if (errCnt == 8){

                    if(atof(tok)) {

                        (driversList + *driverCounter)->fourthRecord = atof(tok);

                    } else {

                        exit_program();

                    }

                } else if (errCnt == 9) {

                    if(atof(tok)) {

                        (driversList + *driverCounter)->fifthRecord = atof(tok);

                    } else {

                        exit_program();

                    }

                }

            }

        }

        (*driverCounter)++;

    }

    fclose(mfile);

}

Below is a sample line from the CSV file:

 Francoise Test Hardy-Test;f;1982;ferrari;72.643;71.987;70.221;79.002;73.737

The behavior of the strtok() function confuses me, and I could not get the person's first name in the way I described above. Is there another way in C to split a string that returns the pieces so they can be accessed by index?

Before going further, you will want to look at [**Why is while ( !feof (file) ) always wrong?**](https://stackoverflow.com/questions/5431941/why-is-while-feoffile-always-wrong). Instead use `while (fgets(line, 100, mfile) != NULL)` (**note:** there is no `- 1` in the size parameter with `fgets()`, it guarantees a *nul-terminated* string within that size). Additionally, please provide [A Minimal, Complete, and Verifiable Example (MCVE)](http://stackoverflow.com/help/mcve). — David C. Rankin, Nov 26 '20 at 05:03
This code would be a lot easier to read with the extraneous blank lines removed. — tadman, Nov 26 '20 at 05:05

David C. Rankin · Accepted Answer · 2020-11-26T07:18:13.987

Alright, let's see if we can get this sorted out. Continuing from the comment, you have now read and understood that controlling the loop with while(feof(mfile) == 0) fails because it attempts to read one more line than is present in the file. Always control your read-loop with the return of the read function itself. (note: before you edited, I called your struct a car_tp (short for car type) -- so we will go with that) For example, if you have an array of car_tp with a max of MAXN elements, you could do:

#define MAXC   1024     /* if you need a constant, #define one (or more) */
#define MAXN     64
#define MAXY      8
#define DELIM ";\n"
...
int main (int argc, char **argv) {
    
    char buf[MAXC];                                     /* buffer to hold each line */
    car_tp cars[MAXN] = {{ .first = "" }};              /* array of MAXN struct */
    size_t ncars = 0;
    ...
    /* while array not full, read each line */
    while (ncars < MAXN && fgets (buf, MAXC, fp)) {
        ...

That will read each line into buf while protecting your struct array bounds.
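
As a self-contained illustration of a read-loop controlled by the return of fgets(), here is a minimal sketch. The helper name read_lines() and its signature are assumptions made for this example and are not part of the code in this answer; it also trims the trailing newline, which the answer instead handles by including '\n' in DELIM.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXC 1024   /* max characters per line, as in the snippet above */

/* Hypothetical helper: read up to 'max' lines from fp into lines[],
 * stripping the trailing newline. Returns the number of lines stored. */
size_t read_lines (FILE *fp, char lines[][MAXC], size_t max)
{
    size_t n = 0;

    assert (fp && lines);

    /* the loop is controlled by the return of fgets() itself */
    while (n < max && fgets (lines[n], MAXC, fp)) {
        lines[n][strcspn (lines[n], "\n")] = 0;     /* trim '\n' if present */
        n++;
    }
    return n;
}
```

Because the loop condition tests fgets() directly, the body only ever runs on a line that was actually read, so no line is processed twice at end-of-file.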

Now on to tokenizing. You have the basic idea. You will have a token counter, and you will keep track of which token you are working on and assign that token to the proper struct member. The only real challenge is your first token, where the name can have multiple space-separated parts and you want to assign only the last part to the last name and the rest to the first name.
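
If you would rather have index-based access to the tokens, as asked in the question, the token counter can double as an array index. The following is only a sketch: split_fields() is a hypothetical helper, not a standard function, and it simply stores the pointers that strtok() returns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split 'line' in place on any character in 'delims',
 * storing a pointer to each token in fields[0..max-1].
 * Returns the number of tokens stored. 'line' is modified by strtok(). */
size_t split_fields (char *line, const char *delims, char **fields, size_t max)
{
    size_t ntok = 0;    /* token counter doubles as the array index */

    assert (line && delims && fields);

    for (char *p = strtok (line, delims); p && ntok < max;
         p = strtok (NULL, delims))
        fields[ntok++] = p;

    return ntok;
}
```

After the call, fields[0] is the name, fields[1] the next field, and so on. Note that strtok() treats a run of consecutive delimiters as a single separator, so an empty field such as the one in "a;;b" is skipped rather than returned as an empty string.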

Rather than counting spaces (which is valid, but awkward and error prone unless you also treat multiple spaces as a single separator), simply use strrchr() to find the last space in the token. The character after that space is the beginning of the last name, so copy from there to your last name member. Then just loop toward the start for as long as the preceding character is a space. Where that loop stops is the end of the first name. So all you need to do is copy from the beginning of the token to that point to your first name, and then nul-terminate the first name.

I prefer to use a switch() statement when determining which of multiple branches to take given a sequence of values rather than a long daisy-chain of if ... else if ... else and so on. You could do:

    /* while array not full, read each line */
    while (ncars < MAXN && fgets (buf, MAXC, fp)) {
        size_t ntok = 0;    /* token count */
        /* tokenize line */
        for (char *p = strtok (buf, DELIM); p; p = strtok (NULL, DELIM)) {
            switch (ntok) { /* switch to fill struct members */
            case 0: {       /* handle name (note brace enclosed block) */
                char *endp = strrchr (p, ' ');          /* find last space in token */
                if (!endp) {    /* validate */
                    fputs ("error: invalid field1 format.\n", stderr);
                    continue;
                }
                if (strlen (endp+1) >= MAXN) {          /* check last name fits */
                    fputs ("error: last too long.\n", stderr);
                    continue;
                }
                strcpy (cars[ncars].last, endp+1);      /* copy last name to struct */
                /* loop toward start, position endp to space after first name */
                while (endp != p && isspace ((unsigned char)*(endp-1)))
                    endp--;
                if (endp - p >= MAXN) {                 /* check first name fits */
                    fputs ("error: first too long.\n", stderr);
                    continue;
                }
                memcpy (cars[ncars].first, p, endp - p);    /* copy first to struct */
                cars[ncars].first[endp - p] = 0;            /* nul-terminate */
                break;
            }

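(note: after looping back toward the beginning of the token, you should check endp == p. That happens when the only space in the token comes before a single name, e.g. " Hardy", which would leave the first name empty. A name with no space at all is already caught by the !endp check. Handling the endp == p case is left for you)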
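
One possible way to fold that check in is to move the name handling into its own function. This is a sketch under assumptions: split_name() and its parameters are hypothetical names chosen for this example, and leading whitespace is deliberately not trimmed so that the endp == name check does the work described in the note.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: split "First [Middle ...] Last" at the last space.
 * Returns 1 on success, 0 if there is no space, either part is empty,
 * or a part does not fit in its buffer. */
int split_name (const char *name, char *first, size_t fsz,
                char *last, size_t lsz)
{
    assert (name && first && last);

    const char *endp = strrchr (name, ' ');     /* last space in the name */
    if (!endp || !endp[1])                      /* no space, or nothing after it */
        return 0;
    if (strlen (endp + 1) >= lsz)               /* last name must fit */
        return 0;

    const char *lastp = endp + 1;               /* start of the last name */
    while (endp != name && isspace ((unsigned char)*(endp - 1)))
        endp--;                                 /* back up over spaces */
    if (endp == name)                           /* nothing before the space(s) */
        return 0;
    if ((size_t)(endp - name) >= fsz)           /* first name must fit */
        return 0;

    memcpy (first, name, endp - name);          /* copy first name */
    first[endp - name] = 0;                     /* nul-terminate */
    strcpy (last, lastp);                       /* copy last name */
    return 1;
}
```

With this, "Francoise Test Hardy-Test" yields first "Francoise Test" and last "Hardy-Test", while both "Hardy" and " Hardy" are rejected.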

That does exactly what was described in the paragraph above, with the addition of validations of the length of each part of the name to make sure it fits in the MAXN number of characters available. From my guess at your struct, the rest of the string values can be handled as:

            case 1:     /* handle model */
                if (strlen (p) >= MAXN) {               /* check model fits */
                    fputs ("error: model too long.\n", stderr);
                    continue;
                }
                strcpy (cars[ncars].model, p);          /* copy to struct */
                break;
            ...
            case 3:
                if (strlen (p) >= MAXN) {               /* check manufacturer fits */
                    fputs ("error: manufacturer too long.\n", stderr);
                    continue;
                }
                strcpy (cars[ncars].manuf, p);          /* copy to struct */
                break;
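
Since the check-length-then-copy pattern repeats for every string member, it could be factored into a small function. This is only a sketch: copy_field() and its 'what' parameter are hypothetical and do not appear in the code of this answer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy 'src' into 'dst' (capacity 'dstsz') only if it
 * fits, including the nul-terminator. 'what' names the field in the error
 * message. Returns 1 on success, 0 otherwise (dst is left untouched). */
int copy_field (char *dst, size_t dstsz, const char *src, const char *what)
{
    assert (dst && src && what);

    size_t len = strlen (src);
    if (len >= dstsz) {
        fprintf (stderr, "error: %s too long.\n", what);
        return 0;
    }
    memcpy (dst, src, len + 1);     /* +1 copies the nul-terminator */
    return 1;
}
```

A case could then shrink to a call such as copy_field (cars[ncars].model, MAXN, p, "model") followed by a test of its return.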

For your floating-point values, you never want to use atof() in practice. It has ZERO error reporting and will happily return 0 for any non-numeric string you give it, such as atof ("my cow"). Always use strtof() or strtod() in practice with validation of both the digits converted and that errno was not set indicating under/overflow occurred during conversion. Since you will make multiple conversions, creating a simple function to handle the validation saves repeating the code for every floating-point value converted. For conversion to double you could do:

/* convert string 'nptr' to double stored at `dbl` with error-check,
 * returns 1 on success, 0 otherwise.
 */ 
int todouble (const char *nptr, double *dbl)
{
    char *endptr = (char*)nptr;         /* endptr to use with strtod() */
    errno = 0;                          /* zero errno */
    
    *dbl = strtod (nptr, &endptr);      /* convert to double, assign */
    if (endptr == nptr) {   /* validate digits converted */
        fputs ("error: todouble() no digits converted.\n", stderr);
        return 0;       /* return failure */
    }
    else if (errno) {   /* validate no under/overflow */
        fputs ("error: todouble() under/overflow detected.\n", stderr);
        return 0;       /* return failure */
    }
    
    return 1;   /* return success */
}

(note: that provides the minimal validation to ensure a valid conversion)
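
If you want to go beyond the minimal validation, one further check is to reject input with trailing junk after the number, such as "72.643abc", which strtod() converts as 72.643 without complaint. A possible sketch follows; todouble_strict() is a hypothetical variant written for this example, and whether trailing whitespace should be accepted is an assumption you may want to change.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stricter variant of todouble(): also rejects trailing
 * non-whitespace after the number. Returns 1 on success, 0 otherwise. */
int todouble_strict (const char *nptr, double *dbl)
{
    char *endptr;
    double val;

    assert (nptr && dbl);

    errno = 0;                      /* zero errno before the call */
    val = strtod (nptr, &endptr);

    if (endptr == nptr)             /* no digits converted */
        return 0;
    if (errno == ERANGE)            /* under/overflow */
        return 0;
    while (isspace ((unsigned char)*endptr))    /* allow trailing whitespace */
        endptr++;
    if (*endptr)                    /* anything else is junk */
        return 0;

    *dbl = val;                     /* assign only on success */
    return 1;
}
```

Unlike todouble() above, this version leaves *dbl untouched when the conversion fails, and it leaves the error message to the caller.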

Then to convert and assign each of your floating-point values, you could do:

            case 4: {       /* handle double values (note brace enclosed block) */
                double d;
                if (!todouble (p, &d))                  /* attempt conversion */
                    continue;
                cars[ncars].d1 = d;                     /* assign on success */
                break;
            }

(note: in order to declare variables within a case, you must enclose the code for that case in braces to create a separate code-block. A case label must be followed by a statement, and before C23 a declaration is not a statement. The case labels of a switch() also do not introduce a new scope by themselves, so the braces keep each d local to its own case)
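
Since the cases for the floating-point values differ only in the destination member, one possible variation (not what the code in this answer does) is to stack the case labels on a single brace-enclosed block and pick the destination from an array of pointers. The struct rec_tp and the function store_record() below are hypothetical stand-ins written for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

typedef struct {            /* hypothetical stand-in for the five doubles */
    double d1, d2, d3, d4, d5;
} rec_tp;

/* Store token number 'ntok' (4..8) into the matching member of 'r'.
 * Returns 1 on success, 0 on a bad index or failed conversion. */
int store_record (rec_tp *r, size_t ntok, const char *tok)
{
    assert (r && tok);

    switch (ntok) {
    case 4: case 5: case 6: case 7: case 8: {   /* braces open a block scope */
        double *dst[] = { &r->d1, &r->d2, &r->d3, &r->d4, &r->d5 };
        char *endptr;
        double d;

        errno = 0;
        d = strtod (tok, &endptr);
        if (endptr == tok || errno)     /* no digits, or under/overflow */
            return 0;
        *dst[ntok - 4] = d;             /* token 4 maps to d1, 8 to d5 */
        return 1;
    }
    default:                            /* not one of the double fields */
        return 0;
    }
}
```

The trade-off is less repetition at the cost of a little indirection; the explicit one-case-per-member layout is arguably easier to follow when you are starting out.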

That's basically all you need, aside from the counters tracking the number of elements in your struct array. Putting together a short example, you could do:

#include <stdio.h>
#include <stdlib.h>     /* for strtod() */
#include <string.h>     /* for strtok(), strrchr(), strlen() */
#include <ctype.h>      /* for isspace() */
#include <errno.h>      /* for errno */

#define MAXC   1024     /* if you need a constant, #define one (or more) */
#define MAXN     64
#define MAXY      8
#define DELIM ";\n"

typedef struct {        /* struct to fit data */
    char first[MAXN], last[MAXN], model[MAXN], year[MAXY], manuf[MAXN];
    double d1, d2, d3, d4, d5;
} car_tp;

/* convert string 'nptr' to double stored at `dbl` with error-check,
 * returns 1 on success, 0 otherwise.
 */ 
int todouble (const char *nptr, double *dbl)
{
    char *endptr = (char*)nptr;         /* endptr to use with strtod() */
    errno = 0;                          /* zero errno */
    
    *dbl = strtod (nptr, &endptr);      /* convert to double, assign */
    if (endptr == nptr) {   /* validate digits converted */
        fputs ("error: todouble() no digits converted.\n", stderr);
        return 0;       /* return failure */
    }
    else if (errno) {   /* validate no under/overflow */
        fputs ("error: todouble() under/overflow detected.\n", stderr);
        return 0;       /* return failure */
    }
    
    return 1;   /* return success */
}

/* simple print car function taking pointer to struct */
void prncar (car_tp *car)
{
    printf ("\nfirst : %s\n"
            "last  : %s\n"
            "model : %s\n"
            "year  : %s\n"
            "manuf : %s\n"
            "d1    : %lf\n"
            "d2    : %lf\n"
            "d3    : %lf\n"
            "d4    : %lf\n"
            "d5    : %lf\n", 
            car->first, car->last, car->model, car->year, car->manuf,
            car->d1, car->d2, car->d3, car->d4, car->d5);
}

int main (int argc, char **argv) {
    
    char buf[MAXC];                                     /* buffer to hold each line */
    car_tp cars[MAXN] = {{ .first = "" }};              /* array of MAXN struct */
    size_t ncars = 0;
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
    
    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }
    
    /* while array not full, read each line */
    while (ncars < MAXN && fgets (buf, MAXC, fp)) {
        size_t ntok = 0;    /* token count */
        /* tokenize line */
        for (char *p = strtok (buf, DELIM); p; p = strtok (NULL, DELIM)) {
            switch (ntok) { /* switch to fill struct members */
            case 0: {       /* handle name (note brace enclosed block) */
                char *endp = strrchr (p, ' ');          /* find last space in token */
                if (!endp) {    /* validate */
                    fputs ("error: invalid field1 format.\n", stderr);
                    continue;
                }
                if (strlen (endp+1) >= MAXN) {          /* check last name fits */
                    fputs ("error: last too long.\n", stderr);
                    continue;
                }
                strcpy (cars[ncars].last, endp+1);      /* copy last name to struct */
                /* loop toward start, position endp to space after first name */
                while (endp != p && isspace ((unsigned char)*(endp-1)))
                    endp--;
                if (endp - p >= MAXN) {                 /* check first name fits */
                    fputs ("error: first too long.\n", stderr);
                    continue;
                }
                memcpy (cars[ncars].first, p, endp - p);    /* copy first to struct */
                cars[ncars].first[endp - p] = 0;            /* nul-terminate */
                break;
            }
            case 1:     /* handle model */
                if (strlen (p) >= MAXN) {               /* check model fits */
                    fputs ("error: model too long.\n", stderr);
                    continue;
                }
                strcpy (cars[ncars].model, p);          /* copy to struct */
                break;
            case 2:
                if (strlen (p) >= MAXY) {               /* check year fits */
                    fputs ("error: year too long.\n", stderr);
                    continue;
                }
                strcpy (cars[ncars].year, p);           /* copy to struct */
                break;
            case 3:
                if (strlen (p) >= MAXN) {               /* check manufacturer fits */
                    fputs ("error: manufacturer too long.\n", stderr);
                    continue;
                }
                strcpy (cars[ncars].manuf, p);          /* copy to struct */
                break;
            case 4: {       /* handle double values (note brace enclosed block) */
                double d;
                if (!todouble (p, &d))                  /* attempt conversion */
                    continue;
                cars[ncars].d1 = d;                     /* assign on success */
                break;
            }
            case 5: {                           /* ditto */
                double d;
                if (!todouble (p, &d))
                    continue;
                cars[ncars].d2 = d;
                break;
            }
            case 6: {                           /* ditto */
                double d;
                if (!todouble (p, &d))
                    continue;
                cars[ncars].d3 = d;
                break;
            }
            case 7: {                           /* ditto */
                double d;
                if (!todouble (p, &d))
                    continue;
                cars[ncars].d4 = d;
                break;
            }
            case 8: {                           /* ditto */
                double d;
                if (!todouble (p, &d))
                    continue;
                cars[ncars].d5 = d;
                break;
            }   /* provide default case to notify on too many tokens */
            default: fputs ("error: tokens exceed struct members.\n", stderr);
                break;
            }
            ntok += 1;                                  /* update token count */
        }
        ncars += 1;                                     /* update car count */
    }
    
    if (fp != stdin)   /* close file if not stdin */
        fclose (fp);
    
    for (size_t i = 0; i < ncars; i++)                  /* output results */
        prncar (&cars[i]);
    
    return 0;
}

Example Input File

To test the code, I created a file from your sample input adding another record to exercise the simple first-name/last-name case, e.g.

$ cat dat/ferrari.txt
Francoise Test Hardy-Test;f;1982;ferrari;72.643;71.987;70.221;79.002;73.737
Other Guys;Dino;1980;ferrari;72.643;71.987;70.221;79.002;73.737

Example Use/Output

With the prncar() function included, it would produce the following output:

$ ./bin/cartok dat/ferrari.txt

first : Francoise Test
last  : Hardy-Test
model : f
year  : 1982
manuf : ferrari
d1    : 72.643000
d2    : 71.987000
d3    : 70.221000
d4    : 79.002000
d5    : 73.737000

first : Other
last  : Guys
model : Dino
year  : 1980
manuf : ferrari
d1    : 72.643000
d2    : 71.987000
d3    : 70.221000
d4    : 79.002000
d5    : 73.737000

There is quite a bit to digest, so take your time to understand each line of code. If you have questions, just drop a comment below and I'm happy to help further.
