0

I am trying to write a basic CSV parser in C that generates a dynamic array of char* when given a char* and a separator character, such as a comma:

char **filldoc_parse_csv(char *toparse, char sepchar)
{
    char **strings = NULL;

    char *buffer = NULL;

    int j = 0;
    int k = 1;

    for(int i=0; i < strlen(toparse); i++)
    {
        if(toparse[i] != sepchar)
        {
            buffer = realloc(buffer, sizeof(char)*k);
            strcat(buffer, (const char*)toparse[i]);
            k++;
        }
        else
        {
            strings = realloc(strings, sizeof(buffer)+1);

            strings[j] = buffer;

            free(buffer);

            j++;

        }
    }

    return strings;


}

However, when I call the function as in the manner below:

char **strings = filldoc_parse_csv("hello,how,are,you", ',');

I end up with a segmentation fault:

Program received signal SIGSEGV, Segmentation fault.
__strcat_sse2 () at ../sysdeps/x86_64/multiarch/../strcat.S:166
166 ../sysdeps/x86_64/multiarch/../strcat.S: No such file or directory.
(gdb) backtrace
#0  __strcat_sse2 () at ../sysdeps/x86_64/multiarch/../strcat.S:166
#1  0x000000000040072c in filldoc_parse_csv (toparse=0x400824 "hello,how,are,you", sepchar=44 ',') at filldocparse.c:20
#2  0x0000000000400674 in main () at parsetest.c:6

The problem is centered around allocating enough space for the buffer string. If I have to, I will make buffer a static array, however, I would like to use dynamic memory allocation for this purpose. How can I do it correctly?

Igor
  • 548
  • 3
  • 7
  • 24
  • 2
    Personally I would be using `strtok` to do this, as splitting strings is why it exists in the first place. See http://stackoverflow.com/questions/3889992/how-does-strtok-split-the-string-into-tokens-in-c – Peter M Jan 01 '15 at 21:57
  • 3
    CSV files are *deceptively* simple to parse, and I mean deceptively in the very worst case because there are so many corner-cases and other things that needs special handling. – Some programmer dude Jan 01 '15 at 21:59
  • You can just change the separator in *scanf() functions from white space to commas or whatever else you wish. strsep() and strtok() have some nasty side effects that you may wish to avoid. – technosaurus Jan 01 '15 at 22:07
  • @technosaurus I wouldn't call the side-effects of [`strtok`](http://en.cppreference.com/w/c/string/byte/strtok) *nasty*, one just have to know that it will modify the source string and that it can't handle empty fields. – Some programmer dude Jan 01 '15 at 22:13
  • Just make a backup of the input string that `strtok` will be working with. You'll get a segfault if you're passed a readonly string. I put an example in the answers below. – Cloud Jan 01 '15 at 22:30

2 Answers2

2

You have problems with your memory allocations. When you do e.g. sizeof(buffer) you will get the size of the pointer and not what it points to. That means you will in the first run allocate five bytes (on a 32-bit system), and the next time the function is called you will allocate five bytes again.

There are also many other problems, like you freeing the buffer pointer once you assigned the pointer to strings[j]. The problem with this is that the assignment only copies the pointer and not what it points to, so by freeing buffer you also free strings[j].

Both the above problems will lead to your program having undefined behavior, which is the most common cause of runtime-crashes.

You should also avoid assigning the result of realloc to the pointer you're trying to reallocate, because if realloc fails it will return NULL and you loose the original pointer causing a memory leak.

Some programmer dude
  • 400,186
  • 35
  • 402
  • 621
2

Various problems

  1. strcat(buffer, (const char*)toparse[i]); attempts to changes a char to a string.

  2. strings = realloc(strings, sizeof(buffer)+1); reallocates the same amount of space. sizeof(buffer) is the size of the pointer buffer, not the size of memory it points to.

  3. The calling function has no way to know how many entries in strings. Suggest andding a NULL sentinel.

Minor: better to use size_t rather than int. Use more descriptive names. Do not re-call strlen(toparse) repetitively. Use for(int i=0; toparse[i]; i++) . Make toparse a const char *

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256