
I have been attempting to parse a string into separate tokens for a command-line interface in a project of mine. I wrote this function to do it:

char **string_parser(char *input) {
    char **output = (char **) malloc(sizeof(input));
    int word_num = 0;
    int word_index = 0;

    for(int i = 0; i < strlen(input); i++) {
        if(input[i] == ' ') {
            output[word_num][word_index] = '\0';
            word_index = 0;
            word_num++;
            continue;
        }

        if(input[i] == '\0') {
            output[word_num][word_index] = '\0';
            break;
        }

        output[word_num][word_index] = input[i];
        word_index++;
    }

    return output;
}

However, it segfaults after one iteration.

I call the function with: char *input = "this is a parser test."; Any help is greatly appreciated.

Vlad from Moscow
  • You'll need to compute `word_num` before you allocate. You can't just start jamming in random stuff into an improperly allocated array. This might require a two-pass approach: First to calculate, then to allocate and copy. – tadman Jun 18 '23 at 19:37
  • PSA: This is just a really clunky implementation of `strtok`, an approach you probably want to use instead since it involves zero copying. – tadman Jun 18 '23 at 19:37

2 Answers

This memory allocation

char **output = (char **) malloc(sizeof(input));

is incorrect. The expression sizeof(input) is the size of the pointer input, that is sizeof(char *), so the call allocates memory for only one object of type char *.

Moreover, the allocated memory is uninitialized. So at least this statement (and similar statements)

output[word_num][word_index] = '\0';

invokes undefined behavior.

You need to allocate as many pointers as there are words in the source string. And for each word you also need to allocate a character array to store the extracted word.
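One way to meet both requirements is the two-pass approach mentioned in the comments: count the words first, then allocate and copy. The following is only a sketch under stated assumptions: the names count_words, split_words, and free_words are hypothetical (they are not from the question), only the space character is treated as a delimiter, and a NULL sentinel marks the end of the array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Pass 1: count the space-separated words in input. */
static size_t count_words(const char *input)
{
    size_t count = 0;
    while (*input != '\0') {
        input += strspn(input, " ");      /* skip leading spaces */
        if (*input != '\0') {
            ++count;
            input += strcspn(input, " "); /* skip over the word itself */
        }
    }
    return count;
}

/* Pass 2: allocate the pointer array, then one buffer per word.
   Returns a NULL-terminated array, or NULL if an allocation fails. */
char **split_words(const char *input)
{
    assert(input != NULL);

    size_t n = count_words(input);
    char **words = malloc((n + 1) * sizeof *words);
    if (words == NULL)
        return NULL;

    for (size_t i = 0; i < n; ++i) {
        input += strspn(input, " ");
        size_t len = strcspn(input, " ");

        words[i] = malloc(len + 1);       /* +1 for the terminator */
        if (words[i] == NULL) {
            while (i > 0)
                free(words[--i]);         /* undo the earlier allocations */
            free(words);
            return NULL;
        }
        memcpy(words[i], input, len);
        words[i][len] = '\0';
        input += len;
    }
    words[n] = NULL;                      /* sentinel marks the end */
    return words;
}

/* Frees everything split_words() allocated. */
void free_words(char **words)
{
    if (words == NULL)
        return;
    for (char **p = words; *p != NULL; ++p)
        free(*p);
    free(words);
}
```

The caller walks the array until it reaches the NULL sentinel and then passes the array to free_words.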

And calling the function strlen in the for loop

for(int i = 0; i < strlen(input); i++) {

is redundant and inefficient.
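A loop over a C string can test for the terminating '\0' directly instead of calling strlen on every iteration. Note also that in the original loop the condition i < strlen(input) means input[i] is never '\0' inside the body, so the branch that terminates the last word can never run. The sketch below illustrates the loop form only; the function count_spaces is a hypothetical example, not part of the question's code.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the spaces in input, scanning the string exactly once.
   The loop condition tests for the terminator directly, so no
   call to strlen() is needed. */
size_t count_spaces(const char *input)
{
    assert(input != NULL);

    size_t spaces = 0;
    for (size_t i = 0; input[i] != '\0'; ++i) {
        if (input[i] == ' ')
            ++spaces;
    }
    return spaces;
}
```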

Note that the function parameter should be declared with the const qualifier because the source string is not changed within the function.

Here is a demonstration program that shows one approach to implementing the function. The function has a drawback: it does not check whether the memory allocations were successful. You will need to do that yourself. The last pointer in the allocated array of pointers is set to NULL so that the caller can determine the number of extracted words.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char ** string_parser( const char *input ) 
{
    char **output = NULL;
    size_t word_num = 0;

    for (const char *delim = " \t"; *input != '\0'; )
    {
        // skip any leading delimiters
        input += strspn( input, delim );

        if (*input)
        {
            // length of the current word
            size_t n = strcspn( input, delim );

            // grow the array of pointers by one element
            output = realloc( output, ( word_num + 1 ) * sizeof( char * ) );

            // allocate a buffer for the word plus the terminating zero
            output[word_num] = malloc( n + 1 );

            memcpy( output[word_num], input, n );
            output[word_num][n] = '\0';
            ++word_num;

            input += n;
        }
    }

    // reserve one more element for the NULL sentinel
    output = realloc( output, ( word_num + 1 ) * sizeof( char * ) );

    output[word_num] = NULL;

    return output;
}

int main( void )
{
    const char *input = "this is a parser test.";

    char **output = string_parser( input );

    for (char **p = output; *p != NULL; ++p)
    {
        puts( *p );
    }

    for (char **p = output; *p != NULL; ++p)
    {
        free( *p );
    }
    free( output );
}

The program output is

this
is
a
parser
test.
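The allocation checks left out of the demonstration could be added along the following lines. This is a minimal sketch, assuming a hypothetical helper named append_word that is not part of the program above. It keeps the result of realloc in a temporary pointer so that the existing array is not lost if the call fails.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Appends a copy of the first n characters of word to the array *list,
   which currently holds *count pointers. Returns 1 on success and 0 on
   allocation failure; on failure *list and *count are left unchanged. */
int append_word(char ***list, size_t *count, const char *word, size_t n)
{
    assert(list != NULL && count != NULL && word != NULL);

    char *copy = malloc(n + 1);
    if (copy == NULL)
        return 0;
    memcpy(copy, word, n);
    copy[n] = '\0';

    /* Assign realloc's result to a temporary: if it returns NULL the
       original block is still valid and must not be lost. */
    char **tmp = realloc(*list, (*count + 1) * sizeof **list);
    if (tmp == NULL) {
        free(copy);
        return 0;
    }

    tmp[*count] = copy;
    *list = tmp;
    ++*count;
    return 1;
}
```

When the helper returns 0, the caller still owns every pointer stored so far and can free them before reporting the error.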
Vlad from Moscow

First of all, I suggest that you check the common approaches to parsing command-line parameters in C programs, for example the Stack Overflow question "Parsing command-line arguments in C".

Next, concerning the bugs in your code: you malloc() some bytes and then treat them as pointers when you write through them in this line:

output[word_num][word_index] = '\0';

Because the memory returned by malloc() is uninitialized, those bytes form arbitrary addresses, and an attempt to write to them typically causes a segmentation fault.

I guess the idea behind your code is to split the string into words. To do that, I suggest using a standard function such as strtok. If you absolutely must implement your own tokenizer, you have to think carefully about how the calling code will free the memory that the tokenizer allocates for the tokens.
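As an illustration of the standard-function route, here is a minimal sketch built on strtok. The wrapper name tokenize and its max_tokens limit are assumptions made for this example. Because strtok writes terminators into the buffer it scans, the input must be a writable array and not a string literal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits line in place on spaces and tabs using strtok().
   Stores up to max_tokens pointers in tokens and returns how many were
   stored. The pointers refer to memory inside line, so nothing is
   allocated and nothing needs to be freed. */
size_t tokenize(char *line, char *tokens[], size_t max_tokens)
{
    assert(line != NULL && tokens != NULL);

    size_t count = 0;
    for (char *tok = strtok(line, " \t");
         tok != NULL && count < max_tokens;
         tok = strtok(NULL, " \t")) {
        tokens[count++] = tok;
    }
    return count;
}
```

A caller would declare the input as char line[] = "this is a parser test."; and pass a local array of char * for the tokens. The question of who frees the tokens does not arise, since they live inside the caller's own buffer.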

Alexey Adadurov