Splitting string in C by blank spaces, besides when said blank space is within a set of quotes

Question

I'm writing a simple Lisp in C without any external dependencies (please do not link the BuildYourOwnLisp), and I'm following this guide as a basis to parse the Lisp. It describes two steps in tokenising an S-exp, those steps being:

Put spaces around every parenthesis
Split on white space

The first step is easy enough: I wrote a trivial function that replaces certain substrings with other substrings. The second step is where I'm having problems. The article only uses the string "Lisp" in its examples of S-exps; if I were to use strtok() to blindly split on whitespace, any string in an S-exp that contained a space would be fragmented and interpreted incorrectly by my Lisp. Obviously, a language limited to single-word strings isn't very useful.

How would I write a function that splits a string on whitespace, except when the text is between two double quotes?

I've tried using regex, but from what I can see of the POSIX regex.h library and PCRE, just extracting the matches would be incredibly laborious in terms of the amount of auxiliary code I'd have to write, which would only serve to bloat my codebase. Besides, one of my goals with this project is to use only ANSI C, or, if need be, C99, solely for the sake of portability. Fiddling with the POSIX library and the Win32 API would just fatten my code and make moving my Lisp around a nightmare.

When researching this problem I came across this Stack Overflow answer, but the accepted answer only sends the tokenised string to stdout, which isn't useful for me; I'd ideally have the tokens in a char** so that I could then parse them into useful in-memory data structures.

As well as this, the accepted answer on the aforementioned SO question is written specifically for my problem. Ideally, I'd have a general-purpose function that tokenises a string, except where a substring sits between two occurrences of some character x. This isn't a huge deal; it's just that I'd like my codebase to be clean and composable.

Even for a simple lisp-like language you need a more advanced tokenizer than your two-step simple mechanism, if you want to be able to handle different types of tokens (like strings, numbers, symbols, as well as the parentheses themselves). For example it needs to be aware of the context (is it tokenizing a string? A number? Etc.) — Some programmer dude, Jul 02 '21 at 21:03
I've only described part of the parsing process. An enum consisting of LIST_START, STRING, FLOAT, INTEGER, CHARACHTER, FUNCTION, NIL would be in my C program, as well as a struct that held an instance of that enum describing its type, as well as the actual value (within a union) and a pointer to the next "node". — Bithov Vinu, Jul 02 '21 at 21:40
After converting the string S-exp into a list of tokens (opening bracket, closing bracket, strings delimited by quotes, integers/floats/characters/symbols split by whitespace), each token would be passed to a function that determines its type (the types being defined in the previously mentioned enum), and a struct instance would be created around this information. Each node would be linked to the next, as is typical in a linked list. I see no reason why a method like the one described couldn't work with strings, numbers, symbols and parentheses, though I could be wrong; please do correct me. – Bithov Vinu, Jul 02 '21 at 21:40
Where I do think my approach is naive is in terms of parsing speed. The approach I've taken seems to be unusual, as every other S-expression parser I've seen written in C iterates over characters and parses them incrementally. I'm not sure about the pros and cons of either method over the other, but I would like to see which is more frugal in terms of different resources (memory usage, code complexity, and bulkiness), because frugality and nimbleness (is that a word?) are major goals of my project. – Bithov Vinu, Jul 02 '21 at 21:45

Erdal Küçük · Answer 1 · 2021-07-02T22:34:17.180

1

You have two delimiters: the space and double quotes.

You can use the strcspn function for that (the cppreference page for strcspn includes an example).

Iterate over the string and look for the delimiters (space and quotes). strcspn returns the number of characters before the first delimiter, so the character at that offset tells you which delimiter was found (or that the end of the string was reached). If a space was found, continue looking for both. If a double quote was found, the delimiter set changes from " \"" (space and quotes) to "\"" (double quotes only). When you then hit the quotes again, change the delimiter set back to " \"" (space and quotes).
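To make the return value of strcspn concrete, here is a minimal sketch. The helper name run_length and its in_quotes flag are assumptions made for illustration; only strcspn itself comes from the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the next run of characters that contains
 * no delimiter. Outside quotes the delimiters are space and double quote;
 * inside quotes only the double quote ends the run. */
size_t run_length(const char *s, int in_quotes)
{
    assert(s != NULL);
    return strcspn(s, in_quotes ? "\"" : " \"");
}
```

For the input `This "is an" Example.`, calling run_length at the start with in_quotes set to 0 gives 4 (the length of `This`), and the character at that offset is the space. Calling it just after the opening quote with in_quotes set to 1 gives 5 (the length of `is an`).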

Based on your comment:

Let's say you have a string like

This is an Example.

The output would be

This
is
an
Example.

If the string looked like

This "is an" Example.

The output would be

This
is an
Example.
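The approach above could be sketched roughly as follows, returning the tokens in a char** as the question asks. This is only one possible reading of the idea, written in ANSI C: the names split_quoted and free_tokens, the parameters, and the choice to strip the quote characters from the output are all assumptions, not part of the original answer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: frees an array returned by split_quoted(). */
void free_tokens(char **tokens)
{
    size_t i;
    if (tokens == NULL) return;
    for (i = 0; tokens[i] != NULL; i++) free(tokens[i]);
    free(tokens);
}

/* Hypothetical helper: splits s on delim, except between two quote
 * characters. Returns a NULL-terminated array of malloc'd strings with the
 * quote characters stripped, or NULL if allocation fails. The token count
 * is stored in *count when count is not NULL. */
char **split_quoted(const char *s, char delim, char quote, size_t *count)
{
    char both[3], only_quote[2];
    char **tokens, **grown;
    size_t n = 0, cap = 8, len;
    int in_quotes = 0;

    assert(s != NULL);
    both[0] = delim; both[1] = quote; both[2] = '\0';
    only_quote[0] = quote; only_quote[1] = '\0';

    tokens = malloc(cap * sizeof *tokens);
    if (tokens == NULL) return NULL;
    tokens[0] = NULL;

    for (;;) {
        /* Inside quotes only the quote ends a token; outside, both do. */
        len = strcspn(s, in_quotes ? only_quote : both);

        /* Skip empty runs between delimiters, but keep an empty quoted
         * string such as "". */
        if (len > 0 || in_quotes) {
            if (n + 2 > cap) {          /* room for token + NULL terminator */
                grown = realloc(tokens, 2 * cap * sizeof *tokens);
                if (grown == NULL) { free_tokens(tokens); return NULL; }
                tokens = grown;
                cap *= 2;
            }
            tokens[n] = malloc(len + 1);
            if (tokens[n] == NULL) { free_tokens(tokens); return NULL; }
            memcpy(tokens[n], s, len);
            tokens[n][len] = '\0';
            tokens[++n] = NULL;
        }

        s += len;
        if (*s == '\0') break;                    /* end of input */
        if (*s == quote) in_quotes = !in_quotes;  /* switch delimiter set */
        s++;                                      /* step over the delimiter */
    }

    if (count != NULL) *count = n;
    return tokens;
}
```

Calling split_quoted with the second example string, a space as delim and a double quote as quote yields the three tokens `This`, `is an` and `Example.`. Two caveats with this sketch: because the quotes are stripped, the caller can no longer tell a quoted string from a bare symbol, so a Lisp tokenizer would probably want to keep them or record a flag per token; and an unterminated quote simply runs to the end of the input rather than being reported as an error.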

edited Jul 02 '21 at 22:34

answered Jul 02 '21 at 21:45


For what this achieves, this seems like it would be immensely resource intensive - I should clarify I mean in relative terms. As an aside, do you mean splitting with strtok or should I implement a string splitting solution that isn't as frankly dangerous as strtok is? – Bithov Vinu Jul 02 '21 at 22:13
I don't know what you mean. The question was how to split a string based on spaces but only if not in double quotes. I've provided a solution for that. – Erdal Küçük Jul 02 '21 at 22:35
@BithovVinu: Your options are described [here](https://stackoverflow.com/a/7219504/102937). – Robert Harvey Jul 02 '21 at 22:38
