I hear from a lot of programmers that strtok may be deprecated in the near future. Some say it already is. Why is it a bad choice? strtok() works well for tokenizing a given string. Does it have anything to do with time and space complexity? The best link I found on the internet was this, but it doesn't satisfy my curiosity. Please suggest alternatives if possible.

Pushan Gupta
  • At least my own argument is that it is misleadingly destructive. It modifies the source string which generally one does not want to do while tokenising. – Vality Jun 02 '17 at 20:20
  • For me, once I got comfortable with using regcomp and regexec I found using regex(3) to be much more useful and powerful. – Deathgrip Jun 02 '17 at 20:24
  • Possible duplicate of [Why is strtok() Considered Unsafe?](https://stackoverflow.com/questions/5999418/why-is-strtok-considered-unsafe) – phuclv Jun 03 '17 at 03:40

2 Answers

> Why is it a bad choice?

The fundamental technique for solving problems by programming is to construct abstractions which can be used reliably to solve sub-problems, and then compose solutions to those sub-problems into solutions to larger problems.

strtok's behaviour works directly against these goals in a variety of ways; it is a poor abstraction that is unreliable because it composes poorly.

The fundamental problem of tokenization is: given a position in a string, give the position of the end of the token beginning at that position. If strtok did only that, it would be great. It would be a clear abstraction, it would not rely on hidden global state, and it would not modify its inputs.
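As a rough illustration of that "position in, position out" idea, here is a minimal sketch built on the standard `strspn` and `strcspn` functions. The names `token_start` and `token_end` are hypothetical helpers invented for this example, and the sketch assumes `pos` never exceeds the length of the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: skip any delimiters at `pos` and return the
   position where the next token starts (or the end of the string). */
size_t token_start(const char *s, size_t pos, const char *delims)
{
    assert(s != NULL && delims != NULL);
    return pos + strspn(s + pos, delims);
}

/* Hypothetical helper: given the position where a token starts, return
   the position one past its last character. The input is only read;
   there is no hidden state and nothing is written into `s`. */
size_t token_end(const char *s, size_t pos, const char *delims)
{
    assert(s != NULL && delims != NULL);
    return pos + strcspn(s + pos, delims);
}
```

A caller would alternate the two calls, stopping when `s[token_start(...)]` is the terminating `'\0'`. Because the position is an ordinary value owned by the caller, two tokenizations can be interleaved freely, even over a string literal.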

To see the limitations of strtok, imagine trying to tokenize a language where we wish to separate tokens by spaces, unless the token is enclosed in " ", in which case we wish to apply a different tokenization rule to the contents of the quoted area, and then pick up with the space separation rule after. strtok composes very poorly with itself, and is therefore only useful for the most trivial of tokenization tasks.

> Does it have to do anything with the time and space complexities?

No.

> Suggest any alternatives if possible.

Lexers are not hard to write; just write one!

Bonus points if you write an immutable lexer. An immutable lexer is a little struct that contains a reference to the string being lexed, the current position of the lexer, and any state needed by the lexer. To extract a token you call a "next token" method, pass in the lexer, and you get back the token and a new lexer. The new lexer can then be used to lex the next token, and you discard the previous lexer if you wish.

The immutable lexer technique is easier to reason about than lexers which modify state. And you can debug them by saving the discarded lexers in a list, and now you have the complete history of tokenization operations open to inspection at once.
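A minimal sketch of such an immutable lexer in C might look like the following. Every name here (`Lexer`, `Token`, `LexResult`, `lexer_new`, `lexer_next`) is hypothetical, and the token rules are an assumption chosen to match the earlier example: tokens are separated by spaces, except that text inside `" "` is returned as a single token. There are no escape sequences, and an unterminated quote simply runs to the end of the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All type and function names below are hypothetical. */
typedef struct {
    const char *src;   /* string being lexed; never modified */
    size_t pos;        /* current position within src */
} Lexer;

typedef struct {
    const char *start; /* points into src; NULL means end of input */
    size_t len;        /* token length; the token is not NUL-terminated */
} Token;

typedef struct {
    Token token;       /* the token that was extracted */
    Lexer next;        /* a new lexer positioned after that token */
} LexResult;

Lexer lexer_new(const char *src)
{
    assert(src != NULL);
    Lexer lx = { src, 0 };
    return lx;
}

/* Takes the lexer by value and returns a fresh one, so the caller's
   lexer is left untouched and can be reused or kept for debugging. */
LexResult lexer_next(Lexer lx)
{
    const char *s = lx.src;
    size_t i = lx.pos + strspn(s + lx.pos, " ");   /* skip spaces */
    LexResult r;
    r.next.src = s;

    if (s[i] == '\0') {                  /* no more tokens */
        r.token.start = NULL;
        r.token.len = 0;
        r.next.pos = i;
    } else if (s[i] == '"') {            /* quoted rule: up to next quote */
        size_t begin = i + 1;
        size_t len = strcspn(s + begin, "\"");
        size_t end = begin + len;
        r.token.start = s + begin;
        r.token.len = len;
        r.next.pos = (s[end] == '"') ? end + 1 : end;
    } else {                             /* plain rule: up to next space */
        size_t len = strcspn(s + i, " ");
        r.token.start = s + i;
        r.token.len = len;
        r.next.pos = i + len;
    }
    return r;
}
```

Since each call returns a new `Lexer` value, keeping the old ones in an array gives the tokenization history described above, and calling `lexer_next` again on an old lexer replays the same token.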

Eric Lippert
  • I never saw lexers this way. Thanks for bringing that up – Pushan Gupta Jun 02 '17 at 20:37
  • *"it would not modify its inputs"* ... which is actually invalid in some common situations; for example, `strtok("hello world", " ")` is *clearly wrong* to a seasoned C programmer, yet to a beginner this seems like it'd be fine and dandy! Nonetheless, it's an easy mistake to make for both. – autistic Jun 03 '17 at 01:28
  • While this answer describes `strtok`'s limitations in comparison to proper lexers, I don't feel that it directly explains why it should be *deprecated* (except for a brief mention that "strtok composes very poorly with itself" without explaining *why* it composes poorly). Also, usually things that are deprecated from the standard library are replaced with something else (which in the case of `strtok` probably would be something like `strtok_r` or `strtok_s`). – jamesdlin Jun 03 '17 at 04:57
  • @jamesdlin Or strsep, or some logic using strcspn and memcpy as building blocks. – Random832 Jun 03 '17 at 05:38
  • @jamesdlin: I encourage you to write an answer that you like better, that we might all benefit from your insights. – Eric Lippert Jun 05 '17 at 16:41

The limitation of strtok(char *str, const char *delim) is that it cannot work on multiple strings simultaneously, because it keeps a static internal pointer recording how far it has parsed (so it is sufficient only when tokenizing one string at a time). A better and safer choice is strtok_r(char *str, const char *delim, char **saveptr), a POSIX function that takes an explicit third argument in which the parse position is saved.
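To make the difference concrete, here is a small sketch of nested tokenization: an outer loop splits on `;` while an inner loop splits each field on `=`. The function names are hypothetical, and the sketch assumes a POSIX system, since `strtok_r` is not part of ISO C. With `strtok`, the inner loop overwrites the single saved position, so the outer loop should stop after the first field; with `strtok_r`, each loop keeps its own position.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: POSIX, needed for strtok_r */
#include <assert.h>
#include <string.h>

/* Hypothetical demo: count ';'-separated fields while also splitting
   each field on '='. Both loops share strtok's one hidden pointer, so
   the inner loop clobbers the outer loop's saved position. */
int count_fields_strtok(char *buf)
{
    assert(buf != NULL);
    int fields = 0;
    for (char *field = strtok(buf, ";"); field != NULL;
         field = strtok(NULL, ";")) {
        fields++;
        for (char *part = strtok(field, "="); part != NULL;
             part = strtok(NULL, "=")) {
            (void)part;  /* a real program would use the key/value here */
        }
    }
    return fields;
}

/* Same logic with strtok_r: each loop has its own save pointer. */
int count_fields_strtok_r(char *buf)
{
    assert(buf != NULL);
    int fields = 0;
    char *outer = NULL;
    char *inner = NULL;
    for (char *field = strtok_r(buf, ";", &outer); field != NULL;
         field = strtok_r(NULL, ";", &outer)) {
        fields++;
        for (char *part = strtok_r(field, "=", &inner); part != NULL;
             part = strtok_r(NULL, "=", &inner)) {
            (void)part;
        }
    }
    return fields;
}
```

Both functions still write `'\0'` bytes into `buf`, so they must be given a modifiable array rather than a string literal; `strtok_r` removes the hidden state but not the destructive behaviour.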

Shashwat Kumar