I have some user input following this format:

Playa Raco#path#5#39.244|-0.257#0-23

The # here acts as a separator, and the | is also a separator for the latitude and longitude. I would like to extract this information. Note that the strings could have spaces. I tried using the %[^\n]%*c formatter with scanf and adding # and |, but it doesn't work because it matches the whole line.

I would like to keep this as simple as possible. I know that I could do this by reading each char, but I'm curious about best practices and whether there is a scanf or similar alternative for this.

Norhther
  • A common approach is to use [strtok](https://man7.org/linux/man-pages/man3/strtok.3.html) – kaylum Jun 30 '21 at 23:43
  • If you are curious as to why something didn't work, then maybe show us that something? – HAL9000 Jun 30 '21 at 23:45
  • @HAL9000 I'm not curious about why it didn't work. `[^\n]%*c` is going to match the whole line. I'm asking about best practices. If you want, I can add a working function that extracts the tokens doing a linear search, but that's not what I'm asking. – Norhther Jun 30 '21 at 23:46
  • I hate to recommend `scanf`, but `%[^#]#%[^#]#%[^#]#%[^#]#%[^|]|%[^#]#%s` ought to work. (And just looking at that ghastly mess -- I can't believe I just typed it -- reminds me why I hate to recommend `scanf`!) – Steve Summit Jun 30 '21 at 23:46
  • @SteveSummit Well... That works, not sure if it's best practices tho, but I appreciate it! Thanks! – Norhther Jun 30 '21 at 23:51
  • Or, alternatively, `%[^#]#%[^#]#%d#%lf|%lf#%s`. – Steve Summit Jun 30 '21 at 23:53
  • But, "best practice"? Depends who you ask. Some people like `scanf`, and for them, the jawbreaker formats I've just constructed would indeed be, if not "best", then at least acceptable practice. But if your heart's not set on `scanf`, then kaylum is right, `strtok` would be an excellent starting point for a better practice. – Steve Summit Jun 30 '21 at 23:55
  • @SteveSummit for this case, I'm gonna say that your first comment was not a "good practice" (from my point of view) because of its verbose nature. I'm definitely going to check `strtok`, but I would also accept your second comment. – Norhther Jun 30 '21 at 23:58
  • Me, I'd use a variation on the [`getwords` function discussed here](https://www.eskimo.com/~scs/cclass/notes/sx10h.html), but unfortunately there's [nothing like it in the standard library](https://stackoverflow.com/questions/49372173). – Steve Summit Jul 01 '21 at 00:00
  • Another option would be a regular expression parser. – Steve Summit Jul 01 '21 at 00:02
  • @Norhther, I was trying to comment on the fact that you had tried adding `#` and `|` without telling us how. Sorry for the confusion. – HAL9000 Jul 01 '21 at 00:35
  • @SteveSummit nothing wrong with the approach, but better `fgets()` then `sscanf()` rather than `scanf()` alone. Using a `scanf()`/`sscanf()` approach has the benefit of not modifying the original string. `strcspn()`/`strspn()` can be used to do the same thing as `strtok()` but without modifying the original string. You may as well write up your solution as an answer, or we need to find a dupe and close. – David C. Rankin Jul 01 '21 at 02:15
  • @DavidC.Rankin I said "`scanf`", but I meant "any member of the *scanf family" -- and, yes, of course, `sscanf` is the preferred choice here. I wouldn't have had time to write up an answer tonight, so I'm glad you did. – Steve Summit Jul 01 '21 at 04:29
  • @SteveSummit - glad to do it, but I just wanted to make sure I gave you first shot if you had the time `:)` – David C. Rankin Jul 01 '21 at 04:35
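
The `strtok` route suggested in the comments above could be sketched roughly as follows. This is only an illustration: `struct place`, its field names, the 32-byte buffers and `parse_place_strtok()` are assumptions made for the sketch, not something from the question. Numeric tokens are converted with `strtol()`/`strtod()` without validation to keep it short.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type for this sketch; names and sizes are assumptions. */
struct place {
    char   name[32];
    char   kind[32];
    int    num;
    double lat, lon;
    char   hours[32];
};

/* Split `line` on '#' and '|' with strtok().
 * strtok() overwrites each separator with '\0', so `line` must be a
 * modifiable array, never a string literal.
 * Returns the number of tokens seen; 6 means all fields were stored. */
int parse_place_strtok(char *line, struct place *p)
{
    int n = 0;

    for (char *tok = strtok(line, "#|"); tok != NULL; tok = strtok(NULL, "#|")) {
        switch (n) {
        case 0: snprintf(p->name,  sizeof p->name,  "%s", tok); break;
        case 1: snprintf(p->kind,  sizeof p->kind,  "%s", tok); break;
        case 2: p->num = (int)strtol(tok, NULL, 10);            break;
        case 3: p->lat = strtod(tok, NULL);                     break;
        case 4: p->lon = strtod(tok, NULL);                     break;
        case 5: snprintf(p->hours, sizeof p->hours, "%s", tok); break;
        default: return n + 1;          /* more tokens than expected */
        }
        n++;
    }
    return n;
}
```

Two caveats with this approach: `strtok()` treats `#` and `|` interchangeably here, and it silently skips empty fields (`a##b` yields two tokens, not three), so a count of 6 does not prove the separators were in the expected places.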

1 Answer

As mentioned in the comments, there are many ways you can parse the information from the string. You can walk a pair of pointers down the string, testing each character and taking the appropriate action. You can use strtok(), but note that strtok() modifies the original string, so it cannot be used on a string literal. You can use sscanf() to parse the values from the string. Or you can use any combination of strcspn(), strspn(), strchr(), etc. and then manually copy each field between a start and end pointer.

However, your question also stipulates "I would like to keep this as simple as possible...", and that points directly to sscanf(). You simply need to validate the return value and you are done. For example, you could do:

#include <stdio.h>

#define MAXC 16     /* adjust as necessary */

int main (void) {
    
    const char *str = "Playa Raco#path#5#39.244|-0.257#0-23";
    char name[MAXC], path[MAXC], last[MAXC];
    int num;
    double lat, lon;
    
    if (sscanf (str, "%15[^#]#%15[^#]#%d#%lf|%lf#%15[^\n]",
                name, path, &num, &lat, &lon, last) == 6) {
        printf ("name : %s\npath : %s\nnum  : %d\n"
                "lat  : %f\nlon  : %f\nlast : %s\n",
                name, path, num, lat, lon, last);
    }
    else
        fputs ("error: parsing values from str.\n", stderr);
}

(note: the %[..] conversion does not consume leading whitespace, so if there is a possibility of leading whitespace or a space following '#' before a string conversion, include a space in the format string, e.g. " %15[^#]# %15[^#]#%d#%lf|%lf# %15[^\n]")

Each string portion of the input is stored in a 16-character array. Looking at the format string, you will note that the read of each string is limited to 15 characters (plus the nul-terminating character) to ensure you do not attempt to store more characters than your arrays can hold (that would invoke Undefined Behavior). Since there are six conversions requested, you validate the conversion by ensuring the return is 6.
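
To see why checking the return matters, the sketch below wraps the same format string in a small helper so it can be tried against malformed input. `parse_line()` is a hypothetical name introduced only for this illustration.

```c
#include <stdio.h>

#define MAXC 16     /* same buffer size as in the example above */

/* Hypothetical wrapper around the sscanf() call from the example.
 * Returns the number of successful conversions; 6 means every field
 * was parsed, anything else means the line did not match the format. */
int parse_line (const char *str, char name[MAXC], char path[MAXC],
                int *num, double *lat, double *lon, char last[MAXC])
{
    return sscanf (str, "%15[^#]#%15[^#]#%d#%lf|%lf#%15[^\n]",
                   name, path, num, lat, lon, last);
}
```

With the original input this returns 6. If the first field is longer than 15 characters, the field width stops the read after 15, the literal `#` in the format then fails to match the next character, and the call returns 1. If the `|` between latitude and longitude is missing, it returns 4. In each failing case the remaining variables are left unassigned, so they should not be used.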

Example Use/Output

Taking this approach, the program above produces:

./bin/parse_sscanf
name : Playa Raco
path : path
num  : 5
lat  : 39.244000
lon  : -0.257000
last : 0-23

No one way is necessarily "better" than another so long as you validate the conversions and protect the array bounds for any character arrays filled. However, as far as simple as possible goes, it's hard to beat sscanf() here -- and it doesn't modify your original string, so it is safe to use with string-literals.
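
For comparison, the non-modifying strcspn() alternative mentioned at the top of this answer might look something like the sketch below. `next_field()` is a hypothetical helper written for this illustration, not a library function; it assumes `dstsize` is at least 1.

```c
#include <string.h>

/* Hypothetical helper: copy the field that starts at *pp into dst.
 * The field ends at the first character found in `delims` or at the
 * end of the string. At most dstsize-1 characters are copied and dst
 * is always nul-terminated (dstsize must be >= 1).
 * *pp is advanced past the field and past one delimiter, if present.
 * Returns the full length of the field, so a return >= dstsize
 * signals truncation. The source string is never written to. */
size_t next_field (const char **pp, const char *delims,
                   char *dst, size_t dstsize)
{
    const char *p = *pp;
    size_t len = strcspn (p, delims);   /* chars before next delimiter */
    size_t n = len < dstsize - 1 ? len : dstsize - 1;

    memcpy (dst, p, n);
    dst[n] = '\0';

    p += len;
    if (*p != '\0')
        p++;                            /* step over the delimiter */
    *pp = p;

    return len;
}
```

Starting with a `const char *p` pointing at the string and calling `next_field (&p, "#|", buf, sizeof buf)` repeatedly yields each field in turn. Unlike strtok(), an empty field comes back as an empty string rather than being skipped, and the numeric fields still need converting, for example with strtol() and strtod().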

David C. Rankin
  • Hey David. my code: https://godbolt.org/z/c49fT9GjM Can you share some simple example about `" %15[^#]# %15[^#]#%d#%lf|%lf# %15[^\n]"` I just found it so hard to understand. – jian Jul 28 '22 at 06:58
  • @Mark `" ` (the first space) tells `sscanf()` to ignore all leading whitespace (all conversions except `"%c"`, `"%[...]"` and `"%n"` ignore leading whitespace on their own). `%15[^#]` reads at most 15 chars (the *field-width*) from the set listed in `[...]`, and when the first char of the set is `'^'` the set is inverted. So `%15[^#]` reads at most 15 characters, none of which is `'#'`. The following `'#'` means read the literal `'#'` character. `"#%d#%lf"` just reads a `'#'` character followed by an `int`, then another `'#'` followed by a `double`, and so on... Learning `scanf()` will save much grief. – David C. Rankin Jul 28 '22 at 19:41
  • Also, using `sscanf()` allows the read with `fgets()` which you should use for all user input to avoid the many pitfalls with attempting to use `scanf()` directly for user input. (or for any input for that matter). The only place there is an argument for using `fscanf()` directly is reading a machine created input file where you have a high-guarantee that there will be no formatting errors that will break the read of the remainder of the file. Otherwise `fgets()` followed by `sscanf()` (or any of the other parsing methods) is the way to go reading text input from whatever source. – David C. Rankin Jul 28 '22 at 19:49
  • Hey David. For me the hardest part to understand in `%15[^#]#` is that there is no specifier like %s, %c, or %d. I read the cppreference.com printf page. I understand the [^#] part. I also understand the 15. But there is no c, d, or s specifier here. Why? I cannot find this part in cppreference or the Open Group manual. – jian Jul 29 '22 at 04:27
  • @Mark The `"%[...]"` is a specifier all on its own. That specifier will match any characters within the list `[...]` (unless the first char is `'^'`, in which case the list is negated). The link I provided above to `man 1 scanf` explains it well. In this specific case `"%15[^#]#"` just says read up to 15 chars (saving room for the `'\0'` character) of any character that is not `'#'` (`"%15[^#]"`). So read until you find `'#'` and stop; do not consider the `'#'` as part of the string you read, leaving it unread in the input. The `"...#"` at the end then says -- read the next `'#'` (the separator). – David C. Rankin Jul 29 '22 at 04:36
  • So the *format-string* can be viewed as `"spec1#spec2#spec3#..."` where `spec1` is the first conversion specifier (`"%15[^#]"`), then read the `'#'` separator, then read the text that matches the next conversion specifier `spec2` (same thing), then read the separator, then read the text matching the third conversion specifier `spec3` (an `int`), then the separator -- and so on and so forth until you read all you want to read. – David C. Rankin Jul 29 '22 at 04:40
  • Oops sorry [man 3 scanf](https://man7.org/linux/man-pages/man3/scanf.3.html). It's worth taking an hour or two to learn `scanf()` -- which means don't skim it, but read and understand every word and every conversion specifier and flag involved. If you spend 2 hours learning `scanf()` it will save you 10X that amount of time later `:)` I found it helpful to have a short source file open, and when I was confused on what any part of `scanf()` did, I would write a couple of lines of code exercising that part, compile and make sure I was thinking right. That really helps you digest the man page. – David C. Rankin Jul 29 '22 at 04:44
  • Also remember you cannot use any of the `scanf` family of functions correctly unless you validate the wanted conversions took place by **checking the return**. – David C. Rankin Jul 29 '22 at 04:54
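
The `fgets()` followed by `sscanf()` pattern recommended in the comments above, combined with the whitespace-tolerant format string jian asked about, might be sketched as below. `parse_stream()` and the 256-byte line buffer are assumptions made for this illustration; a line longer than the buffer would be split across two `fgets()` calls and rejected by the return check.

```c
#include <stdio.h>

#define MAXC   16    /* field buffer size, as in the answer */
#define LINESZ 256   /* assumed maximum line length for this sketch */

/* Read fp one line at a time with fgets(), then parse each line with
 * sscanf(). Returns the number of lines for which all six conversions
 * succeeded. */
int parse_stream (FILE *fp)
{
    char line[LINESZ];
    int good = 0;

    while (fgets (line, sizeof line, fp)) {
        char name[MAXC], path[MAXC], last[MAXC];
        int num;
        double lat, lon;

        /* each space in the format skips whitespace before a %[...] */
        if (sscanf (line, " %15[^#]# %15[^#]#%d#%lf|%lf# %15[^\n]",
                    name, path, &num, &lat, &lon, last) == 6) {
            printf ("%s, %s, %d, %f, %f, %s\n",
                    name, path, num, lat, lon, last);
            good++;
        }
        else    /* bad line: report it; the next fgets() starts clean */
            fputs ("error: skipping malformed line.\n", stderr);
    }
    return good;
}
```

Calling `parse_stream (stdin)` would process user input line by line. Because `fgets()` consumes the whole line whether or not it parses, one malformed line cannot break the reads that follow, which is the pitfall with calling `scanf()` directly.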