2

I am trying to parse UnicodeData.txt, whose format is described at ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html. However, I hit a problem when I try to read a line like the following:

something;123D;;LINE TABULATION;

I try to extract the fields with code like the following, where in is the current line. The problem is that fields[3] is never filled in, and sscanf returns 2.

char fields[4][256];
sscanf(in, "%[^;];%[^;];%[^;];%[^;];%[^;];",
    fields[0], fields[1], fields[2], fields[3]);

I know this is the correct behavior for scanf(), but is there a way to get this to work, short of writing my own scanf()?

trolleyman
  • why are you reading 5 times using 5 % ? – Nikole Apr 09 '14 at 22:07
  • Why do you think this is correct behavior for `scanf()`? I would expect `fields[2]` to be filled in with a blank string and `fields[3]` to be filled in with `LINE TABULATION`. And that is exactly what `scanf()` does in my C++ compiler (C++Builder XE2), which makes me think this is a bug in your compiler's `scanf()` implementation. – Remy Lebeau Apr 09 '14 at 22:17
  • it's actually the correct behaviour for `scanf`, see my answer below... – Massa Apr 09 '14 at 22:23
  • I hate `scanf` family – M.M Apr 09 '14 at 23:17

3 Answers

4

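scanf does not handle "empty" fields: %[^;] must match at least one character, so at the empty third field the conversion fails and sscanf stops, returning the number of fields assigned so far. So you will have to parse the line on your own.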

The following solution is:

  • fast, as it uses strchr rather than the quite slow sscanf
  • flexible, as it will detect an arbitrary number of fields, up to a given maximum.

The function parse extracts fields from the input str, separated by semi-colons. Four semi-colons give five fields, some or all of which can be blank. No provision is made for escaping the semi-colons.

#include <stdio.h>
#include <string.h>

static int parse(char *str, char *out[], int max_num) {
    int num = 0;
    out[num++] = str;
    while (num < max_num && str && (str = strchr(str, ';'))) {
        *str = 0;           // nul-terminate previous field
        out[num++] = ++str; // save start of next field
    }
    return num;
}

int main(void) {
    char test[] = "something;123D;;LINE TABULATION;";
    char *field[99];
    int num = parse(test, field, 99);
    int i;
    for (i = 0; i < num; i++)
        printf("[%s]", field[i]);
    printf("\n");
    return 0;
}

The output of this test program is:

[something][123D][][LINE TABULATION][]

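Update: A slightly shorter version, which doesn't require an extra array to store the start of each substring, is shown below. Note that it returns the number of semicolons rather than the number of fields, so the loop prints one field per semicolon and ignores any text after the last one: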

#include <stdio.h>
#include <string.h>

static int replaceSemicolonsWithNuls(char *p) {
    int num = 0;
    while ((p = strchr(p, ';'))) {
        *p++ = 0;
        num++; 
    }
    return num;
}

int main(void) {
    char test[] = "something;123D;;LINE TABULATION;";
    int num = replaceSemicolonsWithNuls(test);
    int i;
    char *p = test;
    for (i = 0; i < num; i++, p += strlen(p) + 1)
        printf("[%s]", p);
    printf("\n");
    return 0;
}
Joseph Quinsey
  • Even though this is not strictly a `scanf` method, it is still a good alternative solution if you want to store the data in an array, as in my case. Thank you, Joseph. – Sopalajo de Arrierez Mar 18 '19 at 21:23
2

In case you would like to consider an alternative, the following uses sscanf with the "%n" format specifier, which stores the number of characters read so far into an integer:

#include <stdio.h>
#define N 4

int main( ){

    char * str = "something;123D;;LINE TABULATION;";
    char * wanderer = str;
    char fields[N][256] = { 0 };
    int n;

    for ( int i = 0; i < N; i++ ) {
        n = 0;
        printf( "%d ", sscanf( wanderer, "%255[^;]%n", fields[i], &n ) );
        wanderer += n + 1;
    }

    putchar( 10 );

    for ( int i = 0; i < N; i++ )
        printf( "%d: %s\n", i, fields[i] );

    getchar( );
    return 0;
}

On every iteration, sscanf reads at most 255 characters into the corresponding fields[i], stopping when it encounters a delimiter semicolon ;. The %n directive then stores the number of characters consumed so far in n. n is zeroed beforehand because an empty field makes %[^;] fail before %n is reached, which would otherwise leave a stale value in n.

The pointer into the string is then advanced by the number of characters read, plus one for the delimiter semicolon.

Printing the return value of sscanf and the resulting fields is just for demonstration purposes. You can see the code working at http://codepad.org/kae8smPF, without the getchar() call and with the loop variable declarations moved out of the for statements for C90 compliance.

chux - Reinstate Monica
Utkan Gezer
1

I don't think sscanf will do what you need: the sscanf format %[^;] only matches a non-empty sequence of non-semicolon characters. An alternative is std::getline with ';' as the delimiter, like:

#include <iostream>
#include <sstream>
#include <string>

int main() {
  using namespace std;
  istringstream i { "something;123D;;LINE TABULATION;\nsomething;123D;;LINE TABULATION;\nsomething;123D;;LINE TABULATION;\n" };
  string a, b, c, d, newline;
  while( getline(i, a, ';') && getline(i, b, ';') && getline(i, c, ';') && getline (i, d, ';') && getline(i, newline) )
    cout << d << ',' << c << '-' << b << ':' << a << endl; 
}

(I only just noticed that you took the c++ tag off this question. If your problem is C-only, I have another solution below:)

#include <string.h>
#include <stdio.h>

int main() {
  typedef char buffer[2048];
  buffer line;
  while( fgets(line, sizeof(line), stdin) != NULL ) {
    printf("(%s)\n", line);
    char *end = line;
    char *s1 = *end == ';' ? (*end = '\0'), end++ : strtok_r(end, ";", &end);
    char *s2 = *end == ';' ? (*end = '\0'), end++ : strtok_r(end, ";", &end);
    char *s3 = *end == ';' ? (*end = '\0'), end++ : strtok_r(end, ";", &end);
    char *s4 = *end == ';' ? (*end = '\0'), end++ : strtok_r(end, ";", &end);
    printf("[%s][%s][%s][%s]\n", s4, s3, s2, s1);
  }
}
Massa
  • Argh. Sometimes I just think the STL is a little ugly, but I guess this is the only way. Thanks very much! – trolleyman Apr 09 '14 at 22:19
  • @trolleyman: I guess _I_ am one of the people that think it's just beautiful, so we'll have to agree on disagreeing (on the beauty of the STL) **but** anyway, I put a C-only solution above! (and yes, `strtok`/`strtok_r` **skips** empty fields, so we have to take care of them separately...) :D – Massa Apr 09 '14 at 22:42