2

I have a multiline TSV file with the following format:

Type\tBasic Name\tAttribute\tA Long Description\n

As you can see, the Basic Name and the Description can both contain some number of spaces. I am trying to read each line in and extract the elements. For now, I've narrowed it down to just extracting the basic name. My fscanf is as follows:

fscanf(file_in, "%*[^ ]s\t%128[^ ]s\t%*[^ ]s\t%[^ ]s\n", name_string, desc_string);

This doesn't work as I had hoped, and I'm having trouble narrowing down the error. Does anyone know how I could read in the lines properly?

Tanaki

3 Answers

3

I mostly agree with Pablo (that the scanf family don't make great parsers), but it's worth understanding how to write a scanf pattern. The pattern you're looking for is something like this:

fscanf(file_in, " %*[^\t] %128[^\t] %*[^\t] %128[^\n]", name_string, desc_string);

Notes:

  1. %[xyz] is a directive. %[xyz]s is two directives, the second of which matches a literal s

  2. As far as I know, there is no way to match a single literal tab character, since any whitespace in the pattern matches any amount of whitespace (including none) in the input. I used a space in my example, which will match a terminating tab, but it will also match any number of consecutive tabs, so empty fields won't be parsed correctly.

  3. The 128-character limit does not include the terminating NUL character, so the buffer must hold at least 129 bytes.

  4. Also, if the scan stops because the character limit is exceeded, it won't skip the rest of the field automatically, so you'll end up out of sync with the input.
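Notes 1 and 2 can be checked with a small sketch. It uses sscanf on an in-memory line instead of fscanf so it is easy to try; the function names are made up for illustration, and a width was added to the last scanset of the question's format so the sketch cannot overflow its buffer.

```c
#include <stdio.h>

/* Parses `line` with the format from the question: a stray "s" after each
 * scanset and a space (not a tab) as the stop character.
 * Returns the number of fields sscanf assigned. */
int fields_with_question_format(const char *line, char name[129], char desc[129])
{
    return sscanf(line, "%*[^ ]s\t%128[^ ]s\t%*[^ ]s\t%128[^ ]s\n", name, desc);
}

/* Parses `line` with tab-delimited scansets and no trailing "s".
 * Returns the number of fields sscanf assigned. */
int fields_with_tab_format(const char *line, char name[129], char desc[129])
{
    return sscanf(line, " %*[^\t] %128[^\t] %*[^\t] %128[^\n]", name, desc);
}
```

For a line such as "Type\tBasic Name\tAttr\tA Long Description\n", the first function returns 0: the first scanset runs up to the space after "Basic", and the literal s then fails to match that space. The second returns 2. With an empty second field ("Type\t\tAttr\tDesc\n") the second function returns 1 and puts "Attr" in name, which is the empty-field problem from note 2.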

A better pattern would be:

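fscanf(file_in, " %*[^\t] %128[^\t]%*[^\t] %*[^\t] %128[^\n]%*[^\n]", name_string, desc_string);

which explicitly skips the remaining characters in the field, if necessary. Be aware that a scanset directive fails when it matches no characters, so when a field fits within the limit the skip directive ends the whole call early; issuing each skip as its own fscanf call avoids that. An even better solution would be to use the m modifier (POSIX.1-2008; older glibc versions spelled it a) and get fscanf to malloc memory for you.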
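A minimal sketch of the skip-the-overflow idea, with each skip issued as a separate fscanf call so that a field shorter than the limit does not stop the scan. The function name read_record and the 129-byte buffers are assumptions for illustration, not part of the original code.

```c
#include <stdio.h>

/* Reads one "Type\tName\tAttribute\tDescription\n" record from `in`.
 * Returns 1 on success, 0 on end of input or a malformed record. */
int read_record(FILE *in, char name[129], char desc[129])
{
    /* Skip field 1, then read at most 128 characters of field 2. */
    if (fscanf(in, " %*[^\t] %128[^\t]", name) != 1)
        return 0;

    /* Discard any overflow of field 2. When nothing is left to skip this
     * call matches nothing and returns 0, which is fine; only EOF means
     * the input ended in the middle of the record. */
    if (fscanf(in, "%*[^\t]") == EOF)
        return 0;

    /* Skip field 3, then read at most 128 characters of field 4. */
    if (fscanf(in, " %*[^\t] %128[^\n]", desc) != 1)
        return 0;

    /* Discard any overflow of field 4. EOF is acceptable here when the
     * final line simply lacks a trailing newline. */
    if (fscanf(in, "%*[^\n]") == EOF && ferror(in))
        return 0;

    return 1;
}
```

Calling read_record(file_in, name_string, desc_string) in a loop until it returns 0 walks the file one record at a time. Empty fields are still not handled, for the reason given in note 2.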

rici
  • Ah, I see now. I was a bit confused on the use of [], but this makes sense. Although, I'm not sure I see the advantage of using %128[^\t]%*[^\t] over %128[^\t] for reading in the data. – Tanaki Oct 11 '12 at 00:40
  • 1
    @Tanaki because if you limit to 128 characters, it stops scanning after 128 characters and the remaining characters in the field will be matched by the next directive, which was supposed to match the next field. – rici Oct 11 '12 at 00:54
  • You're a lifesaver, I forgot about overflow potential! – Tanaki Oct 11 '12 at 01:00
2

I'd rather use strtok for this. It's more accurate than fscanf, since the scanf family only works when the input matches the format 100%; otherwise you end up missing values.

Take a look at Parallel to PHP's "explode" in C: Split char* into char* using delimiter, where I explain in more detail how to use strtok.

So, read each line with fgets and parse it with strtok.
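A rough sketch of that approach, assuming lines shorter than 1024 bytes and exactly four fields per record; split_tsv_line and print_records are hypothetical helper names.

```c
#include <stdio.h>
#include <string.h>

/* Splits `line` in place on tabs and stores up to `max_fields` pointers in
 * `fields`. The newline left by fgets is treated as a delimiter as well, so
 * it is stripped from the last field. Returns the number of fields found. */
int split_tsv_line(char *line, char *fields[], int max_fields)
{
    int count = 0;
    for (char *tok = strtok(line, "\t\n");
         tok != NULL && count < max_fields;
         tok = strtok(NULL, "\t\n"))
        fields[count++] = tok;
    return count;
}

/* Reads `in` line by line and prints the name and description of every
 * record that has all four fields. Returns the number of records printed. */
int print_records(FILE *in)
{
    char line[1024];   /* assumed upper bound on the line length */
    char *fields[4];
    int records = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (split_tsv_line(line, fields, 4) == 4) {
            printf("name=%s desc=%s\n", fields[1], fields[3]);
            records++;
        }
    }
    return records;
}
```

One caveat: strtok treats a run of consecutive delimiters as a single delimiter, so a record with an empty field comes back with fewer tokens instead of an empty string in that position. strtok also keeps its position in hidden static state, so it cannot be used to split two strings at the same time.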

Pablo
0

Firstly, as it has already been noted, the %[] is a conversion specifier by itself. There's no s after the []. The s-es that you have in your format string will not be considered parts of the conversion specifiers. You have to get rid of those s-es.

Secondly, as you said yourself, your file is TAB-separated, which immediately means that you should extract the continuous portions of the sequence by using the %[^\t] conversion specifier (or the %[^\n] specifier for the last portion). Why did you use %[^ ] and how did you expect it to work? The %[^ ] actually stops parsing at a space character, which is the opposite of what you wanted.

In your example the proper combination of specifiers would be

fscanf(file_in, "%*[^\t]\t%128[^\t]\t%*[^\t]\t%[^\n]\n", name_string, desc_string);

This format string assumes that all 4 portions of the string are guaranteed to be present and that the last portion is guaranteed to be terminated by \n.
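Because of that assumption, it is worth checking the return value, which counts the fields actually assigned. A small sketch using sscanf on a line already in memory; parse_line is a hypothetical name, and the 256-character width on the description is an assumption added so the buffer cannot overflow.

```c
#include <stdio.h>

/* Parses one tab-separated line. Returns 1 only when both the name and
 * the description were assigned, 0 otherwise. */
int parse_line(const char *line, char name[129], char desc[257])
{
    /* Two conversions store a value, so a complete line yields 2. */
    return sscanf(line, "%*[^\t]\t%128[^\t]\t%*[^\t]\t%256[^\n]\n",
                  name, desc) == 2;
}
```

A line with a missing portion, such as "Type\tBasic Name\tAttr", makes parse_line return 0 instead of leaving desc with stale contents that look valid.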

AnT stands with Russia