
I'm relatively new to C/C++ and am currently attempting to use it to parse large, formatted text files containing numerical data into arrays, so that I can work with them using the LAPACK library.
The text files I am parsing have a very simple format: a 5-line header followed by 50 values, then the next 5-line header and 50 values, repeated about a million times:

5 line header
1.000000E+00 2.532093E+02
2.000000E+00 7.372978E+02
3.000000E+00 5.690047E+02

My current approach is to use the fscanf function, but I am getting strange results. I'm currently using a very naive approach to skip over the lines containing the header text, but I fear this might be the problem. Or perhaps my use of fscanf is flawed. Here is what I have so far:

int main() {
    FILE *ifp;
    FILE *ofp;
    char mystring[500];
    int i,j,n;

    //ofp = fopen("newfile.txt","w");
    ifp = fopen("results","r");
    if (ifp != NULL) {
        //Test with 10 result blocks each containing 50 frequency values
        float** A = fmatrix(50,10);
        for (j=0; j<10; j++) {
            //fscanf(ifp, "%*[^\n]\n", NULL);
            //fscanf(ifp, "%*[^\n]\n", NULL);
            //fscanf(ifp, "%*[^\n]\n", NULL);
            //fscanf(ifp, "%*[^\n]\n", NULL);
            //fscanf(ifp, "%*[^\n]\n", NULL);

            //using fgets w/ printf to see contents of "discarded" lines
            fgets(mystring,500,ifp); printf("%s",mystring);
            fgets(mystring,500,ifp); printf("%s",mystring);
            fgets(mystring,500,ifp); printf("%s",mystring);
            fgets(mystring,500,ifp); printf("%s",mystring);
            fgets(mystring,500,ifp); printf("%s",mystring);

            for (i=0; i<50; i++) {
                //skip over first float, store the next float into A[i][j]
                n=fscanf(ifp," %*e %E", &A[i][j]);
                printf("A[%i][%i]: %E, %i\n",i,j,A[i][j],n);
            }
        }
    }
    return 0;
}

float** fmatrix(int m, int n) {
    //Return an m x n Matrix
    int i;
    float** A = (float**)malloc(m*sizeof(float*));
    A[0] = (float*)malloc(m*n*sizeof(float));
    for (i = 1; i < m; i++) {
        A[i] = A[i-1]+n;
    }
    return A;
}

What I get as a result is curious. I get a 50-component column vector which matches the results file, then 50 zeros as the second column vector, and the third column vector corresponds to the second set of values in my results file, and so on. That is, I get alternating columns of zeros and non-zero values in my matrix. I later inserted the fgets lines to see what was going on, and to my surprise, some of the lines being discarded contained numeric data, and were not just header lines.

I was hoping someone could maybe have an idea what is, or what could be wrong here? Since this is such a simple format, I really don't even know where the problem could lie. Another related question is: what is the preferred method for skipping over header text? The method I am using is practically single-use only, since any changes in header / file format would render the code worthless. Perhaps use fgets to check whether the format matches the data part of the file, and skip over any lines that do not match the 2-column pattern?

A final question regarding performance: Bugs aside, is fscanf the best way to proceed here? As I mentioned earlier, these files can sometimes have sizes of several hundred million lines, and I'm not at all well enough versed in C/C++ to know if there are faster ways of reading such large amounts of lines into matrices / vectors.

I hope I have provided enough information here to make my question clear. If need be, I can post excerpts of my results files here.

user999318
  • Which is it C or C++? Guess using `fscanf` is C. So remove the tag – Ed Heal Dec 29 '13 at 23:12
  • What does the header look like? If it was me I'd prefer to detect the header lines and discard them rather than assume the file is always in the perfect format. – Retired Ninja Dec 29 '13 at 23:32
  • Just a quick tip: fscanf, since it can run over end-of-line, tends to confuse programmers. It's often easier to read each line into a string and use sscanf over that. – keshlam Dec 30 '13 at 00:10
  • [Don't cast the return value of `malloc()`](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc/605858#605858). A 2D array is preferred over a pointer-to-pointer: `float (*a)[n] = malloc(sizeof(*a) * m);`. Don't use `scanf()` because it's confusing -- sane alternatives for getting and parsing user input include `fgets()`, `strstr()`, `strtok_r()`, `strtod()`, `strtol()`, etc. –  Dec 30 '13 at 00:27

1 Answer


Because you are not consistently using fgets(), you read 5 header lines OK, then 50 numbers, but the last number leaves the newline on line 55 ready to be read by the first fgets() of the next block of header lines. So the second block of header reading reads the newline (only), then 4 header lines, and then the data scanning tries to read the last line of the header as a number and (probably) fails.
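
The leftover newline can be reproduced in isolation. The sketch below is illustrative and not part of the original program: the helper name `first_fgets_after_fscanf` and the two-line sample input are assumptions, and `tmpfile()` stands in for the real results file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes one data line and one header line to a
 * temporary file, reads the data with fscanf(), then calls fgets() once.
 * Stores what fgets() returned in out; returns 0 on success, -1 on error. */
int first_fgets_after_fscanf(char *out, size_t outlen)
{
    FILE *fp = tmpfile();
    float value;
    int rc = -1;

    if (fp == NULL)
        return -1;
    fputs("1.000000E+00 2.532093E+02\nLine 1 of heading 2\n", fp);
    rewind(fp);

    /* Same format as in the question: skip one float, store the next. */
    if (fscanf(fp, " %*e %E", &value) == 1 &&
        fgets(out, (int)outlen, fp) != NULL)
        rc = 0;

    fclose(fp);
    return rc;
}
```

After a successful call, `out` holds only "\n": the conversion stops at the end of the number, so the first fgets() consumes the rest of the data line rather than the header line.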

Always check the return value from every input function (even if it seems to make life painful).

And, I suggest, use fgets() to read each line. Skip the heading lines; use sscanf() to convert the data on the data lines. But check both fgets() and sscanf() for the correct return values. There are other functions to convert strings to numbers; strtod() could be used.
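
As a rough sketch of the strtod() route, the function below parses one line that has already been read with fgets(). The name `parse_second_column` is hypothetical, it stores a double rather than a float, and it assumes that no header line consists of exactly two numbers. Because it reports whether a line looks like data, it could also be used to skip header lines by content instead of by count.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a line of the form "<number> <number>" and
 * store the second number in *value. Returns 1 for a data line, 0 for
 * anything else (for example a header line). */
int parse_second_column(const char *line, double *value)
{
    char *end;
    double second;

    (void)strtod(line, &end);       /* first column is parsed and discarded */
    if (end == line)
        return 0;                   /* no number at the start: not data */
    line = end;
    errno = 0;
    second = strtod(line, &end);
    if (end == line || errno == ERANGE)
        return 0;                   /* second number missing or out of range */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;                      /* allow trailing whitespace only */
    if (*end != '\0')
        return 0;                   /* unexpected trailing text */
    *value = second;
    return 1;
}
```

A reading loop could call this on every line and simply ignore lines for which it returns 0, which avoids hard-coding the number of header lines.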

Here's some working code, cut down to work on 5 blocks of data with 10 lines per set (and still 5 header lines):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern float **fmatrix(int m, int n);

enum { HDRS = 5, ROWS = 10, COLS = 5 };

static int read_line(FILE *ifp, char *buffer, size_t buflen)
{
    if (fgets(buffer, buflen, ifp) == 0)
    {
        fprintf(stderr, "EOF\n");
        return 0;
    }
    size_t len = strlen(buffer);
    buffer[len-1] = '\0';
    printf("[[%s]]\n", buffer);
    return 1;
}

int main(void)
{
    FILE *ifp;
    char mystring[500];
    int i, j, n;

    ifp = stdin;
    if (ifp != NULL)
    {
        // Test with COLS result blocks each containing ROWS frequency values
        float **A = fmatrix(ROWS, COLS);
        for (j = 0; j < COLS; j++)
        {
            // using fgets w/ printf to see contents of "discarded" lines
            for (i = 0; i < HDRS; i++)
            {
                if (read_line(ifp, mystring, sizeof(mystring)) == 0)
                    break;
            }

            for (i = 0; i < ROWS; i++)
            {
                // skip over first float, store the next float into A[i][j]
                if (read_line(ifp, mystring, sizeof(mystring)) == 0)
                    break;
                if ((n = sscanf(mystring, " %*e %E", &A[i][j])) != 1)
                    break;
                printf("A[%i][%i]: %E, %i\n", i, j, A[i][j], n);
            }
        }

        for (i = 0; i < ROWS; i++)
        {
            for (j = 0; j < COLS; j++)
                printf("%8.3f", A[i][j]);
            putchar('\n');
        }
    }
    return 0;
}

float **fmatrix(int m, int n)
{
    // Return an m x n Matrix
    int i;
    float **A = (float **)malloc(m * sizeof(float *));
    A[0] = (float *)malloc(m * n * sizeof(float));
    for (i = 1; i < m; i++)
    {
        A[i] = A[i - 1] + n;
    }
    return A;
}

Smaller data file:

Line 1 of heading 1
Line 2 of heading 1
Line 3 of heading 1
Line 4 of heading 1
Line 5 of heading 1
18.1815 56.4442
12.0478 15.5530
47.7793 44.5291
30.8319 78.9396
53.5651 28.1290
74.9131 90.5912
34.9319 10.5254
69.7780 56.8633
92.5056 11.8101
82.0158 31.7586
Line 1 of heading 2
Line 2 of heading 2
Line 3 of heading 2
Line 4 of heading 2
Line 5 of heading 2
118.15 564.442
104.78 155.530
477.93 445.291
383.19 789.396
556.51 281.290
791.31 905.912
393.19 105.254
677.80 568.633
950.56 118.101
801.58 317.586
Line 1 of heading 3
Line 2 of heading 3
Line 3 of heading 3
Line 4 of heading 3
Line 5 of heading 3
18.1815 36.4442
12.0478 35.5530
47.7793 34.5291
30.8319 38.9396
53.5651 38.1290
74.9131 30.5912
34.9319 30.5254
69.7780 36.8633
92.5056 31.8101
82.0158 31.7586
Line 1 of heading 4
Line 2 of heading 4
Line 3 of heading 4
Line 4 of heading 4
Line 5 of heading 4
118.15 464.442
104.78 455.530
477.93 445.291
383.19 489.396
556.51 481.290
791.31 405.912
393.19 405.254
677.80 468.633
950.56 418.101
801.58 417.586
Line 1 of heading 5
Line 2 of heading 5
Line 3 of heading 5
Line 4 of heading 5
Line 5 of heading 5
118.15 564.442
104.78 555.530
477.93 545.291
383.19 589.396
556.51 581.290
791.31 505.912
393.19 505.254
677.80 568.633
950.56 518.101
801.58 517.586

Note that the block of 20 random numbers was edited in different ways to get different numbers in each block. There's a strong genetic resemblance between the values in the blocks, though.

Result of running the program on the data file.

[[Line 1 of heading 1]]
[[Line 2 of heading 1]]
[[Line 3 of heading 1]]
[[Line 4 of heading 1]]
[[Line 5 of heading 1]]
[[18.1815 56.4442]]
A[0][0]: 5.644420E+01, 1
[[12.0478 15.5530]]
A[1][0]: 1.555300E+01, 1
[[47.7793 44.5291]]
A[2][0]: 4.452910E+01, 1
[[30.8319 78.9396]]
A[3][0]: 7.893960E+01, 1
[[53.5651 28.1290]]
A[4][0]: 2.812900E+01, 1
[[74.9131 90.5912]]
A[5][0]: 9.059120E+01, 1
[[34.9319 10.5254]]
A[6][0]: 1.052540E+01, 1
[[69.7780 56.8633]]
A[7][0]: 5.686330E+01, 1
[[92.5056 11.8101]]
A[8][0]: 1.181010E+01, 1
[[82.0158 31.7586]]
A[9][0]: 3.175860E+01, 1
[[Line 1 of heading 2]]
[[Line 2 of heading 2]]
[[Line 3 of heading 2]]
[[Line 4 of heading 2]]
[[Line 5 of heading 2]]
[[118.15 564.442]]
A[0][1]: 5.644420E+02, 1
[[104.78 155.530]]
A[1][1]: 1.555300E+02, 1
[[477.93 445.291]]
A[2][1]: 4.452910E+02, 1
[[383.19 789.396]]
A[3][1]: 7.893960E+02, 1
[[556.51 281.290]]
A[4][1]: 2.812900E+02, 1
[[791.31 905.912]]
A[5][1]: 9.059120E+02, 1
[[393.19 105.254]]
A[6][1]: 1.052540E+02, 1
[[677.80 568.633]]
A[7][1]: 5.686330E+02, 1
[[950.56 118.101]]
A[8][1]: 1.181010E+02, 1
[[801.58 317.586]]
A[9][1]: 3.175860E+02, 1
[[Line 1 of heading 3]]
[[Line 2 of heading 3]]
[[Line 3 of heading 3]]
[[Line 4 of heading 3]]
[[Line 5 of heading 3]]
[[18.1815 36.4442]]
A[0][2]: 3.644420E+01, 1
[[12.0478 35.5530]]
A[1][2]: 3.555300E+01, 1
[[47.7793 34.5291]]
A[2][2]: 3.452910E+01, 1
[[30.8319 38.9396]]
A[3][2]: 3.893960E+01, 1
[[53.5651 38.1290]]
A[4][2]: 3.812900E+01, 1
[[74.9131 30.5912]]
A[5][2]: 3.059120E+01, 1
[[34.9319 30.5254]]
A[6][2]: 3.052540E+01, 1
[[69.7780 36.8633]]
A[7][2]: 3.686330E+01, 1
[[92.5056 31.8101]]
A[8][2]: 3.181010E+01, 1
[[82.0158 31.7586]]
A[9][2]: 3.175860E+01, 1
[[Line 1 of heading 4]]
[[Line 2 of heading 4]]
[[Line 3 of heading 4]]
[[Line 4 of heading 4]]
[[Line 5 of heading 4]]
[[118.15 464.442]]
A[0][3]: 4.644420E+02, 1
[[104.78 455.530]]
A[1][3]: 4.555300E+02, 1
[[477.93 445.291]]
A[2][3]: 4.452910E+02, 1
[[383.19 489.396]]
A[3][3]: 4.893960E+02, 1
[[556.51 481.290]]
A[4][3]: 4.812900E+02, 1
[[791.31 405.912]]
A[5][3]: 4.059120E+02, 1
[[393.19 405.254]]
A[6][3]: 4.052540E+02, 1
[[677.80 468.633]]
A[7][3]: 4.686330E+02, 1
[[950.56 418.101]]
A[8][3]: 4.181010E+02, 1
[[801.58 417.586]]
A[9][3]: 4.175860E+02, 1
[[Line 1 of heading 5]]
[[Line 2 of heading 5]]
[[Line 3 of heading 5]]
[[Line 4 of heading 5]]
[[Line 5 of heading 5]]
[[118.15 564.442]]
A[0][4]: 5.644420E+02, 1
[[104.78 555.530]]
A[1][4]: 5.555300E+02, 1
[[477.93 545.291]]
A[2][4]: 5.452910E+02, 1
[[383.19 589.396]]
A[3][4]: 5.893960E+02, 1
[[556.51 581.290]]
A[4][4]: 5.812900E+02, 1
[[791.31 505.912]]
A[5][4]: 5.059120E+02, 1
[[393.19 505.254]]
A[6][4]: 5.052540E+02, 1
[[677.80 568.633]]
A[7][4]: 5.686330E+02, 1
[[950.56 518.101]]
A[8][4]: 5.181010E+02, 1
[[801.58 517.586]]
A[9][4]: 5.175860E+02, 1
  56.444 564.442  36.444 464.442 564.442
  15.553 155.530  35.553 455.530 555.530
  44.529 445.291  34.529 445.291 545.291
  78.940 789.396  38.940 489.396 589.396
  28.129 281.290  38.129 481.290 581.290
  90.591 905.912  30.591 405.912 505.912
  10.525 105.254  30.525 405.254 505.254
  56.863 568.633  36.863 468.633 568.633
  11.810 118.101  31.810 418.101 518.101
  31.759 317.586  31.759 417.586 517.586
Jonathan Leffler
  • Thank you so much! The (apparent) simplicity of fscanf had seduced me, but also got me into trouble. Reading line by line and always checking the return value it is in the future! – user999318 Dec 30 '13 at 00:57
  • @user999318: words of wisdom … yes, that generally makes life more predictable. `scanf()` and `fscanf()` are fearsomely difficult to use reliably, and they also make it hard to report errors well when you don't know how many lines of blanks they ate, or how much they managed to convert of the input before giving up with an error. Using `fgets()` or a relative, you at least can present the whole line of input in the error message. – Jonathan Leffler Dec 30 '13 at 01:15
  • @Jonathan Leffler Under select situations, `fgets()` returns a `""` or a string that does not end in `'\n'`. Do you see value in `if (len > 0 && buffer[len-1] == '\n') buffer[--len] = '\0';`? – chux - Reinstate Monica Dec 30 '13 at 02:59
  • You're correct, and I'm being lazy. The only time it can return a zero length string is on EOF or error, so the `len > 0` check should be irrelevant (unless you pass a buffer of size 0 or 1, but I'm giving some credit for non-idiocy). If the lines are overlong, then I'll chop the last character that was read and will also treat the residue as a second (and then third, fourth, ...) line. It boils down to how much do you trust your input, and how much damage will occur if you guess wrong. For this context, a 500-byte buffer will store any reasonably formatted pair of numbers without difficulty. – Jonathan Leffler Dec 30 '13 at 03:07
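
The more defensive newline handling discussed in these comments could be sketched as follows. The helper name `chomp` and its return convention are assumptions made for illustration; only the `len > 0 && buffer[len-1] == '\n'` test comes from the comment above.

```c
#include <string.h>

/* Hypothetical helper: remove a trailing newline left by fgets(), if any.
 * Returns 1 if a newline was removed (a complete line was read), or 0 if
 * the buffer held no newline (overlong line, or last line without one). */
int chomp(char *buffer)
{
    size_t len = strlen(buffer);

    if (len > 0 && buffer[len - 1] == '\n')
    {
        buffer[len - 1] = '\0';
        return 1;
    }
    return 0;
}
```

A return value of 0 gives the caller a chance to report an overlong line instead of silently treating the remainder as a new line.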