I'm relatively new to C/C++ and am currently trying to use it to parse large, formatted text files of numerical data into arrays, so that I can work with them using the LAPACK library.
The text files I am parsing have a very simple format: a 5-line header followed by 50 lines of values (two columns each), then the next 5-line header and 50 lines of values, repeated roughly 1 million times:
5 line header
1.000000E+00 2.532093E+02
2.000000E+00 7.372978E+02
3.000000E+00 5.690047E+02
My current approach is to use the fscanf function, but I am getting strange results. I'm currently using a very naive approach to skip over the lines containing the header text, but I fear this might be the problem. Or perhaps my use of fscanf is flawed. Here is what I have so far:
    #include <stdio.h>
    #include <stdlib.h>

    float** fmatrix(int m, int n);

    int main() {
        FILE *ifp;
        FILE *ofp;
        char mystring[500];
        int i,j,n;
        //ofp = fopen("newfile.txt","w");
        ifp = fopen("results","r");
        if (ifp != NULL) {
            //Test with 10 result blocks each containing 50 frequency values
            float** A = fmatrix(50,10);
            for (j=0; j<10; j++) {
                //fscanf(ifp, "%*[^\n]\n", NULL);
                //fscanf(ifp, "%*[^\n]\n", NULL);
                //fscanf(ifp, "%*[^\n]\n", NULL);
                //fscanf(ifp, "%*[^\n]\n", NULL);
                //fscanf(ifp, "%*[^\n]\n", NULL);
                //using fgets w/ printf to see contents of "discarded" lines
                fgets(mystring,500,ifp); printf("%s",mystring);
                fgets(mystring,500,ifp); printf("%s",mystring);
                fgets(mystring,500,ifp); printf("%s",mystring);
                fgets(mystring,500,ifp); printf("%s",mystring);
                fgets(mystring,500,ifp); printf("%s",mystring);
                for (i=0; i<50; i++) {
                    //skip over first float, store the next float into A[i][j]
                    n=fscanf(ifp," %*e %E", &A[i][j]);
                    printf("A[%i][%i]: %E, %i\n",i,j,A[i][j],n);
                }
            }
        }
        return 0;
    }

    float** fmatrix(int m, int n) {
        //Return an m x n Matrix
        int i;
        float** A = (float**)malloc(m*sizeof(float*));
        A[0] = (float*)malloc(m*n*sizeof(float));
        for (i = 1; i < m; i++) {
            A[i] = A[i-1]+n;
        }
        return A;
    }
The result is curious. I get a 50-component column vector whose values match the results file, then 50 zeros as the second column vector, while the third column vector corresponds to the second set of values in my results file, and so on. That is, I get alternating columns of zeros and non-zero values in my matrix. I later replaced the fscanf skip lines with the fgets/printf lines to see what was going on, and to my surprise, some of the lines being discarded contained numeric data, not just header text.
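One thing I am unsure about, and which might be relevant: as far as I understand, a numeric conversion in fscanf stops right after the last character of the number, so the newline ending the 50th data line would still be in the stream when the next round of fgets calls starts. Below is a minimal sketch to check this on a string instead of a file. The function name next_char_after_pair is made up for illustration, and I am assuming sscanf follows the same conversion rules as fscanf.

```c
#include <stdio.h>

/* Hypothetical helper: scans one "x y" pair from text with the same format
   string as the loop above and returns the character that would be read
   next ('\0' if the scan failed). */
char next_char_after_pair(const char *text, float *value)
{
    int consumed = 0;
    /* %n stores the number of characters consumed so far; it is not
       counted in the return value of sscanf. */
    if (sscanf(text, " %*e %E%n", value, &consumed) != 1)
        return '\0';
    return text[consumed];
}
```

If this returns '\n' for a data line, the first fgets after each block of 50 values would only pick up the rest of the last data line, and the header skipping would be off by one line from the second block onward.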
Does anyone have an idea what is, or could be, wrong here? Since this is such a simple format, I really don't know where the problem could lie. A related question: what is the preferred method for skipping over header text? The method I am using is practically single-use, since any change in the header or file format would render the code worthless. Would it make sense to read each line with fgets, check whether it matches the data part of the file, and skip any line that does not match the 2-column pattern?
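To make the fgets idea concrete, here is a minimal sketch of what I have in mind. The helper names parse_data_line and read_values are hypothetical, and I am assuming that no line is longer than 511 characters and that a header line never consists of exactly two numbers (otherwise it would be mistaken for data).

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 and stores the second column in *value
   if the line holds exactly two floating-point fields, otherwise 0. */
int parse_data_line(const char *line, float *value)
{
    float first;  /* first column, parsed only to validate the line */
    char extra;   /* catches any non-whitespace text after the numbers */
    /* sscanf returns the number of successful conversions; exactly 2 means
       two numbers followed by nothing but whitespace. */
    return sscanf(line, " %f %f %c", &first, value, &extra) == 2;
}

/* Hypothetical helper: reads up to max_values second-column values from fp,
   skipping every line that is not a two-column data line. */
size_t read_values(FILE *fp, float *out, size_t max_values)
{
    char line[512];  /* assumption: lines fit in this buffer */
    size_t count = 0;
    while (count < max_values && fgets(line, sizeof line, fp) != NULL) {
        float v;
        if (parse_data_line(line, &v))
            out[count++] = v;
    }
    return count;
}
```

With this, the number of header lines would no longer be hard-coded: calling read_values(ifp, column, 50) once per block would fill one block of 50 values regardless of how many non-matching lines precede it.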
A final question regarding performance: bugs aside, is fscanf the best way to proceed here? As I mentioned earlier, these files can run to several hundred million lines, and I'm not well enough versed in C/C++ to know whether there are faster ways of reading that many lines into matrices or vectors.
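One alternative I have come across is to read each line with fgets and convert the fields with strtod rather than going through a scanf format string. I have not benchmarked this, so whether it is actually faster is an open question on my part. A minimal sketch, where parse_pair_strtod is a name I made up and the value is stored as a double rather than a float:

```c
#include <stdlib.h>

/* Hypothetical helper: parses "x y" from line with strtod and stores y in
   *value. Returns 1 on success, 0 if the line does not start with two
   numbers. */
int parse_pair_strtod(const char *line, double *value)
{
    char *end;
    const char *second;
    double y;

    (void)strtod(line, &end);   /* first column, discarded */
    if (end == line)
        return 0;               /* no number at the start of the line */

    second = end;
    y = strtod(second, &end);   /* strtod skips leading whitespace itself */
    if (end == second)
        return 0;               /* no second number */

    *value = y;
    return 1;
}
```

Unlike the sscanf check, this sketch does not reject trailing text after the second number, so it is a looser test of the 2-column pattern.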
I hope I have provided enough information here to make my question clear. If need be, I can post excerpts of my results files here.