I am working on a machine learning application where my features are stored in huge text files. The way I have currently implemented the data input is way too slow to be practical. Each line of the text file represents a feature vector in sparse format. For instance, the following example contains three feature vectors in index:value form.
1:0.34 2:0.67 6:0.99 12:2.1 28:2.1
2:0.12 22:0.27 26:9.8 69:1.8
3:0.24 4:67.0 7:1.9 13:8.1 18:1.7 32:3.4
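For reference, SVECTOR, WORD and create_svector come from SVM-light. Simplified, the structures I am filling look roughly like this (a WORD with wnum == 0 terminates each vector; the exact field list and types depend on the SVM-light build):

typedef struct word {
    FNUM wnum;      /* feature index */
    FVAL weight;    /* feature value */
} WORD;

typedef struct svector {
    WORD *words;    /* index:value pairs, terminated by a WORD with wnum == 0 */
    char *userdefined;
    double factor;
    /* ... further SVM-light fields ... */
} SVECTOR;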
This is how I am doing the reads at the moment. Since I don't know the length of a feature line beforehand, I read into a buffer large enough to hold any line. Once I have read a line from the file, I use strtok_r to split it into index:value pairs and then process them into a sparse array. Any ideas on how to speed this up are highly appreciated.
FILE *fp = fopen(feature_file, "r");
if (!fp) die("Could not open feature file.");

char line[1000000];          /* buffer large enough for the longest line */
size_t ln;
char *pair, *single, *brkt, *brkb;
WORD *words = NULL;          /* index:value pairs of the current line */
int fvec_length = 0;
int i = 0;                   /* index of the current feature vector */
int j = 0;

/* n_fvecs (the number of lines) is known in advance */
SVECTOR **fvecs = (SVECTOR **) malloc(n_fvecs * sizeof(SVECTOR *));
if (!fvecs) die("Memory Error.");

while (fgets(line, sizeof(line), fp)) {
    /* strip the trailing newline */
    ln = strlen(line) - 1;
    if (line[ln] == '\n')
        line[ln] = '\0';

    fvec_length = 0;
    /* split the line into "index:value" tokens */
    for (pair = strtok_r(line, " ", &brkt); pair; pair = strtok_r(NULL, " ", &brkt)) {
        fvec_length++;
        words = (WORD *) realloc(words, fvec_length * sizeof(WORD));
        if (!words) die("Memory error.");

        /* split one token into index (wnum) and value (weight) */
        j = 0;
        for (single = strtok_r(pair, ":", &brkb); single; single = strtok_r(NULL, ":", &brkb)) {
            if (j == 0)
                words[fvec_length - 1].wnum = atoi(single);
            else
                words[fvec_length - 1].weight = atof(single);
            j++;
        }
    }

    /* append the zero WORD that marks the end of the vector */
    fvec_length++;
    words = (WORD *) realloc(words, fvec_length * sizeof(WORD));
    if (!words) die("Memory error.");
    words[fvec_length - 1].wnum = 0;
    words[fvec_length - 1].weight = 0.0;

    fvecs[i] = create_svector(words, "", 1);
    i++;
    free(words);
    words = NULL;
}
fclose(fp);
return fvecs;
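One direction I have been considering is to replace the strtok_r / atoi / atof calls and the per-pair realloc with a single pass over the line using strtol and strtod (from <stdlib.h>), reusing one WORD buffer that only grows when it runs out of space. A rough, untested sketch of what I mean (parse_line and the capacity handling are just illustrative, assuming the same WORD type and die() helper as above):

/* Parse one line of "index:value" pairs into *words using strtol/strtod,
   growing the buffer geometrically instead of realloc'ing per pair.
   Returns the number of pairs parsed (excluding the terminating zero WORD). */
static int parse_line(char *line, WORD **words, int *capacity)
{
    int n = 0;
    char *p = line;

    /* make sure there is at least room for the terminating zero WORD */
    if (*capacity == 0) {
        *capacity = 64;
        *words = (WORD *) realloc(*words, (*capacity) * sizeof(WORD));
        if (!*words) die("Memory error.");
    }

    while (*p) {
        char *end;
        long idx = strtol(p, &end, 10);     /* feature index */
        if (end == p || *end != ':')
            break;                          /* not an index:value token, stop */
        p = end + 1;
        double val = strtod(p, &end);       /* feature value */
        p = end;

        /* grow geometrically, always keeping room for the sentinel */
        if (n + 1 >= *capacity) {
            *capacity *= 2;
            *words = (WORD *) realloc(*words, (*capacity) * sizeof(WORD));
            if (!*words) die("Memory error.");
        }
        (*words)[n].wnum = idx;
        (*words)[n].weight = val;
        n++;

        while (*p == ' ')                   /* skip separating spaces */
            p++;
    }

    /* zero WORD marks the end of the vector for create_svector */
    (*words)[n].wnum = 0;
    (*words)[n].weight = 0.0;
    return n;
}

The per-line loop would then call parse_line(line, &words, &capacity) once, pass words to create_svector as before, and keep reusing the same buffer across lines instead of freeing it after every line. Would something along these lines help, or are there better approaches?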