The last time update: my classmate uses fread()
to read about one third of the whole file into a string, this can avoid lacking of memory. Then process this string, separate this string into your data structure. Notice, you need to care about one problem: at the end of this string, these last several characters may cannot consist one whole number. Think about one way to detect this situation so you can connect these characters with the first several characters of the next string.
Each number is corresponding to different variable in your data structure. Your data structure should be very simple because each time if you insert your data into one data structure, it is very slow. The most of time is spent on inserting data into data structure. Therefore, the fastest way to process these data is: using fread()
to read this file into a string, separate this string into different one-dimensional arrays.
For example(just an example, not come from my project), I have a text file, like:
72 24 20
22 14 30
23 35 40
42 29 50
19 22 60
18 64 70
.
.
.
Each row is one person's information. The first column means the person's age, the second column is his deposit, the second is his wife's age.
Then we use fread()
to read this text file into string, then I use stroke()
to separate it(you can use faster way to separate it).
Don't use data structure to store the separated data!
I means, don't do like this:
struct person
{
int age;
int deposit;
int wife_age;
};
struct person *my_data_store;
my_data_store=malloc(sizeof(struct person)*length_of_this_array);
//then insert separated data into my_data_store
Don't use data structure to store data! The fastest way to store your data is like this:
int *age;
int *deposit;
int *wife_age;
age=(int*)malloc(sizeof(int)*age_array_length);
deposit=(int*)malloc(sizeof(int)*deposit_array_length);
wife_age=(int*)malloc(sizeof(int)*wife_array_length);
// the value of age_array_length,deposit_array_length and wife_array_length will be known by using `wc -l`.You can use wc -l to get the value in your C program
// then you can insert separated data into these arrays when you use `stroke()` to separate them.
The second update: The best way is to use freed()
to read part of the file into a string, then separate these string into your data structure. By the way, don't use any standard library function which can format string into integer , that's to slow, like fscanf() or atoi()
, we should write our own function to transfer a string into n integer. Not only that, we should design a more simpler data structure to store these data. By the way, my classmate can read this 1.7G file within 7 seconds. There is a way can do this. That way is much better than using multithread. I haven't see his code, after I see his code, I will update the third time to tell you how could hi do this. That will be two months later after our course finished.
Update: I use multithread to solve this problem!! It works! Notice: don't use clock() to calculate the time when using multithread, that's why I thought the time of execution increases.
One thing I want to clarify is that, the time of reading the file without storing the value into my structure is about 20 seconds. The time of storing the value into my structure is about 60 seconds. The definition of "time of reading the file" includes the time of read the whole file and store the value into my structure. the time of reading the file = scan the file + store the value into my structure. Therefore, have some suggestions of storing value faster ? (By the way, I don't have control over the inout file, it is generated by our professor. I am trying to use multithread to solve this problem, if it works, I will tell you the result.)
I have a file, its size is 1.7G. It looks like:
1 1427826
1 1427827
1 1750238
1 2
2 3
2 4
3 5
3 6
10 7
11 794106
.
.
and son on.
It has about ten millions of lines in the file. Now I need to read this file and store these numbers in my data structure within 15 seconds.
I have tried to use freed()
to read whole file and then use strtok()
to separate each number, but it still need 80 seconds. If I use fscanf()
, it will be slower. How do I speed it up? Maybe we cannot make it less than 15 seconds. But 80 seconds to read it is too long. How to read it as fast as we can?
Here is part of my reading code:
int Read_File(FILE *fd,int round)
{
clock_t start_read = clock();
int first,second;
first=0;
second=0;
fseek(fd,0,SEEK_END);
long int fileSize=ftell(fd);
fseek(fd,0,SEEK_SET);
char * buffer=(char *)malloc(sizeof(char)*fileSize);
char *string_first;
long int newFileSize=fread(buffer,1,fileSize,fd);
char *string_second;
while(string_first!=NULL)
{
first=atoi(string_first);
string_second=strtok(NULL," \t\n");
second=atoi(string_second);
string_first=strtok(NULL," \t\n");
max_num= first > max_num ? first : max_num ;
max_num= second > max_num ? second : max_num ;
root_level=first/NUM_OF_EACH_LEVEL;
leaf_addr=first%NUM_OF_EACH_LEVEL;
if(root_addr[root_level][leaf_addr].node_value!=first)
{
root_addr[root_level][leaf_addr].node_value=first;
root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].g_credit[0]=1;
root_addr[root_level][leaf_addr].head->neighbor_value=second;
root_addr[root_level][leaf_addr].head->next=NULL;
root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;
root_addr[root_level][leaf_addr].degree=1;
}
else
{
//insert its new neighbor
Neighbor *newNeighbor;
newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));
newNeighbor->neighbor_value=second;
root_addr[root_level][leaf_addr].tail->next=newNeighbor;
root_addr[root_level][leaf_addr].tail=newNeighbor;
root_addr[root_level][leaf_addr].degree++;
}
root_level=second/NUM_OF_EACH_LEVEL;
leaf_addr=second%NUM_OF_EACH_LEVEL;
if(root_addr[root_level][leaf_addr].node_value!=second)
{
root_addr[root_level][leaf_addr].node_value=second;
root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].head->neighbor_value=first;
root_addr[root_level][leaf_addr].head->next=NULL;
root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;
root_addr[root_level][leaf_addr].degree=1;
root_addr[root_level][leaf_addr].g_credit[0]=1;
}
else
{
//insert its new neighbor
Neighbor *newNeighbor;
newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));
newNeighbor->neighbor_value=first;
root_addr[root_level][leaf_addr].tail->next=newNeighbor;
root_addr[root_level][leaf_addr].tail=newNeighbor;
root_addr[root_level][leaf_addr].degree++;
}
}