I have 1 set of raw data file(s), each has 8 millions~9 millions lines (yes, 8,000,000~9,000,000) in the following format,
1,2,3,4,5,16,23,35
1,2,3,4,6,17,23,36
1,2,3,4,7,18,23,37
1,2,3,4,8,19,23,38
1,2,3,4,9,20,23,39
1,2,3,4,10,21,23,40
1,2,3,4,11,22,23,41
1,2,3,4,12,23,24,42
1,2,3,4,13,24,25,43
1,2,3,4,14,25,26,44
Each line has 8 sorted numbers and range from 1~49. Another set of "filter" file(s) each has 6 millions ~ 7 millions line in the following format,
13,4,7,8,18,20
9,10,11,12,5,6,7,8,1,2,3,4,21,22,23,24,13,14,15,16,29,30,31,32,45,46,47,48
29,49,36,37,34,17,15,9,16,30,28,47,46,27,20,32,14,26,1,4,3,6,10,2,7,48,44,41
Each line has 4~28 non sorted numbers and range 1~49 I need to compare each line from "raw data" file with every lines in "filter" file and get the intersection value, e.g. line 1 in raw with line 1~3 in filter
1 // since only 4 is in common with filter line 1
7 // since only 35 not found in filter line 2
6 // since 5 23 35 not found in filter line 3
After the comparsion, will output the result according to the threshold value. e.g.
output raw data line with intersection value >= 2,
output raw data line with intersection value == 4
I knew that there are (at most) 9 millions x 8 millions line comparsions. At first, I try using set_intersection to do the job but it takes forever to do the task (the filter line is sorted before pass to set_intersection).
int res[8];
int *it = set_intersection(Raw.Data, Raw.Data+8, FilterVal.begin(), FilterVal.end(), res);
ds = GetIntersect(GDE.DrawRes, LotArr) * 2;
int IntersectCnt=it-res;
Next, I try build up an array of integer zero:
int ResArr[49] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
and use 3 helper functions:
void InitResArr(int * inResArr, vector<int> & FilterVal) {
for (int i = 0; i < FilterVal.size(); i++) {
inResArr[FilterVal[i] - 1] = 1;
}
}
void ResetResArr(int * inResArr, vector<int> & FilterVal) {
for (int i = 0; i < FilterVal.size(); i++) {
inResArr[FilterVal[i] - 1] = 0;
}
}
int GetIntersect(int * inResArr, int * inRawData) {
int RtnVal = 0;
for (int i = 0; i < 8; i++) {
RtnVal+=inResArr[inRawData[i] - 1];
}
But this approach still take over 3 hrs to finish 1 comparsion (1 raw data file with 1 filter). And I have 5,000 raw data files and 40,000 filters to go!!! Is there any other better approach to handle this task ? Thanks.
Regds
LAM Chi-fung