
I have 1 set of raw data file(s); each has 8~9 million lines (yes, 8,000,000~9,000,000) in the following format:

1,2,3,4,5,16,23,35
1,2,3,4,6,17,23,36
1,2,3,4,7,18,23,37
1,2,3,4,8,19,23,38
1,2,3,4,9,20,23,39
1,2,3,4,10,21,23,40
1,2,3,4,11,22,23,41
1,2,3,4,12,23,24,42
1,2,3,4,13,24,25,43
1,2,3,4,14,25,26,44

Each line has 8 sorted numbers in the range 1~49. Another set of "filter" file(s) each has 6~7 million lines in the following format:

13,4,7,8,18,20
9,10,11,12,5,6,7,8,1,2,3,4,21,22,23,24,13,14,15,16,29,30,31,32,45,46,47,48
29,49,36,37,34,17,15,9,16,30,28,47,46,27,20,32,14,26,1,4,3,6,10,2,7,48,44,41

Each line has 4~28 unsorted numbers in the range 1~49. I need to compare each line from the "raw data" file with every line in the "filter" file and get the intersection value, e.g. line 1 in raw against lines 1~3 in filter:

1  // since only 4 is in common with filter line 1
7  // since only 35 not found in filter line 2
6  // since 5 23 35 not found in filter line 3    

After the comparison, I will output the result according to a threshold value, e.g.

output raw data line with intersection value >= 2,
output raw data line with intersection value == 4

I know that there are (at most) 9 million × 7 million line comparisons. At first, I tried using set_intersection to do the job, but it takes forever (the filter line is sorted before being passed to set_intersection):

int res[8];
int *it = set_intersection(Raw.Data, Raw.Data + 8,
                           FilterVal.begin(), FilterVal.end(), res);
int IntersectCnt = it - res; // number of common values

Next, I tried building up an array of integer zeros:

int ResArr[49] = {0}; // all 49 entries zero-initialized

and used 3 helper functions:

// Mark each filter value's slot (values are 1-based, the array is 0-based).
void InitResArr(int * inResArr, vector<int> & FilterVal) {
    for (size_t i = 0; i < FilterVal.size(); i++) {
        inResArr[FilterVal[i] - 1] = 1;
    }
}

// Clear only the slots that were set, so the array can be reused.
void ResetResArr(int * inResArr, vector<int> & FilterVal) {
    for (size_t i = 0; i < FilterVal.size(); i++) {
        inResArr[FilterVal[i] - 1] = 0;
    }
}

// Sum the marks for the 8 raw-data values: the sum is the intersection count.
int GetIntersect(int * inResArr, int * inRawData) {
    int RtnVal = 0;
    for (int i = 0; i < 8; i++) {
        RtnVal += inResArr[inRawData[i] - 1];
    }
    return RtnVal;
}
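
Presumably the three helpers are driven by a doubly nested loop like this (a sketch of my own; the container names, threshold, and output routine are assumptions, not code from the question):

#include <array>
#include <vector>
using namespace std;

// Hypothetical in-memory copies of one raw file and one filter file.
vector<array<int, 8>> AllRawLines;
vector<vector<int>> AllFilterLines;
int Threshold = 2; // hypothetical threshold

void CompareAll(int * ResArr) {
    // Mark once per filter line, test every raw line, then unmark,
    // so the Init/Reset cost is amortized over the whole raw file.
    for (vector<int> & FilterVal : AllFilterLines) {
        InitResArr(ResArr, FilterVal);
        for (array<int, 8> & RawLine : AllRawLines) {
            if (GetIntersect(ResArr, RawLine.data()) >= Threshold) {
                // output RawLine here
            }
        }
        ResetResArr(ResArr, FilterVal);
    }
}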

But this approach still takes over 3 hrs to finish 1 comparison (1 raw data file with 1 filter file). And I have 5,000 raw data files and 40,000 filter files to go!!! Is there any better approach to handle this task? Thanks.

Regds

LAM Chi-fung

  • it would help if also the filter would be sorted, maybe you can add that as a preprocessing step – 463035818_is_not_an_ai May 03 '18 at 15:33
  • consider using bitfields -- you only have numbers that range from 1 to 49, so a 64 bit integer would be sufficient to encode all possible values per line. Intersection would be bitwise AND. Cardinality of the intersection set would be [popcount](https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer); see the sketch after these comments. – MFisherKDX May 03 '18 at 15:44
  • I suggest you profile your code and work out what part exactly is your bottleneck. – UKMonkey May 03 '18 at 15:44
  • Dude! You've got at least 8e6 lines to compare with 6e6 others. That means 48e12 comparisons! Even if each comparison takes 1ns = 1e-9s, it would take you 48e3s, which is about 13h. I'm surprised you managed to go as low as 3h! – YSC May 03 '18 at 15:48
  • also, this is highly parallelizable, so if you have a multi-core system you might consider this. – MFisherKDX May 03 '18 at 15:49
  • It's a minor OT detail, but *"6 // since 5 23 35 not found in filter line 3"* shouldn't it be **5**, instead (since 1,2,3,4,16 can be found in filter line 3)? – Bob__ May 04 '18 at 09:10
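
A minimal sketch of the bitfield idea from the comment above, assuming each line has already been parsed into a vector<int> (function names are illustrative, not from the question):

#include <bit>      // std::popcount (C++20); __builtin_popcountll on older GCC/Clang
#include <cstdint>
#include <vector>

// Encode one line as a 64-bit mask: bit v is set iff value v (1..49) appears.
uint64_t EncodeLine(const std::vector<int> &vals) {
    uint64_t mask = 0;
    for (int v : vals)
        mask |= uint64_t(1) << v; // bits 1..49 used, well within 64
    return mask;
}

// Intersection cardinality becomes one AND plus one popcount per pair.
int IntersectCount(uint64_t rawMask, uint64_t filterMask) {
    return std::popcount(rawMask & filterMask);
}

With every raw and filter line encoded once up front, each pair comparison shrinks to a couple of machine instructions, and the outer loop over raw lines parallelizes trivially across cores, as the comments also suggest.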

1 Answer


Not sure how well it'll work for your case (it was hard to understand exactly what you wanted from your description), but I've thought of the following algorithm:

Sort your long rows. This can be done in O(n), where n is the length of a single row, by simple counting, since every value lies in 1~49 (sketched below).
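
For illustration, a counting sort over the bounded value range might look like this (my sketch, not the answerer's code):

#include <vector>

// Counting sort for values in 1..49: O(n + 49) per row, no comparisons.
void CountingSort(std::vector<int> &row) {
    int count[50] = {0};          // count[v] = occurrences of value v
    for (int v : row)
        ++count[v];
    size_t i = 0;
    for (int v = 1; v <= 49; ++v) // write values back in ascending order
        while (count[v]--)
            row[i++] = v;
}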

After that, for every number in the filter row, do a binary search on the sorted row (sketched below). That'll be O(m * log(n)) per pair of rows, where m is the number of elements in the filter row. It should be a big improvement over your O(m*n) (to be precise, you also need to multiply these per-pair costs by the number of data rows times the number of filter rows).
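
A possible shape for that per-pair test, using std::binary_search against the sorted 8-number data row (again a sketch; names are illustrative):

#include <algorithm>
#include <vector>

// Count how many of the filter row's values occur in a sorted 8-number data row.
int IntersectBySearch(const int *rawLine, const std::vector<int> &filterRow) {
    int cnt = 0;
    for (int v : filterRow)
        if (std::binary_search(rawLine, rawLine + 8, v))
            ++cnt;
    return cnt;
}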

Also, pay attention to your I/O; after the algorithm is updated, it might become the next bottleneck (if you are using iostreams, don't forget std::ios::sync_with_stdio(false)).
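
For reference, the usual fast-iostream setup (standard calls, shown only as a reminder):

#include <iostream>

int main() {
    std::ios::sync_with_stdio(false); // stop syncing iostreams with C stdio
    std::cin.tie(nullptr);            // don't flush cout before every cin read
    // ... read and process the files here ...
}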

Dan M.
  • If you are voting down, at least explain why in the comments. So far it's the only answer and there was no further communication from the author so I'm not sure whether I understood the question wrong or if it's something else. – Dan M. May 11 '18 at 17:25