14

What is the most efficient for speed algorithm to solve the following problem?

Given 6 arrays, D1,D2,D3,D4,D5 and D6 each containing 6 numbers like:

D1[0] = number              D2[0] = number      ......       D6[0] = number
D1[1] = another number      D2[1] = another number           ....
.....                       ....                ......       ....
D1[5] = yet another number  ....                ......       ....

Given a second array ST1, containing 1 number:

ST1[0] = 6

Given a third array ans, containing 6 numbers:

ans[0] = 3, ans[1] = 4, ans[2] = 5, ......ans[5] = 8

Using as index for the arrays D1,D2,D3,D4,D5 and D6, the number that goes from 0, to the number stored in ST1[0] minus one, in this example 6, so from 0 to 6-1, compare the ans array against each D array. The result should be 0 if one or more ans numbers are not found in any D at the same index and should be 1 if all ans numbers are found in some D at the same index. That is, return 0 if some ans[i] doesn't equal any DN[i] and return 1 if every ans[i] equals some DN[i].

My algorithm so far is:
I tried to keep everything unlooped as much as possible.

EML  := ST1[0]   //number contained in ST1[0]   
EML1 := 0        //start index for the arrays D 

While EML1 < EML
   if D1[ELM1] = ans[0] 
     goto two
   if D2[ELM1] = ans[0] 
     goto two
   if D3[ELM1] = ans[0] 
     goto two
   if D4[ELM1] = ans[0] 
     goto two
   if D5[ELM1] = ans[0] 
     goto two
   if D6[ELM1] = ans[0] 
     goto two

   ELM1 = ELM1 + 1

return 0     //If the ans[0] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers


two:

EML1 := 0      start index for arrays Ds 
While EML1 < EML
   if D1[ELM1] = ans[1] 
     goto three
   if D2[ELM1] = ans[1] 
     goto three
   if D3[ELM1] = ans[1] 
     goto three
   if D4[ELM1] = ans[1] 
     goto three
   if D5[ELM1] = ans[1] 
     goto three
   if D6[ELM1] = ans[1] 
     goto three
   ELM1 = ELM1 + 1

return 0    //If the ans[1] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers

three:

EML1 := 0      start index for arrays Ds 

While EML1 < EML
   if D1[ELM1] = ans[2] 
     goto four
   if D2[ELM1] = ans[2] 
     goto four
   if D3[ELM1] = ans[2] 
     goto four
   if D4[ELM1] = ans[2] 
     goto four
   if D5[ELM1] = ans[2] 
     goto four
   if D6[ELM1] = ans[2] 
     goto four
   ELM1 = ELM1 + 1

return 0   //If the ans[2] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers

four:

EML1 := 0      start index for arrays Ds 

While EML1 < EML
   if D1[ELM1] = ans[3] 
     goto five
   if D2[ELM1] = ans[3] 
     goto five
   if D3[ELM1] = ans[3] 
     goto five
   if D4[ELM1] = ans[3] 
     goto five
   if D5[ELM1] = ans[3] 
     goto five
   if D6[ELM1] = ans[3] 
     goto five
   ELM1 = ELM1 + 1

return 0 //If the ans[3] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers


five:

EML1 := 0      start index for arrays Ds 

While EML1 < EML
   if D1[ELM1] = ans[4] 
     goto six
   if D2[ELM1] = ans[4] 
     goto six
   if D3[ELM1] = ans[4] 
     goto six
   if D4[ELM1] = ans[4] 
     goto six
   if D5[ELM1] = ans[4] 
     goto six
   if D6[ELM1] = ans[4] 
     goto six
   ELM1 = ELM1 + 1

return 0  //If the ans[4] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers

six:

EML1 := 0      start index for arrays Ds 

While EML1 < EML
   if D1[ELM1] = ans[5] 
     return 1            ////If the ans[1] number is not found in either D1[0-6].....  
   if D2[ELM1] = ans[5]      return 1 which will then include ans[0-6] numbers
     return 1
   if D3[ELM1] = ans[5] 
     return 1
   if D4[ELM1] = ans[5] 
     return 1
   if D5[ELM1] = ans[5] 
     return 1
   if D6[ELM1] = ans[5] 
     return 1
   ELM1 = ELM1 + 1

return 0 

As language of choice, it would be pure c

outis
  • 75,655
  • 22
  • 151
  • 221
Mark
  • 195
  • 2
  • 11
  • An "array" containing 1 number? What a waste. – kennytm May 05 '10 at 18:29
  • for creating algorithm you could decrease number of arrays to two instead of 6. – Andrey May 05 '10 at 18:29
  • 13
    I think your programming skills are very basic. It's very likely that what you want to do can be done much easier. Please write more about what you want to do with this code (what do the arrays represent and which information do you want to extract from them), this might clarify things and lead to more answers. – schnaader May 05 '10 at 18:38
  • 21
    Oh come on guys. For a first time user, he has clearly put a lot of effort in trying to format and phrase his question as good as possible. +1 – Lieven Keersmaekers May 05 '10 at 19:03
  • i don't like using goto in my programs, my professor was giving -10 points, Avoid using it. – berkay May 05 '10 at 19:06
  • 4
    Agree with Lieven... Even for beginners, we don't want anyone to feel uncomfortable about asking a question, especially a legit one even if educational / learning. How else does someone get to become a stronger developer without having real-world developers to network with. – DRapp May 05 '10 at 19:14
  • Ok, sorry about my skills, but I clearly put what I want to do, that was the algorithm implemented by assembler one compiled, I can not find a better way to implement it. – Mark May 05 '10 at 19:16
  • @mark That explains a lot. You should have said that was why your question was offering an algorithm as a description of the problem. – Pascal Cuoq May 05 '10 at 19:40
  • 9
    @mark: I'd like to apologize for the way that the Stackoverflow's police dept. has treated you. – Tom May 05 '10 at 19:43
  • Do not worry I'm not offended, what I meant is that I have implemented first in c, saw how assembler translated it and then put as algorithm, so I thought that was the best way of implemented it, assembler uses goto, so I dont see anything wrong with them, I tried with bool and break instead of goto, but did not work well – Mark May 05 '10 at 19:58
  • 2
    Could you explain what you want more? In the algorithm you give, I think only the first two loops can ever run, because in all of the loops, either the loop will end and the code will return, or the loop will hit a `goto two` and will go to the second one. Also, when you say "compare each res array against each D array", what should the program do with the comparisons? Do you want to print a series of strings "greater", "less than", etc., or do you want to quit if you hit numbers that are equal, or something else? – Noah Lavine May 05 '10 at 20:56
  • no all the loop are run if at least one number of the array ans is found in each while loop (while loop one,two,three, four, five, six) the return 0 means, exclude the number of ans[0],ans[1],ans[2],ans[3],ans[4],ans[5], while the return 1 means include those numbers – Mark May 05 '10 at 21:20
  • @noahlavine: Most probably OP used copy/past and forgot to change the goto target label. I belive it should be goto two in the first block, goto three in the second block, goto four in the third and so on. Could mark confirm this is the case (and edit his question accordingly) ? – kriss May 05 '10 at 21:36
  • no, that is not the case, two, is the second while loop, three is the third while loop and so on – Mark May 05 '10 at 21:46
  • @Mark: labels are OK, but most gotos are probably going to the wrong label. All gotos going to `two` the program will either faile (return 0) in the first block either enter an infinite loop. What would be logical is going yo block three when in block two, going in block four when in block three and so on... always goto going to next blocK – kriss May 05 '10 at 23:10
  • @Mark: the `goto two` issue is cleared up, though not fixed (but give it a moment), but you still haven't described what the results should be. The code sample implicitly defines results, but they may not match the intended ones. Also, you mention "res arrays" but don't define them. Fix these issues by editing the question (rather than replying in the comments–SO is for Q&A, not discussing). – outis May 06 '10 at 01:52
  • 1
    1) write a program in c. 2) study the compiled code in assembly. 3) rewrite the original c program in assembly way. the so-called fastest code is still 'suggested' by an optimizer. what's the point? – ohho May 06 '10 at 01:55
  • @mark: the description of the output doesn't match the sample code. The sample returns 1 iff ∀ 0 ≤ i < ST[0], ∃ 1 ≤ N ≤ 6 such that ans[i] = DN[i] (that is, all items in `ans` are an item of some `DN` with the same index). You have two different descriptions. The first says return 1 iff ∃ 1 ≤ N ≤ 6 such that ans[ST[0]-1] = DN[ST[0]-1] (that is, the last entry of `ans` is the last entry of some `DN`). – outis May 06 '10 at 05:41
  • ... The restatement ("Namely, ...") says to return 1 iff with k=ST[0], (∃ N such that ans[k-1] = DN[k-1]) and (∀ 0 ≤ i < ST[0]-1 and 1 ≤ N < k-1, ans[i] ≠ DN[i]) (that is, the last item of `ans` is the last item of some `DN` and the other items of `ans` don't match any item of any `DN` at the same index). Which of the three options is the desired one? – outis May 06 '10 at 05:42
  • sorry, I'm lost, what was the problem again? – fortran May 06 '10 at 06:58
  • @outis, sorry I made a mistake, I corrected it now – Mark May 06 '10 at 18:52

4 Answers4

7

I did a direct trivial C implementation of the algorithm provided by the original poster. It is here

As other proposed the first thing to do is to roll up the code. Unrolling is not really good for speed as it lead to code cache misses. I began by rolling inner loops and got this. Then I rolled the outer loop and removed the now useless gotos and got code below.

EDIT: I changed several time the C code because even as simple as it is there seems to be problems when JIT compiling or executing it with CUDA (and CUDA seems not to be very verbose about errors). That is why the piece of code below use globals... and that is just trivial implementation. We are not yet going for speed. It says much about premature optimization. Why bother to make it fast if we can't even make it work ? I guess there is still issues as CUDA seems to impose many restrictions on the code you can make work if I believe the Wikipedia article. Also maybe we should use float instead of int ?

#include <stdio.h>

int D1[6] = {3, 4, 5, 6, 7, 8};
int D2[6] = {3, 4, 5, 6, 7, 8};
int D3[6] = {3, 4, 5, 6, 7, 8};
int D4[6] = {3, 4, 5, 6, 7, 8};
int D5[6] = {3, 4, 5, 6, 7, 8};
int D6[6] = {3, 4, 5, 6, 7, 9};
int ST1[1] = {6};
int ans[6] = {1, 4, 5, 6, 7, 9};
int * D[6] = { D1, D2, D3, D4, D5, D6 };

/* beware D is passed through globals */
int algo(int * ans, int ELM){
    int a, e, p;

    for (a = 0 ; a < 6 ; a++){ 
        for (e = 0 ; e < ELM ; e++){
            for (p = 0 ; p < 6 ; p++){
                if (D[p][e] == ans[a]){
                    goto cont;
                }
            }
        }
        return 0; //bad row of numbers found
    cont:;
    }
    return 1;
}

int main(){
    int res;
    res = algo(ans, ST1[0]);
    printf("algo returned %d\n", res);
}

Now that is interesting, because we can understand what code is doing. By the way doing this packing job I corrected several oddities in the original question. I believe it was typos as it wasn't logical at all in the global context. - goto always jump to two (it should have progressed) - the last test checked ans[0] instead of ans[5]

please Mark, correct me if I'm wrong in the above asumptions on what the original code should do and your original algorithm is typo free.

What the code is doing it for each value in ans check it is present in a two dimentional array. If any number miss it returns 0. If all numbers are found it returns 1.

What I would do to get a real fast code is not to implement it in C but in another language like python (or C++) where set is a basic data structure provided by standard libraries. Then I would build a set with all the values of the arrays (that is O(n)) and check if numbers searched are present in set or not (that is O(1)). The final implementation should be faster than the existing code at least from an algorithmic point of view.

Python example is below as it is really trivial (print true/false instead of 1/0 but you get the idea):

ans_set = set(ans)
print len(set(D1+D2+D3+D4+D5+D6).intersection(ans_set)) == len(ans_set)

Here is a possible C++ implementation using sets:

#include <iostream>
#include <set>

int algo(int * D1, int * D2, int * D3, int * D4, int * D5, int * D6, int * ans, int ELM){
    int e, p;
    int * D[6] = { D1, D2, D3, D4, D5, D6 };
    std::set<int> ans_set(ans, ans+6);

    int lg = ans_set.size();

    for (e = 0 ; e < ELM ; e++){
        for (p = 0 ; p < 6 ; p++){
            if (0 == (lg -= ans_set.erase(D[p][e]))){
                // we found all elements of ans_set
                return 1;
            }
        }
    }
    return 0; // some items in ans are missing
}

int main(){
    int D1[6] = {3, 4, 5, 6, 7, 8};
    int D2[6] = {3, 4, 5, 6, 7, 8};
    int D3[6] = {3, 4, 5, 6, 7, 8};
    int D4[6] = {3, 4, 5, 6, 7, 8};
    int D5[6] = {3, 4, 5, 6, 7, 8};
    int D6[6] = {3, 4, 5, 6, 7, 1};

    int ST1[1] = {6};

    int ans[] = {1, 4, 5, 6, 7, 8};

    int res = algo(D1, D2, D3, D4, D5, D6, ans, ST1[0]);
    std::cout << "algo returned " << res << "\n";
}

We make some performance hypothesis : the content of ans should be sorted or we should construct it otherwise, we suppose that content of D1..D6 will change between calls to algo. Hence we do not bother constructing a set for it (as set construction is O(n) anyway we wouldn't gain anything if D1..D6 are changing). But if we call several times algo with the same D1..D6 and that is ans that change we should do the opposite and transform D1..D6 into one larger set that we keep available.

If I stick to C I could do it as follow:

  • copy all numbers in D1.. D6 in one unique array (using memcpy for each row)
  • sort content of this array
  • use a dichotomic search to check if number if available

As data size are really small here , we could also try going for micro optimizations. It could pay better here. Don't know for sure.

EDIT2: there is hard restrictions on the subset of C supported by CUDA. The most restrictive one is that we shouldn't use pointers to main memory. That will have to be taken into account. It explains why the current code does not work. The simplest change is probably to call it for every array D1..D6 in turn. To keep it short and avoid function call cost we may use a macro or an inline function. I will post a proposal.

kriss
  • 23,497
  • 17
  • 97
  • 116
  • I am not very familiar with C++, so I do not understand "where set is a basic data structure provided by standard libraries. Then I would build a set with all the values of the arrays (that is O(n)) and check if numbers searched are present in set or not (that is O(1))." – Mark May 05 '10 at 22:47
  • @Mark: I will post the python solution now as it is a piece of cake. It will give you the idea, the C++ one is not much more complicated. C would need more work if we do not have a set library immediately available (a library that implement the concept of set). – kriss May 05 '10 at 22:54
  • @Mark: OK, I got in comments you want to do it in CUDA. It clarify the context. C++ set could be interesting to try as you have a C++ compiler available. Say us if it is faster or slower than current C code, there is other possibilities not yet tried (like micro optimizing to avoid jumps). – kriss May 05 '10 at 23:24
  • @kriss: Unfortunately it did not work, maybe because the int * D[6] = { D1, D2, D3, D4, D5, D6 }; and the D[p][e] which is two dimension, while the above not – Mark May 05 '10 at 23:53
  • @Mark: no the above syntax is ok. Do you speak of the C++ or the C code. I didn't really tried the C code (hence it may contains typos), but I compiled and run the C++ one ans tested it with several values of D1..D6. I'm quite sure it's completely OK, but it may be compilers differences. What is the actual compiler message ? – kriss May 06 '10 at 00:07
  • @Mark: I corrected two typos in C program (line ST1 become `int ST1[1] = {6};` and line cont: become `cont:;`) – kriss May 06 '10 at 00:14
  • Well, because I've implemented it in CUDA, the error is of any use: Runtime API error: Unknown error – Mark May 06 '10 at 00:30
  • @Mark: if the C version works, it is most likely the C++ STL implementation that is bogus or incomplete. The C and C++ code I posted are really very simple. – kriss May 06 '10 at 00:51
  • @Mark: doesn't CUDA return the line number and the source file where the error occurred ? That should be enough to diagnose. – kriss May 06 '10 at 00:54
  • yes, unknown error, I know why, in the code is missing a return 1 if D1[ELM1] = ans[0] return 1 there must be two return 0, and one return 1 (this is when a reach 6) – Mark May 06 '10 at 01:06
  • if D1[ELM1] = ans[5] instead of having a goto, it should have a return 1, for all the other ans, goto – Mark May 06 '10 at 01:23
  • I tried to put, after one: if (a == 5) return 1 but it gives the error I mentioned earlier – Mark May 06 '10 at 01:31
  • @Mark: no need for the return. goto cont will continue with the next iteration of the loop `a` was 5 and become 6, the test is false and exit loop then execute return 1. And this is the only possible path to reach return 1, all other are catches by the return 0. No problem there (good try anyway). It you insert a special case for ans[5] you will just make your code fatter (hence slower) inside the loop. There must be something else. – kriss May 06 '10 at 01:40
  • if you check the original code, you will see that under each while loop (when it finish) there is a return 0, the return one is only when ans[5] is compared again D1,D2,D3,D4,D5,D6 and a match is found – Mark May 06 '10 at 01:43
  • It has to be: return 0; //bad row of numbers, if while ends cont:; return 1 if a[5] catch the goto cont. } return 0; } – Mark May 06 '10 at 01:48
  • sorry is my fault, I should have specified, the code return one only if after checking all the a numbers, the last check found a match. Namely, if after checked a[0],a[1],a[2],a[3],a[4] nothing is found and with a[5] a match is found then return 1 – Mark May 06 '10 at 01:51
  • @mark: 'Runtime Error' is the symptom of a C language support problem, not of a design error in the algorithm. In C code above the only feature used that is not plain old C from prehistorical times is the array initializer. I will change them in the sample C code by moving them to globals (and using globals is awfull but even faster than passing parameters). – kriss May 06 '10 at 06:41
  • @Mark: it is probably not all all a C or C++ issue not an algorithm issue at that point, but a CUDA specific one. The question should be retaged CUDA (I will do that if you don't mind) to attract CUDA experts. For them making it work (and really fast) should be a piece of cake. – kriss May 06 '10 at 07:01
  • CUDA pointers, unlike OpenCL ones, are actually transparent (in the C++ code, one has pointers to the addresses on the CUDA side; not dereferencable, but one is allowed to safely do pointer arithmetics in them and pass pointers to global memory to CUDA kernels). See CUDA programming guide. – the swine Apr 07 '14 at 09:49
1

I was a little bit confused by your question, but I think I have it enough to help you get started.

#define ROW 6
#define COL 6

int D[ROW][COL]; // This is all of your D arrays in one 2 dimensional array.

Next you should probably use nested for loops. Each loop will correspond to a dimension of D. Remember that the order of your indexes matters. The easiest way to keep it straight in C is to remember that D[i] is a valid expression even when D has more than one dimension (and would evaluate to a pointer to a row : a sub array).

If you can't change the independent D arrays to one multidimensional array you could easily make a pointer array whose members point to the heads of each of those arrays and achieve the same effect.

Then you can use the break statement to break out of the inner loop after you have determined that the current D[i] doesn't match the ans.

nategoose
  • 12,054
  • 27
  • 42
  • well I do not want to use 2 dimension array, I need to have 6 different array, and possible as unlooped as possible – Mark May 05 '10 at 19:42
  • What I am interested is to speed up, that code, which I put in algorithm form, and it has to be with 1 dimension arrays – Mark May 05 '10 at 20:29
  • If you compiled it with a compiler that was capable of loop unrolling with that optimization turned on then it would likely result in something similar to what you were trying to achieve with gotos without upsetting your instructor. – nategoose May 05 '10 at 20:59
  • 1
    @mark: also gotos have another effect than upsetting instructors, they empty processor pipe-lines (that's true of any jump, so in this regard loops are no better). But whenever you can express your program without using any kind of goto/branch/jump even if you execute more instructions than strictly necessary you speedup things. I believe there is a track to follow there (will give it a try). – kriss May 05 '10 at 21:21
  • Well I tried not to use goto and use bool variables and break, but it did not work, and besides that means more memory. Is goto real a problem in programming or is it a matter of form? – Mark May 05 '10 at 21:25
  • @mark: Just because you declare a variable doesn't mean that the compiler has to use up memory for it. Modern compilers are very smart. Goto is a problem because using labels and goto often results in unreadable code. If the reader of your code has to constantly look for labels of gotos to understand the code then they are horrible. I'm not totally against them, but I'm not who you have to impress. I'm fairly certain that if you code this in C with loops but compile with high optimization you will end up with good code. If this is in a function then you may want to use the restrict keyword. – nategoose May 05 '10 at 21:53
  • @mark: Another thing you could try that would fit what you have and technically not have "goto" in it would be to put all of this in a `do{}while(0)` and let break be your goto. Or `if (D1[ELM1] != ans[0] && D2[ELM1] != ans[0] && ...) { ELM1++; return 0;}` Also, you should not that with code that does a lot of jumping already trying to avoid loops doesn't save you what you as much as with non-jumpy code. – nategoose May 05 '10 at 22:01
  • Well, I must specify, that I have to implement that in CUDA, and that means more registers if more variables. I tried D1[elm1] != ans[0] && D2[elm1] != ans[0] && ...., but it is slower – Mark May 05 '10 at 22:09
  • I tried lot of combination, some did not work and some were slower than the actual implementation – Mark May 05 '10 at 22:10
  • You should have said CUDA before. That changes things and explains your otherwise confusing rules. Try `if ((D1[ELM1] - ans[0]) & (D2[ELM1] - ans[0]) & ... ) { ELM1++; return 0; }` If it weren't for the return I'd also suggest _trying_ `x=((D1[ELM1]-ans[0]) & ((D1[ELM1]-ans[0])) & ...); ELM1 += x/x;` I suggested the self divide to someone else on here doing GPU work as a possible jumpless way of getting 1 as long as 0/0 results in 0 or a trap, and they said that it worked for them. – nategoose May 06 '10 at 12:57
0

In case the range of the numbers is limited, it would be probably easier to make a bit array, like this:

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    uint32_t bit_mask = 0;
    for(int i = 0; i < 6; ++ i) {
        for(int j = 0; j < ST1; ++ j) {
            assert(arrays[i][j] >= 0 && arrays[i][j] < 32); // range is limited
            bit_mask |= 1 << arrays[i][j];
        }
    }
    // make a "list" of numbers that we have

    for(int i = 0; i < 6; ++ i) {
        if(((bit_mask >> ans[i]) & 1) == 0)
            return 0; // in ans, there is a number that is not present in arrays
    }
    return 1; // all of the numbers were found
}

This will always run in O(6 * ST1 + 6). Now this has the disadvantage of first going through up to 36 arrays and then checking against six values. If there is a strong precondition that the numbers will be mostly present, it is possible to reverse the test and provide an early exit:

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    uint32_t bit_mask = 0;
    for(int i = 0; i < 6; ++ i) {
        assert(ans[i][j] >= 0 && ans[i][j] < 32); // range is limited
        bit_mask |= 1 << ans[i];
    }
    // make a "list" of numbers that we need to find

    for(int i = 0; i < 6; ++ i) {
        for(int j = 0; j < ST1; ++ j)
            bit_mask &= ~(1 << arrays[i][j]); // clear bits of the mask

        if(!bit_mask) // check if we have them all
            return 1; // all of the numbers were found
    }

    assert(bit_mask != 0);
    return 0; // there are some numbers remaining yet to be found
}

This will run at most in O(6 * ST1 + 6), at best in O(6 + 1) if the first number in the first array covers all of ans (and ans is six times the same number). Note that the test for bit mask being zero can be either after each array (as it is now) or after each element (that way involves more checking but also earlier cutoff when all the numbers are found). In context of CUDA, the first version of the algorithm would likely be faster, as it involves fewer branches and most of the loops (except the one for ST1) can be automatically unrolled.

However, if the range of the numbers is unlimited, we could do something else. Since there are only up to 7 * 6 = 42 different numbers in ans and all the arrays, it would be possible to map those to 42 different numbers and use a 64-bit integer for a bit mask. But arguably this mapping of numbers to integers would already be enough for the test and it would be possible to skip this test altogether.

Another way to do it would be to sort the arrays and simply count coverage of the individual numbers:

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    int all_numbers[36], n = ST1 * 6;
    for(int i = 0; i < 6; ++ i)
        memcpy(&all_numbers[i * ST1], &arrays[i], ST1 * sizeof(int));
    // copy all of the numbers into a contiguous array

    std::sort(all_numbers, all_numbers + n);
    // or use "C" standard library qsort() or a bitonic sorting network on GPU
    // alternatively, sort each array of 6 separately and then merge the sorted
    // arrays (can also be done in parallel, to some level)

    n = std::unique(all_numbers, all_numbers + n) - all_numbers;
    // this way, we can also remove duplicate numbers, if they are
    // expect to occur frequently and make the next test faster.
    // std::unique() actually moves the duplicates to the end of the list
    // and returns an iterator (a pointer in this case) to one past
    // the last unique element of the list - that gives us number of
    // unique items.

    for(int i = 0; i < 6; ++ i) {
        int *p = std::lower_bound(all_numbers, all_numbers + n, ans[i]);
        // use binary search to find the number in question
        // or use "C" standard library bfind()
        // or implement binary search yourself on GPU

        if(p == all_numbers + n)
            return 0; // not found
        // alternately, make all_numbers array of 37 and write
        // all_numbers[n] = -1; before this loop. that will act
        // as a sentinel and will save this one comparison (assuming
        // that there is a value that is guaranteed not to occur in ans)

        if(*p != ans[i])
            return 0; // another number found, not ans[i]
        // std::lower_bound looks for the given number, or for one that
        // is greater than it, so if the number was to be inserted there
        // (before the bigger one), the sequence would remain ordered.
    }

    return 1; // all the numbers were found
}

This will run in O(n) for copying, O(36 log 36) for sorting, optionally O(n) for unique (where n is 6 * ST1) and O(n log n) for searching (where n can be less than 6 * ST1 if unique is employed). The whole algorithm therefore runs in linearithmic time. Note that this does not involve any dynamic memory allocation and as such is suitable even for GPU platforms (one would have to implement sorting and port std::unique() and std::lower_bound(), but all those are farily simple functions).

the swine
  • 10,713
  • 7
  • 58
  • 100
0

With only 36 values to compare, the most efficient approach would be to not use CUDA at all.

Just use a CPU loop.

If you change your inputs, I'll change my answer.

Danny Varod
  • 17,324
  • 5
  • 69
  • 111