Finding divisor pairs

Question

I'm trying to solve this exercise http://main.edu.pl/en/archive/amppz/2014/dzi and I have no idea how to improve perfomance of my code. Problems occure when program have to handle over 500,000 unique numbers(up to 2,000,000 as in description). Then it took 1-8s to loop over all this numbers. Tests I have used are from http://main.edu.pl/en/user.phtml?op=tests&c=52014&task=1263, and I testing it by command
program.exe < data.in > result.out

Description: You are given a sequence of _n integer a₁, a₂, ... a_n. You should determine the number of such ordered pairs(_i, _j), that _i, _j equeals(1, ..., _n), _i != _j and a_i is divisor of a_j. The first line of input contains one integer _n(1 <= _n <= 2000000) The second line contains a sequence of _n integers a₁, a₂, ..., a_n(1 <= a_i <= 2000000). In the first and only line of output should contain one integer, denoting the number of pairs sought. For the input data: 5 2 4 5 2 6 the correct answer is: 6 Explanation: There are 6 pars: (1, 2) = 4/2, (1, 4) = 2/2, (1, 5) = 6/2, (4, 1) = 2/2, (4, 2) = 4/2, (4, 5) = 6/2.

For example:
- with 2M in total numbers and 635k unique numbers, there is 345mln iterations in total
- with 2M in total numbers and 2mln unqiue numbers, there is 1885mln iterations in total

#include <iostream>
#include <math.h>
#include <algorithm>

#include <time.h>


#define COUNT_SAME(count) (count - 1) * count


int main(int argc, char **argv) {
    std::ios_base::sync_with_stdio(0);

    int n; // Total numbers
    scanf("%d", &n);

    clock_t start, finish;
    double  duration;

    int minVal = 2000000;
    long long *countVect = new long long[2000001]; // 1-2,000,000; Here I'm counting duplicates

    unsigned long long counter = 0;
    unsigned long long operations = 0;

    int tmp;
    int duplicates = 0;

    for (int i = 0; i < n; i++) {
        scanf("%d", &tmp);

        if (countVect[tmp] > 0) { // Not best way, but works
            ++countVect[tmp];
            ++duplicates;
        } else {
            if (minVal > tmp)
                minVal = tmp;

            countVect[tmp] = 1;
        }
    }

    start = clock();

    int valueJ;
    int sqrtValue, valueIJ;
    int j;

    for (int i = 2000000; i > 0; --i) {
        if (countVect[i] > 0) { // Not all fields are setted up
            if (countVect[i] > 1) 
                counter += COUNT_SAME(countVect[i]); // Sum same values

            sqrtValue = sqrt(i);

            for (j = minVal; j <= sqrtValue; ++j) {
                if (i % j == 0) {
                    valueIJ = i / j;

                    if (valueIJ != i && countVect[valueIJ] > 0 && valueIJ > sqrtValue)
                        counter += countVect[i] * countVect[valueIJ];

                    if (i != j && countVect[j] > 0)
                        counter += countVect[i] * countVect[j];
                }

                ++operations;
            }
        }
    }

    finish = clock();
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("Loops time: %2.3f", duration);
    std::cout << "s\n";
    std::cout << "\n\nCounter: " << counter << "\n";
    std::cout << "Total operations: " << operations;

    std::cout << "\nDuplicates: " << duplicates << "/" << n;
    return 0;
}

I know, I shouldn't sort the array at beginning, but I have no idea how to make it in better way.

Any tips will be great, thanks!

Here is improved algorithm - 2M unique numbers within 0.5s. Thanks to @PJTraill!

#include <iostream>
#include <math.h>
#include <algorithm>

#include <time.h>


#define COUNT_SAME(count) (count - 1) * count


int main(int argc, char **argv) {
    std::ios_base::sync_with_stdio(0);

    int n; // Total numbers
    scanf("%d", &n);

    clock_t start, finish;
    double  duration;

    int maxVal = 0;
    long long *countVect = new long long[2000001]; // 1-2,000,000; Here I'm counting duplicates

    unsigned long long counter = 0;
    unsigned long long operations = 0;

    int tmp;
    int duplicates = 0;

    for (int i = 0; i < n; i++) {
        scanf("%d", &tmp);

        if (countVect[tmp] > 0) { // Not best way, but works
            ++countVect[tmp];
            ++duplicates;
        } else {
            if (maxVal < tmp)
                maxVal = tmp;

            countVect[tmp] = 1;
        }
    }

    start = clock();

    int j;
    int jCounter = 1;

    for (int i = 0; i <= maxVal; ++i) {
        if (countVect[i] > 0) { // Not all fields are setted up
            if (countVect[i] > 1)
                counter += COUNT_SAME(countVect[i]); // Sum same values

            j = i * ++jCounter;

            while (j <= maxVal) {
                if (countVect[j] > 0)
                    counter += countVect[i] * countVect[j];

                j = i * ++jCounter;
                ++operations;
            }

            jCounter = 1;
        }
    }

    finish = clock();
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("Loops time: %2.3f", duration);
    std::cout << "s\n";
    std::cout << "\n\nCounter: " << counter << "\n";
    std::cout << "Total operations: " << operations;

    std::cout << "\nDuplicates: " << duplicates << "/" << n;
    return 0;
}

Please include relevant information in the question, that is, please add the exercise itself, so people will know what the code is about without having to visit a link which potentially will be broken in the future. — GolezTrol, May 29 '15 at 21:35
There's a code review Stack Exchange site. This would fit there better — Sami Kuhmonen, May 29 '15 at 21:36
No need to sort them at all, just do this: `for i = 1..n; for j = i+1..n; if j | i or i | j; count++`. Requires `O(n^2 / 2) = O(n^2)` work. I doubt you can get it done faster, don't really see the advantage of pre-sorting here (of course I could be wrong). — ShellFish, May 29 '15 at 21:40
@ShellFish Let me get that straight, you claim (n^2)/2 ~= n/2? did you try to see if this claim holds for n~=1000? — amit, May 29 '15 at 21:44
@AmiTavory @amit Yes I'm sorry I'm dyslectic and read the `/` as a `^`, my sincere apologies, you are obviously right :) — ShellFish, May 29 '15 at 21:44
@ShellFish No need to apologize. We're just exchanging thoughts. Finding mistakes is part of the point. — Ami Tavory, May 29 '15 at 21:48
You sound as if you are sure your code is too slow: do you know that, and if so how? Is there a target or time limit? I see none at your link. — PJTraill, May 29 '15 at 22:07
Why are you using shellsort and not something better (and preferably built in)? — amit, May 29 '15 at 22:11
@PJTraill All these tests should pass within specific delay, which is usually one second at max. Also this is exercise task from Polish Collegiate Programming Contest. — , May 29 '15 at 22:11
@amit What do you mean? I have tried qsort, but it's about 50% slower. — , May 29 '15 at 22:12
@Synchro Did you try [std::sort()](http://www.cplusplus.com/reference/algorithm/sort/). People have worked hard to optimize it as much as they can, and it's bug free. Why reinvent the wheel? Was it slower for you? — amit, May 29 '15 at 22:14
@Synchro Or even better, your `countVect` is already doing the first phase of count sort, doing the 2nd one to generate the sorted list is trivial. — amit, May 29 '15 at 22:20
If you’ve only got 1 sec, 0.97 sec for sorting doesn’t leave you much for anything else. So perhaps you don’t want to sort at all — doesn’t `countVect` contain all the information you need? And do you need `long long` to hold 2,000,000? Also, tests like `valueIJ != valueI` and `j != valueI` succeed almost every time, so you could allow for them differently. (I also don’t remember: I take it `new long[2000001]` sets it all to 0) — PJTraill, May 29 '15 at 22:21
Yeah, thats right, I don't have to sort it second way. @PJTraill yep, just forgot about it. I have many times edited this code. But still the whole program should count these numbers within one second. So it's only part of it. I don't think I can optimize it better with this algorithm. — , May 29 '15 at 22:33
@PJTraill Ahh, however valueIJ != valueI and j != valueI is necessary here. In some cases this condition will not pass. — , May 29 '15 at 23:06
@Synchro: What I mean is that one can remove these test from the loop in most cases and handle them separately. Choose loop limits cleverly or have multiple versions of the loop — but it is a fairly small optimisation. — PJTraill, May 30 '15 at 10:44

PJTraill · Accepted Answer · 2015-05-31T10:37:56.267

I expect the following to work a lot faster than the OP’s algorithm (optimisations oblique):

(The type of values and frequencies should be 32-bit unsigned, counts 64-bit – promote before calculating a count, if your language would not.)
Read the number of values, N.
Read each value v, adding one to its frequency freq[v] (no need to store it).
- (freq[MAX] (or MAX+1) can be statically allocated for probably optimal initialisation to all 0)
Calculate the number of pairs involving 1 from freq[1] and the number of values.
For every i in 2..MAX (with freq[i] > 0):
- Calculate the number of pairs (i,i) from freq[i].
- For every multiple m of i in 2m..MAX:
  - (Use m as the loop counter and increment it, rather than multiplying)
  - Calculate the number of pairs (i,m) from freq[i] and freq[m].
- (if freq[i] = 1, one can omit the (i,i) calculation and perform a variant of the loop optimised for freq[i] = 1)
(One can perform the previous (outer) loop from 2..MAX/2, and then from MAX/2+1..MAX omitting the processing of multiples)

The number of pairs (i,i) = _freq[i]C₂ = ( freq[i] * (freq[i] - 1) ) / 2 .
The number of pairs (i,j) = freq[i] * freq[j] for i ≠ j.

This avoids sorting, sqrt and division.

Other optimisations

One can store the distinct values, and scan that array instead (the order does not matter); the gain or loss due to this depends on the density of the values in 1..MAX.

If the maximum frequency is < 2¹⁶, which sounds very probable, all products will fit in 32 bits. One could take advantage of this by writing functions with the numeric type as a template, tracking the maximum frequency and then choosing the appropriate instance of the template for the rest. This costs N*(compare+branch) and may gain by performing D² multiplications with 32 bits instead of 64, where D is the number of distinct values. I see no easy way to deduce that 32 bits suffice for the total, apart from N < 2¹⁶.

If parallelising this for n processors, one could let different processors process different residues modulo n.

I considered keeping track of the number of even values, to avoid a scan of half the frequencies, but I think that for most datasets within the given parameters that would yield little advantage.

@Synchro: I see you have accepted this, but I should point out one important correction: my original statement that all types were 32-bit is, of course wrong and I have corrected it. — PJTraill, May 30 '15 at 13:43
Yes, but your idea about multipling values was correct and was good enough to pass followed tests. Micro-optimisations isn't important here. It isn't hard task to do, but these kind of tasks needs from programmer to think in different way. Above code is not first version of my algorithm. First one was very slow, but I haven't think way you are. — , May 30 '15 at 21:57

score 0 · Answer 2 · answered May 29 '15 at 22:23

Ok, I am not going to write your whole algorithm for you, but it can definitely be done faster. So i guess this is what you need to get going:

So you have your list sorted, so there are a lot of assumptions you can make from this. Take for instance the highest value. It wont have any multiples. The highest value that does, will highest value divided by two.

There is also one other very usefull fact here. A multiple of a multiple is also a multiple. (Still following? ;)). Take for instance the list [2 4 12]. Now you've found (4,12) as a multiple pair. If you now also find (2,4), then you can deduce that 12 is also a multiple of 2. And since you only have to count the pairs, you can just keep a count for each number how many multiples it has, and add that when you see that number as a multiple itself. This means that it is probably best to iterate your sorted list backwards, and look for divisors instead.

And maybe store it in some way that goes like [ (three 2's ), (two 5's), ...] ie. store how often a number occurs. Once again, you don't have to keep track of it's id, since you only need to give them the total number of pairs. Storing your list this way helps you, because all the 2's are going to have the same amount of multiples. So calculate once and then multiply.

Did you read his code? The last part of your answer is already done in his code. The first part is unclear how it could be translated to code. — amit, May 29 '15 at 22:25
Whoops, missed that, but i guess i'll go and write some code now ;) — David van rijn, May 29 '15 at 22:31

Finding divisor pairs

2 Answers2

Other optimisations