how to make the execution time less(ie. a faster code) for this problem

Question

this question is from Codechef.com [if anyone is still solving this question dont look further into the post before trying yourself] and although it runs right, but i need to make it a lot faster.i am a beginner in c,c++.(i know stuff upto arrays,strings and pointers but not file handling etc).So is there a way to make this program run any faster without making it complex(its ok if its complex in algorithm). i will accept complex coding too if u mention which book u followed where it is all given :) . i currently am following Robert Lafore. Here is the program:-

There are N numbers a[0],a[1]..a[N - 1]. Initally all are 0. You have to perform two types of operations :

1) Increase the numbers between indices A and B by 1. This is represented by the command "0 A B"

2) Answer how many numbers between indices A and B are divisible by 3. This is represented by the command "1 A B".

Input :

The first line contains two integers, N and Q. Each of the next Q lines are either of the form "0 A B" or "1 A B" as mentioned above.

Output :

Output 1 line for each of the queries of the form "1 A B" containing the required answer for the corresponding query.

Sample Input :

Sample Output :

Constraints :

1 <= N <= 100000
1 <= Q <= 100000
0 <= A <= B <= N - 1

HERE IS MY SOLUTION:-

#include<stdio.h>

int main()
{
    unsigned int n; //amount of numbers taken
    scanf("%u",&n);

    unsigned int arr[n],q,a,b,count=0,j,i;
    short int c;
    scanf("%u",&q);//cases taken  
    for(i=0;i<n;++i)
    {
      arr[i]=0;
    }    


    for(i=0;i<q;++i)
    {
      scanf("%d%u%u",&c,&a,&b);//here a,b are A and B respectively and c is either 
                               //o or 1

      if(c==0)
      {
        for(j=a;j<=b;++j)
          arr[j]++;
      }


      if(c==1)
      {
        for(j=a;j<=b;++j)
        {
          if(arr[j]%3==0)
            count++;
        }
       printf("%u\n",count);
      }
      count=0;
    }  

}

Why do I have this feeling that the same question came up not more than a week back? — dirkgently, Sep 05 '10 at 10:25
no, as i said its a problem from codechef.com.and it first came on sept1 on codechef. so it could not have come more than a week back — gamma, Sep 05 '10 at 10:25
@gamma: Someone else had posted it I'm sure. And this is a very common problem. A lot of other programming competition websites would surely have this problem. — dirkgently, Sep 05 '10 at 10:31
This code is only valid when ISO C99 compilation is used (which precludes using VC++). I would at suggest least making it more universally portable. — Clifford, Sep 05 '10 at 10:31
@Clifford. If microsoft can't be bothered to support a 10+ year old standard, should that really hold back the rest of the world? — aaronasterling, Sep 05 '10 at 10:45
At last! Someone else (not just me) voted to close as too localized! — P Shved, Sep 05 '10 at 11:07
@aaronasterling: If you post a question and want the largest possible audience to be able to investigate it; perhaps yes! That would be to the benefit of the OP perhaps. — Clifford, Sep 05 '10 at 11:07
@Martin: "The complexity is O(n). You can *NOT* really get better than that." ;) @Pavel: the question isn't worse or better than an average C/C++ question over here at SO. — Dummy00001, Sep 05 '10 at 11:37
@Clifford: Correct me if I'm wrong, but doesn't VC++ support VLAs as an extension? This seems to be a popular extension in C++ compilers. — slacker, Sep 05 '10 at 21:37
@slacker: I am correcting you ;) The accepted answer at http://stackoverflow.com/questions/146381/visual-studio-support-for-new-c-c-standards is germane to this issue. There is also the issue declaration placement that is C99 specific. — Clifford, Sep 06 '10 at 08:43

Dummy00001 · Accepted Answer · 2010-09-05T11:19:04.797

2

1) Increase the numbers between indices A and B by 1. This is represented by the command "0 A B"

2) Answer how many numbers between indices A and B are divisible by 3. This is represented by the command "1 A B".

Initially numbers are 0 and thus are divisible by 3. Increment by one make the number non-divisible. Next increment - number still non-divisible. Third increment makes the number again divisible.

First optimization one can try is to not to let number grow above 2: if during increment number goes from 2 to 3, set it back to zero. Now search for the range become a simple comparison with 0. (That way array would contain instead of the number its modulo 3.)

Second optimization is to use extents instead of plain array, e.g. something similar to the RLE: collapse into a range all adjacent numbers with the same divisibility. Instead of numbers, the array would contain structures like that:

struct extent {
   int start; /* 0 .. N-1; extent should have at least one number */
   int end;   /* 0 .. N   */
   int n;     /* 0, 1, 2; we are only interested in the result of % */
};

Initially the array would contain single extent covering the all numbers {0, N, 0}. During increment step, a range might be split or joined with adjacent one. That representation would speed-up the counting of numbers as you would go over the array not one-by-one but in chunks. (It would still degrade to the linear search if all ranges are contain only one element.)

Another approach could be to use instead of the array three sets with indeces. Set #0 would contain all the indeces of numbers whose modulo 3 is 0, set #1 - 1, set #2 - 2. Since during increment operation, we need to do a search, instead of std::set it is better to use e.g. std::bitset with every bit marking the index of the number belonging to the set.

N.B. This way we do not keep the original numbers at all. We implicitly keep only the result of modulo 3.

During increment, we need to find which set the index belongs to, e.g. set #n, and move the index to the next (mod 3) set: set the bit to zero in the set n, set the bit to 1 in the set n + 1 (mod 3). Counting numbers divisible by 3 now ends up being as simple as counting the non-zero bits in the set #0. That can be accomplished by creating a temp std::bitset as a mask with bits in range [A,B] set to one, masking with the temp set the set #0 and calling std::bitset::count() on the resulting bitset.

edited Sep 05 '10 at 11:19

answered Sep 05 '10 at 11:06

Dummy00001

16,630
5
41
63

Or you can use both approaches and select the best one at runtime: Have an extent structure, with two types of extents - a "flat" extent (the one you described, all elements share a common modulus), and an "array" extent, where the elements have different moduli which are stored for each element separately. You start with a single "flat" extent covering the entire range, and when you get a sequence of very short "flat" extents, you convert it to a single "array" extent. This should improve the worst-case performance considerably. – slacker Sep 05 '10 at 12:06
1

You can optimize the "array" approach by exploiting the fact that moduli fit into two bits each - so store 16 of them in a single 32-bit integer. Add `0x55555555` to the integer to increment all of moduli in a single operation. Keep the result in range with `array &= ~( (array >> 1 & array & 0x55555555) * 3)`. – slacker Sep 05 '10 at 12:17
Actually, in my head, I was thinking more about getting away from the `O(n)` complexity. [Bit tricks are plenty.](http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/) Question is how to break the worst case of `O(n)` on count and increment operations, say to have it at `O(log(n))`. But it appears to be impossible. – Dummy00001 Sep 05 '10 at 12:44
@gamma: trick is to think of the data from perspective of your task - and available algorithms. learn math (boring stuff like applied algebra or math analysis) and tricks of adjusting your mind to the task become quite easy. For extents you might want to check the [B-trees](http://en.wikipedia.org/wiki/B-tree) or [tries](http://en.wikipedia.org/wiki/Trie). – Dummy00001 Sep 05 '10 at 13:00
if i am not wrong u learn this in data structures right? Damn i knew i should have taken computer science.......But still electronics is better(i can do it bothways).thanks for the suggestions, now i know what i am missing. – gamma Sep 05 '10 at 13:10
@Dummy00001: Wait... it **IS** possible to do increments and counts in O(log(N)) - both best and worst case! I'm a genius! :D MWAHAHAHAHAHA!!! – slacker Sep 05 '10 at 13:19
I'll come back later to describe the approach I just came up with. For now, I have a few urgent things to do. You know, Real Life, that thing we read about on Slashdot from time to time (between unicorn and Yeti stories). – slacker Sep 05 '10 at 13:21
@gamma: I haven't learned that. I studies applied math and system programming mostly. That with decade+ years of experience in software development gives some flexibility of coming up with random algorithms on a whim. And it's not that I did something special here: it is simply overoptimized version of your brute-force algorithm, with the same asymptotic `O(n)` performance but simply with lower run-time overhead. – Dummy00001 Sep 05 '10 at 13:42

adamk · Answer 2 · 2010-09-05T15:05:40.410

2

Two very simple optimizations:

You could really only store the value modulo 3 in the array, and not the actual value.

The increment could be done by a simple lookup table (to avoid comparisons and branches):

char increment[3] = { 1, 2 ,0 };
new_val = increment[old_val];

Testing for 3-divisibility is now comparing to 0 - which is significantly faster than integer division.

edited Sep 05 '10 at 15:05

answered Sep 05 '10 at 11:10

adamk

45,184
7
50
57

I think modern CPUs would prefer `new_val = (old_val + 1) == 3 ? 0 : (old_value+1)` instead of the extra lookup table. But anyhow you want to make the lookup table `static` so that it will not be on the stack but in the data segment. – Dummy00001 Sep 05 '10 at 14:46
@Dummy00001 - not sure why modern CPUs would prefer an arithmetic operation and a branch over a cached memory read. Regarding `static` - you're probably right, although as its only 3 bytes it's really not crucial. – adamk Sep 05 '10 at 15:05
Firstly, arithmetic operations are faster than cached memory reads - *especially* if the address is not known well in advance. Secondly, with a good compiler there is no branch here - just one conditional move. – slacker Sep 05 '10 at 15:42

aaronasterling · Answer 3 · 2010-09-05T10:50:24.367

1

One improvement that you can make is to replace

if (c == 0) {
    //code here
}

if (c == 1) {
   // code here
}

with:

if (c == 0) {
   //...
} else if (c == 1) {
  //...
}

if you know for sure that c will always be 0 or 1, you can also replace the else if with a simple else.

What's really slowing you down is the I/O. If it's a big enough list, it might pay to malloc enough memory to hold the input and output. Then collect all the input before you enter the loop and display the output at the end.

edited Sep 05 '10 at 10:50

answered Sep 05 '10 at 10:35

aaronasterling

68,820
20
127
125

Would using streaming IO classes (cin) help with IO performance? I've no experience in comparing them. – JBRWilkinson Sep 05 '10 at 11:06
Though most modern compilers should be smart enough to figure the `else if` out itself and do the optimization automatically. – Grant Peters Sep 05 '10 at 11:08
@JBRWilkinson No. Maybe a little but only very marginally. IO is slow _however_ you do it. @Grant Peters. True but the `else if` also makes it more explicit to people reading the code. The IO is the big holdup here anyways. – aaronasterling Sep 05 '10 at 11:14
else or else if is not helping, my time still exceeds the limit.can u tell me how to collect all input before entering the loop and displaying output at end.is it related to buffer or stuff? – gamma Sep 05 '10 at 11:20

score 0 · Answer 4 · answered Sep 05 '10 at 10:31

0

From what I can see, your code is as optimized as it will ever be.

-Alex

answered Sep 05 '10 at 10:31

Alexander Rafferty

6,134
4
33
55

For a naïve algorithm, that is ;). – slacker Sep 05 '10 at 15:54

score 0 · Answer 5 · answered Sep 05 '10 at 10:34

0

Your solution seems fine I think apart from lacking any bounds checking. Maybe use an 'else' when checking 'c' or a switch, but this will save you trivial amounts of time. I don't think you'll find something as useless as this in any book.

answered Sep 05 '10 at 10:34

James

9,064
3
31
49

score 0 · Answer 6 · answered Sep 05 '10 at 10:34

This looks pretty performant to me. One thing I see is that you're using variable length arrays, this is OK for C (AFAIK) but illegal in C++, for C++ you will need to either use an std::vector or new up the array.

The only place I can see you improving your performance is by using Duff's Device for partial loop unrolling which I wouldn't recommend for a toy sample.

score 0 · Answer 7 · answered Sep 05 '10 at 14:01

This is pretty complicated, but stay with me. I can't cite any specific place I got this from except "27 years of coding experience."

The original problem has the number line set to natural integers 0,1,2,3,4,5,6... However, we only care about numbers that are divisible by 3, so let's redefine our number line to have only three values in it: {2,3,4} and remap the number line such that:

0 => 4
1 => 2
2 => 3
3 => 4
4 => 2
5 => 3
6 => 4
.. and so on.

You'll notice that numbers that are divisible by 3 are mapped to 4 in our sequence. Why use {2,3,4}? 4 is 100 in binary, which means any element with its 3rd bit set will be divisible by 3. This is easy to test with bit operations.

Since we're using 2,3,4 as the trinary sequence, we can reduce our array element size to 4-bits. We'll define the array as 8-bit values, but half the size as we need in bytes (plus 1 in case it is an odd-sized array), and store the elements two per byte in the array. The adds and compares can be done as SIMD (single instruction, multiple data) operations, incrementing or checking up to 16 elements per loop iteration by using some clever bit operations.

That's the concept. Now down to the code.

First, we need to allocate and initialize our array.

unsigned char *arr = malloc(n/2 + 1);

// Init all element values to 4:
memset(&arr, 0x44, n/2 + 1);

We're going to increment 16 elements at a time by casting a block of 8 array bytes to a uint_64, add 0x1111111111111111 and then skip to the next block. Repeat with 32-bit, 16-bit, 8-bit and 4-bit math for up to 8, 4, 2 or 1 remaining at the end of the operation.

Before each increment, anything that has a value of 4 needs to be decremented by 3 before the increment to keep the numbers in the proper position.

Here's the code (untested) for the increment command:

/**
   @param p
      p is the address of the byte with the first aligned element to be incremented, &arr[A/2] when A is even, &arr[A/2]+1 when A is odd.
   @param j
      j is the number of elements to increment.  (B-A) when A is even, (B-A-1) when A is odd.
 */ 
void increment_aligned_block(unsigned char *p, int j)
    uint64_t fours;

    while (j>16) {
       // Find the ones that are value 4
       fours = *p & 0x4444444444444444;
       // Decrement each that matches by 3
       *p -= (fours >> 1 | fours >> 2);

       // Add 1 to each of the 16 array elements in the block.
       (uint64_t)(*p) += 0x1111111111111111;
       p += 8; j -= 16;
    }
    if (j >= 8) {
        // repeat the above for 32-bits (8 elements)
        // left as an exercise for the reader.
        p += 4; j -= 8;
   }
    if (j >= 4) {
        // repeat the above for 16-bits (4 elements)
        // left as an exercise for the reader.
        p += 2; j -= 4;
    }
    if (j >= 2) {
        // repeat the above for 8-bits (2 elements)
        // left as an exercise for the reader.
        p += 1; j -= 2;
    }
    if (j == 1) {
        // repeat the above for 8-bits (1 elements)
        // left as an exercise for the reader.
    }
}

For the compare use:

/**
   @param p
      p is the address of the byte with the first aligned element to be counted, &arr[A/2] when A is even, &arr[A/2]+1 when A is odd.
   @param j
      j is the number of elements to count.  (B-A) when A is even, (B-A-1) when A is odd.
 */ 
int count_aligned_block(unsigned char *p, int j)
    int count = 0;
    uint64_t divisible_map;

    while (j > 16) {
        // Find the values of 4 in the block
        divisible_map = (uint64_t)(*p) & 0x4444444444444444;

        // Count the number of 4s in the block,
        // 8-bits at a time
        while (divisible_map) {
          switch (divisible_map & 0x44) {
            case 0x04:
            case 0x40:
                count++;
                break;
            case 0x44:
                count += 2;
                break;
            default:
                break;
          }
          divisible_map >>= 8;
        }
    }
    // Repeat as above with 32, 16, 8 and 4-bit math.
    // Left as an exercise to the reader

    return count;
}

You might have noticed the functions are called foo_aligned_block and p needs to be the byte of the first aligned element. What is that? Since we're packing two elements per byte, the starting element index must be aligned to an even number. If the command in the file is 0 0 30, then we can call increment_algined_block(&arr[A/2], 30), no problem. However, if the command in the file is 0 1 30, then we need to have additional code to handle the unaligned first element at index 1, and then call increment_aligned_block(&arr[A/2 + 1], 29). Again, left as an exercise to the reader.

I'd like to note this is not the most optimal it can be.

Unaligned accesses are usually pretty expensive. That is, reading an 8-byte value from an 8-byte aligned address is faster than from a non-aligned address. We can add additional optimizations to only call foo_aligned_block() such that all accesses are guaranteed to be aligned.

thanks...can u please tell me what are 0x4444444444444444's and 0x1111111111111111's. i mean i have searched my book thoroughly but i cant seem to find operations like these anywhere.i do get your algo though. — gamma, Sep 05 '10 at 15:28
@gamma: These are bit masks which are used to extract only some bits from the bit-map. Write them out in binary and follow the bit-wise AND to see what happens. — slacker, Sep 05 '10 at 15:46
It is better to use the {0,1,2,(3)} sequence. You can then fit the modulus in only two bits. While testing for "3" is slightly more complicated and expensive, this is vastly offset by the advantage of putting **TWICE** as much moduli in a single word. See my second comment under Dummy00001's answer. — slacker, Sep 05 '10 at 15:50

how to make the execution time less(ie. a faster code) for this problem

7 Answers7