With a byte array that is mostly zero, being a sparse array, you can take advantage of a 32 bit CPU by doing comparisons 4 bytes at a time. The actual comparisons are done 4 bytes at a time however if any of the bytes are non-zero then you have to determine which of the bytes in the unsigned long are non-zero so that will take more effort. If the array is really sparse then the time saved with the comparisons may compensate for the additional work determining which of the bytes are non-zero.
The easiest would be to make the unsigned char array sized to some multiple of 4 bytes so that you do not need to worry about doing the last few bytes after the loop completes.
I would suggest doing a timing study on this as it is purely conjectural and there would be a point where an array becomes un-sparse enough that this would take more time than a simple loop.
One question that I would have is what are you doing with the vector of offsets of non-zero elements of the array and whether you can do away with the vector. Another question is if you need the vector whether you can build the vector as you place elements into the array.
unsigned char* array=new unsigned char[4000000];
......
unsigned long *pUlaw = (unsigned long *)array;
for ( ; pUlaw < array + 4000000; pUlaw++) {
if (*pUlaw) {
// at least one byte is non-zero
unsigned char *pUlawByte = (unsigned char *)pUlaw;
if (*pUlawByte)
somevector.push_back(pUlawByte - array);
if (*(pUlawByte+1))
somevector.push_back(pUlawByte - array + 1);
if (*(pUlawByte+2))
somevector.push_back(pUlawByte - array + 2);
if (*(pUlawByte+3))
somevector.push_back(pUlawByte - array + 3);
}
}