-2

I am very new in assembly and I want to find all pythagorean triples in a range from 1 to 100. I'm generating all the numbers in C and all the other calculations should be done in assembly SSE. I was trying to do this by using sqrt command (I've tried all of them) but I couldn't make it work.. Can someone tell me how it should be done?

That's what I've got so far:

int main(){
            for (int i = 1; i <= 100; i++)
            {
                a++;
                if (a > 100)
                    a = 0;
                for (int j = 1; j <= 100; j++)
                {
                    b++;
                    if (b > 100)
                        b = a;
                    _asm   //tricky part begins here:
                    {
                        movups xmm0, a
                        movups xmm1, b
                        pmuludq xmm0, xmm0  
                        pmuludq xmm1, xmm1 
                        //movups xmm2, 0  
                        //paddd xmm2, xmm0  
                        //paddd xmm2, xmm1
                        movups z, xmm0
                    }
                    printf("%d\n", z);
                }
          }
    }
Rukass
  • 27
  • 4
  • 7
    "I want to do this in assembly because I know that it's faster than C." How do you know that? Because it probably won't be. –  Dec 19 '16 at 22:28
  • 6
    I wouldn't assume that your hand-written asm will be faster than a C compiler output. Compilers are smart, modern CPUs are complex. – Blorgbeard Dec 19 '16 at 22:28
  • Write down your algorithm in C or pseudocode first. – Jester Dec 19 '16 at 22:32
  • you need to actually vectorize the code to get a speedup. Just using as a copro is much easier, just switch to win64 as target. – Marco van de Voort Dec 19 '16 at 22:36
  • 2
    I thought all [Pythagorean triples](https://en.wikipedia.org/wiki/Pythagorean_triple) of the form `n*(a*a = b + c)` or something like that? Some math should speed up the process more that C vs. assembly. – chux - Reinstate Monica Dec 19 '16 at 22:56
  • I've edited my question. Speed is not the only reason why I want to try assembler. I also want to learn it (at least basics). – Rukass Dec 19 '16 at 22:57
  • 4
    If you want to learn asm, write the whole loop (or better, the whole function) in asm. Using MSVC inline asm this way is just shooting yourself in the foot. See http://stackoverflow.com/questions/3323445/what-is-the-difference-between-asm-and-asm/35959859#35959859, and also http://stackoverflow.com/a/41208270/224132, which explain why MSVC inline asm sucks for this. – Peter Cordes Dec 19 '16 at 23:08
  • But the basic problem with your approach is that you need to be looking at 4 `b` values in parallel, so you can't just load from a C scalar variable. You need to keep stuff in vector registers across loop iterations, since you aren't just loading vectors from memory or something. – Peter Cordes Dec 19 '16 at 23:11
  • 2
    @Blorgbeard and Rukass: It's almost always possible to beat the compiler with hand-written asm, but you have to know what you're doing. It's also usually best to start with optimized compiler output, so you at least won't do worse. But usually the best thing is to tweak your C source to hand-hold the compiler into making better asm, after you figure out what is optimal. See my [Collatz-conjecture hand-written asm answer](http://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466) for more about all of that. – Peter Cordes Dec 19 '16 at 23:15
  • @Peter absolutely agreed - I was just pointing out OP's false assumption that hand-written ASM *must* be faster than equivalent C code. – Blorgbeard Dec 19 '16 at 23:57
  • 1
    @Blorgbeard: yeah, totally agreed, and what I meant by "have to know what you're doing" :) Compilers aren't smart, though. Impressive and powerful, sure, but not smart. They're very well-designed machines that work really well most of the time. They often miss ways to save a few instructions, but that doesn't matter. The do occasionally generate really braindead code, e.g. auto-vectorizing by unpacking bytes to 32-bit and then packing them back again because that's what the C source did even though wider temporaries didn't affect the result. – Peter Cordes Dec 20 '16 at 00:20

1 Answers1

2

The basic problem with your approach is that you need to be looking at 4 b values in parallel, so you can't just load from a C scalar variable. You need to keep stuff in vector registers across loop iterations, since you aren't just loading vectors from memory or something. You should write the whole loop in asm because MSVC inline asm sucks for wrapping short sequences, due to unavoidable overhead of getting results in / out.

Of course, the best way to vectorize this loop would be with C intrinsics, not with inline asm. Then you can hand-hold the compiler into making better asm if necessary (and if possible) by checking its asm output for inefficiencies. (See Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?)


Of course, if you really just wanted to create efficient code for generating Pythagorean triples, your algorithm is bogus, too:

The Wikipedia article has a generating a triple section which describes Euclid's formula. Iterating that would be a different problem than checking for hits in a brute-force search of the whole a=[1..100] b=[1..100] search space, since checking if a number is a perfect square is fairly slow.

Also, detecting which vector elements match a condition is clumsy. A packed-compare instruction and then PMOVMSKB (or MOVMSKPS) will give you a bitmap, but this works best when hits are rare, e.g. implementing memchr where your loop stops after the first hit.

Community
  • 1
  • 1
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Note that checking the entire `a=[1..100] b=[1..100]` search space takes less than .005 seconds on an average desktop computer, so the whole premise of the question is absurd. – user3386109 Dec 20 '16 at 06:38
  • @user3386109: That's still a million or two core clock cycles, long enough to measure accurately with perf counters. But obviously the point applies more to larger search ranges. And calling `printf` in the middle of the search on each hit is ridiculous, vs. storing them in an array or something. – Peter Cordes Dec 20 '16 at 07:39