x86 Assembly - 2 largest values out of 4 given numbers

Question

I'm writing a C subroutine in assembler that needs to find the 2 largest values out of 4 values passed in and multiplies them together. I'm working on finding the largest values but I'm a bit stuck. I have this to find the largest value but I can't seem to reason how to get the second highest. Any suggestions would be appreciated

first:                             
     push bp                            
     mov  bp,sp                            
     mov  ax,[bp+4]
     mov  [max1],ax
     cmp  ax,[bp+6]
     jge  skip1
     mov  ax,[bp+6]
     mov  max1,ax

skip1:
     mov  ax,max1
     cmp  ax,[bp+8]
     jge  skip2
     mov  ax,[bp+8]
     mov  max1,ax

skip2:  
     mov  ax,max1

     cmp  ax,[bp+10]
     jge  mult


mult:
     mul [max1],[max2]
     jmp fin


fin:
     pop bp                       
     ret                          

     end

Many ways, but it is more an algorithm question than an asm one. You could sort the numbers and pick the first two, or write a function that finds the largest of an array and apply it twice, removing the one found. — Jester, Oct 24 '16 at 23:46
`mul [max1],[max2]`: You need imul if you want to use two explicit operands. And no x86 instruction takes two explicit memory operands. See the [x86 tag wiki](http://stackoverflow.com/tags/x86/info) for links to the insn set reference manual. — Peter Cordes, Oct 25 '16 at 07:08
Compare the first two values, and swap them if needed. Then compare the last two values, and swap them if needed. Finally, do the same with the middle two values. Then multiply the first two. This can be done with all four values in registers. — Terje D., Oct 25 '16 at 07:12
@TerjeD.: oh yeah, that's good. I was just thinking about ways to do it that produce the right product without caring about the order of highest vs. 2nd highest. e.g. with SSE2, generate all 6 pairwise products and then find the max with a few PMAXSW instructions. — Peter Cordes, Oct 25 '16 at 07:14
@PeterCordes I now see that a final swap of first and last may be needed as well (in case the largest values is the last two). — Terje D., Oct 25 '16 at 07:34
@TerjeD.: I realized just after finishing my answer that two negative numbers could have a larger product than the product of the largest two signed inputs. Oops. — Peter Cordes, Oct 25 '16 at 08:49

Peter Cordes · Answer 1 · 2016-10-25T10:54:54.320

Presumably you weren't looking for a SIMD answer, but I though it would be interesting to write. And yes, SSE instructions work in 16-bit mode. VEX-encoded instructions don't, so you can't use the AVX 3-operand versions. Fortunately, I was able to write it without any extra MOVDQA instructions anyway, so AVX wouldn't help.

IDK how to answer this the way you probably want without just doing your homework for you. If you're actually interested in a high performance implementation, rather than just anything that works, please update your question.

Since you only need to return the product of the two highest numbers, you could just produce all 6 pairwise products and take the max. (4 choose 2 = 6).

If brute force doesn't work, you aren't using enough :P

update: I just realized that this will give the wrong answer if the largest pairwise product is from two negative numbers. It will work if you can rule out negative inputs, or otherwise rule out inputs where this is a problem. See below for an SSE4.1 version that finds the max and 2nd-max separately.

This does the trick with no branching, using SSE2. (You could do the same thing in MMX registers using only SSE1, which added the MMX-register version of PMAXSW). It's just 11 instructions (not counting the prologue/epilogue), and they're all fast, mostly single-uop on most CPUs. (See also the x86 tag wiki for more x86 links)

;; untested, but it does assemble (with NASM)
BITS 16

;; We only evaluate 16-bit products, and use signed comparisons on them.
max_product_of_4_args:
   push    bp
   mov     bp, sp

   ; load all 4 args into a SIMD vector
   movq    xmm0, [bp+4]              ;xmm0 = [ 0...0 d c b a ] (word elements)
   pshuflw xmm1, xmm0, 0b10010011    ;xmm1 = [ 0..   c b a d ] (rotated left)
   pshufd  xmm2, xmm0, 0b11110001    ;xmm2 = [ 0..   b a d c ] (swapped)
   pmullw  xmm1, xmm0                ; [ 0..  cd bc ab ad ]  (missing ac and bd)                                                                                    
   pmullw  xmm2, xmm0                ; [ 0..  bd ac bd ac ]

   ; then find the max word element between the bottom halves of xmm1 and xmm2
   pmaxsw  xmm1, xmm2
   ; now a horizontal max of xmm1
   pshuflw xmm0, xmm1, 0b00001110    ; elements[1:0] = elements[3:2], rest don't care
   pmaxsw  xmm0, xmm1
   pshuflw xmm1, xmm0, 0b00000001
   pmaxsw  xmm0, xmm1

   ; maximum product result in the low word of xmm0
   movd    eax, xmm0
   ; AX = the result.  Top half of EAX = garbage.  I'm assuming the caller only looks at a 16-bit return value.                                                     

   ; To clear the upper half of EAX, you could use this instead of MOVD:
   ;pextrw  eax, xmm0, 0                                                                                                                                            
   ; or sign extend AX into EAX with CWDE                                                                                                                           

fin:                                                                                                                                                               
     pop bp                                                                                                                                                         
     ret                                                                                                                                                            
end

If you want 32-bit products, PMAXSD is part of SSE4.1. Maybe unpack with zeros (or PMOVZXWD), and use PMADDWD to do 16b * 16b->32b vector multiplies. With the odd elements all zero, the horizontal add part of PMADDWD just gets the result of the signed multiply in the even elements.

Fun fact: MOVD and pextrw eax, xmm0, 0 don't need an operand-size prefix to write to eax in 16-bit mode. The 66 prefix is already part of the required encoding. pextrw ax, xmm0, 0 doesn't assemble (with NASM).

Fun fact #2: ndisasm -b16 incorrectly disassembles the MOVQ load as movq xmm0, xmm10:

$ nasm -fbin 16bit-SSE.asm

$ ndisasm -b16 16bit-SSE
...
00000003  F30F7E4604        movq xmm0,xmm10
...

$ objdump -b binary -Mintel -D  -mi8086 16bit-SSE
...
3:   f3 0f 7e 46 04          movq   xmm0,QWORD PTR [bp+0x4]
...

design notes for the 2 shuffle, 2 multiply way.

[  d  c  b  a ] ; orig
[  c  b  a  d ] ; pshuflw
  cd bc ab ad :  missing ac and bd

[  b  a  d  c ] ; pshuflw.  (Using psrldq to shift in zeros would produce zero, but signed products can be < 0)
 ;; Actually, the max must be > 0, since two odd numbers will make a positive

I looked at trying to only do one PMULLW by creating inputs for it with two shuffles. It would be easy with PSHUFB (with a 16-byte mask constant).

But I'm trying to limit it to SSE2 (and maybe code that could be adapted to MMX). Here's one idea that didn't pan out.

[  d  d  c  c  b  b  a  a ]   ; punpcklwd
[  b  a  b  a  b  a  d  c ]   ; pshufd
  bd ad bc ac bb ab ad ac

: ab ac ad
:    bc bd
:       cd(missing)
:             bb(problem)

I'm not even sure that would be better. It would need an extra shuffle to get the horizontal max. (If our elements were unsigned, maybe we could use SSE4.1 PHMINPOSUW on 0 - vec to find the max in one go, but the OP is using signed compares.)

SSE4.1 PHMINPOSUW

We can add 32768 to each element and then use unsigned stuff.

Given a signed 16-bit val: rangeshift = val + 1<<15 maps the lowest to 0, and the highest to 65535. (add, subtract, or XOR (add-without-carry) are all equivalent for this.)

Since we only have an instruction to find the horizontal minimum, we can reverse the range with negation. We need to do that first, because 0 stays 0, while 0xFFFF becomes 0x0001, etc.

So -val + 1<<15, or mapped = 1<<15 - val maps our signed values to unsigned, in such a way that the lowest unsigned value is the greatest signed value. To reverse this: val = 1<<15 - mapped.

Then we can use PHMINPOSUW to find the lowest (unsigned) word element (the max original element), mask that to all-ones, then PHMINPOSUW again to find the second-lowest.

push    bp
mov     bp, sp

pcmpeqw  xmm5, xmm5         ; xmm5 = all-ones (anything compares == itself)
psrlw    xmm5, 15           ; _mm_set1_epi16(1<<15)

movq     xmm0, [bp+4]
psubw    xmm5, xmm0         ; map the signed range to unsigned, in reverse order

phminposuw xmm1, xmm5       ; xmm1 = [ 0...  minidx  minval ]
movd     eax, xmm1          ; ax = minval

psrldq   xmm1, 2            ; xmm1 = [ 0...          minidx ]
psllw    xmm1, 4            ; xmm1 = [ 0...          minidx * 16 ]

pcmpeqw  xmm2, xmm6
psrlq    xmm2, 48           ; xmm2 = _mm_set1_epi64(0xFFFF)

psllq    xmm2, xmm1         ; xmm2 = _mm_set1_epi64(0xFFFF << (minidx*16))
; force the min element to 65535, so we can go again and get the 2nd min (which might be 65535, but we don't care what position it was in)
por      xmm2, xmm5

phminposuw xmm3, xmm2
movd     edx, xmm3          ; dx = 2nd min, upper half of edx=garbage (the index)

mov      cx, 1<<15          ; undo the range shift
neg      ax
add      ax, cx
sub      cx, dx

imul     cx                 ; signed multiply dx:ax = ax * cx
pop      bp
ret                         ; return 32-bit result in dx:ax (or caller can look at only the low 16 bits in ax)

This is more instructions. It might not be better than a CMP/CMOV sorting network using integer registers. (See @Terje's comment for a suggestion on what compare-and-swap to use).

I'm planning to implement the brute force method(thanks for the suggestion and detailed answer) Is the best way to go about it like: 1.define 6 constants for each product 2.move val1 to ax, multiply it to val2 and store in product1 3.do this for each product 4.compare each product. I'm very novice with assembly programming so none of this is coming easy to me yet. — , Oct 25 '16 at 17:13
@AaronHiller: I think you mean "6 storage locations". They're not constants if you overwrite them! But anyway, no, that's not the best way. If you're going to brute-force it with scalar code, just keep the current current max in a register, and update it if necessary after every multiply. Probably use two nested loops to do the equivalent of `for(int i=3 ; i>=0 ;--i) { for(int j=i-1 ; j>=0 ; --j) { prod=max(prod, args[i]*args[j]); }}` — Peter Cordes, Oct 26 '16 at 01:43

Paolo Fassin · Answer 2 · 2017-10-02T11:42:38.850

This is my first solution to process the maximum number between four 16 Bit unsigned integer input numbers (80x86+ processors compatibile):

Procedure Max4; Assembler;

{ Input: AX, BX, CX, DX
 Output: AX= MAX(AX,BX,CX,DX).
   Temp: DI}

Asm

{ Input: AX, BX
 Output: AX= MAX(AX,BX).
   Temp: DI}

     Sub   AX,BX
     CmC
     SbB   DI,DI
     And   AX,DI
     Add   AX,BX

{ Input: CX, DX
 Output: CX= MAX(CX,DX).
   Temp: DI}

     Sub   CX,DX
     CmC
     SbB   DI,DI
     And   CX,DI
     Add   CX,DX

{ Input: AX, CX
 Output: AX= MAX(AX,CX).
   Temp: DI}

     Sub   AX,CX
     CmC
     SbB   DI,DI
     And   AX,DI
     Add   AX,CX

End;

My procedure Max4(), that is equivalent to AX=Max4(AX,BX,CX,DX), works great to a AX=Max(AX,BX) sub-routine that return a maximum value between two numbers and it is used three time:

AX=Max(Max(AX,BX),Max(CX,DX))

The AX=Max(AX,BX) sub-routine works as follow:

1) Diff=AX-BX.
2) If Diff>=0 then AX is the greatest number,
   Output= Diff+BX= AX-BX+BX= AX.
3) If Diff<0 then BX is the greatest number,
   must set Diff to 0,
   Output= Diff+BX= 0+BX= BX.

In ASSEMBY:

{ Input: AX, BX
 Output: AX= MAX(AX,BX).
   Temp: DI}

     Sub   AX,BX
    {Diff= AX-BX}
     CmC
    {If Diff>=0 -> FCarry=1 else FCarry=0}
     SbB   DI,DI
    {If Diff>=0 -> DI=DI-DI-1==-1 else DI=DI-DI-0==0}
     And   AX,DI
    {If Diff>=0 -> Diff=(Diff & -1)==Diff else Diff=(Diff & 0)==0}
     Add   AX,BX
    {AX= Diff+BX}

But this solution works with unsigned 16 Bit numbers only and process only one largest number (don't do the multiplication). The next solution works correctly on 80x86+ processors (works with signed integer numbers; process two largest numbers):

Function Max42R(A,B,C,D:Integer):LongInt; Assembler;

Asm

     Mov   AX,A
     Mov   BX,B
     Mov   CX,C
     Mov   DX,D

   {1ø swap (when needed), 1ø scan}

     Cmp   AX,BX
     JLE   @01

     XChg  AX,BX

   {2ø swap (when needed), 1ø scan}

 @01:Cmp   BX,CX
     JLE   @02

     XChg  BX,CX

   {3ø swap (when needed), 1ø scan}

 @02:Cmp   CX,DX
     JLE   @03

     XChg  CX,DX

   {1ø swap (when needed), 2ø scan}

 @03:Cmp   AX,BX
     JLE   @04

     XChg  AX,BX

   {2ø swap (when needed), 2ø scan}

 @04:Cmp   BX,CX
     JLE   @05

     XChg  BX,CX

 {DX is the first greatest number;
  CX is the second greatest number}

 @05:Mov   AX,DX
     Mul   CX

End;

It is a variation of bubble-sort algorithm. In the bubble-sort yo must compare each pair of adjacent numbers in the array and swap them if the first is greater than the second; repeat the array scan if a swap is occurred until the array is sorted. But after the first scan the last value of the array is always the greatest number. Supposing that the four input values are as in a virtual array, I swap the first three pairs of registers, only when needed, to get the first greatest value,

that is in the last register. After, I swap the first two pairs of registers, only when needed, to get the second greatest value, that is in the penultimate register.

Max4() procedure can be written on 80386+ processor as follow (support 32 Bit signed integer numbers; process one largest number):

Function Max4I(A,B,C,D:Integer):Integer; Assembler;

{ Input: EAX, EBX, ESI, EDI
 Output: EAX= MAX(EAX,EBX,ESI,EDI).
   Temp: CX.

  EAX EDX ECX are 1°, 2° AND 3° PARAMETERs.
  Can freely modify the EAX, ECX, AND EDX REGISTERs. }

Asm

     Push  EBX
     Push  EDI
     Push  ESI

{------------------------}

     Mov   EAX,A
     Mov   EBX,B
     Mov   ESI,C
     Mov   EDI,D

{ Input: EAX, EBX
 Output: EAX= MAX(EAX,EBX).
   Temp: ECX}

     Sub   EAX,EBX
     Mov   ECX,0
     SetL  CL
     Dec   ECX
     And   EAX,ECX
     Add   EAX,EBX

{ Input: EAX, ESI
 Output: EAX= MAX(EAX,ESI).
   Temp: ECX}

     Sub   EAX,ESI
     Mov   ECX,0
     SetL  CL
     Dec   ECX
     And   EAX,ECX
     Add   EAX,ESI

{ Input: EAX, EDI
 Output: EAX= MAX(EAX,EDI).
   Temp: ECX}

     Sub   EAX,EDI
     Mov   ECX,0
     SetL  CL
     Dec   ECX
     And   EAX,ECX
     Add   EAX,EDI

{------------------------}

     Pop   ESI
     Pop   EDI
     Pop   EBX

End;

Finally the definitive solution that get the two greatest numbers between four 32 Bit signed integer numbers (80386+ processors). It works as Max42R() function:

Function Max42(A,B,C,D:Integer):Integer; Assembler;

{ Input: EAX, EBX, ESI, EDI
 Output: EDI= 1° MAX(EAX,EBX,ESI,EDI).
         ESI= 2° MAX(EAX,EBX,ESI,EDI).
   Temp: ECX, EDX.

  EAX EDX ECX are 1°, 2° AND 3° PARAMETERs.
  Can freely modify the EAX, ECX, AND EDX REGISTERs. }

Asm

     Push  EBX
     Push  EDI
     Push  ESI

     Mov   EAX,A
     Mov   EBX,B
     Mov   ESI,C
     Mov   EDI,D

{ Input: EAX, EBX
 Output: EAX= MIN(EAX,EBX).
         EBX= MAX(EAX,EBX).
   Temp: ECX, EDX}

     Sub   EAX,EBX
     Mov   EDX,EAX
     Mov   ECX,0
     SetGE CL
     Dec   ECX
     And   EAX,ECX
     Add   EAX,EBX
     Not   ECX
     And   EDX,ECX
     Add   EBX,EDX

{ Input: EBX, ESI
 Output: EBX= MIN(EBX,ESI).
         ESI= MAX(EBX,ESI).
   Temp: ECX, EDX}

     Sub   EBX,ESI
     Mov   EDX,EBX
     Mov   ECX,0
     SetGE CL
     Dec   ECX
     And   EBX,ECX
     Add   EBX,ESI
     Not   ECX
     And   EDX,ECX
     Add   ESI,EDX

{ Input: ESI, EDI
 Output: ESI= MIN(ESI,EDI).
         EDI= MAX(ESI,EDI).
   Temp: ECX, EDX}

     Sub   ESI,EDI
     Mov   EDX,ESI
     Mov   ECX,0
     SetGE CL
     Dec   ECX
     And   ESI,ECX
     Add   ESI,EDI
     Not   ECX
     And   EDX,ECX
     Add   EDI,EDX

{ Input: EAX, EBX
 Output: EAX= MIN(EAX,EBX).
         EBX= MAX(EAX,EBX).
   Temp: ECX, EDX}

     Sub   EAX,EBX
     Mov   EDX,EAX
     Mov   ECX,0
     SetGE CL
     Dec   ECX
     And   EAX,ECX
     Add   EAX,EBX
     Not   ECX
     And   EDX,ECX
     Add   EBX,EDX

{ Input: EBX, ESI
 Output: EBX= MIN(EBX,ESI).
         ESI= MAX(EBX,ESI).
   Temp: ECX, EDX}

     Sub   EBX,ESI
     Mov   EDX,EBX
     Mov   ECX,0
     SetGE CL
     Dec   ECX
     And   EBX,ECX
     Add   EBX,ESI
     Not   ECX
     And   EDX,ECX
     Add   ESI,EDX

{EDI contain the first maximum number;
 ESI contain the second maximum number}

     Mov   EAX,EDI

{------------------------}

     Pop   ESI
     Pop   EDI
     Pop   EBX

End;

How to swap two register only if the first is greater then the second ?

This is the code (on the 80386+):

{ Input: EAX, EBX
 Output: EAX= MIN(EAX,EBX).
         EBX= MAX(EAX,EBX).
   Temp: ECX, EDX}

     Sub   EAX,EBX (* Diff= EAX-EBX; set Overflow flag and Sign flag *)
     Mov   EDX,EAX (* EDX= Diff; flags not altered *)
     Mov   ECX,0   (* ECX= 0; flags not altered *)
     SetGE CL      (* If Sign flag == Overflow flag ECX= 1 else ECX=0 *)
     Dec   ECX     (* If Diff>=0, ECX=0 else ECX=-1 *)
     And   EAX,ECX (* If Diff>=0, EAX=(EAX & 0)=0 else EAX=(EAX & -1)=EAX *)
     Add   EAX,EBX (* EAX= Minimum value between input n. *)
     Not   ECX     (* If Diff<0, ECX=0 else ECX=-1 *)
     And   EDX,ECX (* If Diff<0, EDX=(EDX & 0)=0 else EDX=(EDX & -1)=EDX *)
     Add   EBX,EDX (* EBX= Maximum value between input n. *)

Function Max42 can be written also as the next code for 80686+ processors, that require only 20 fast ASM registers' instructions:

Function Max42B(A,B,C,D:Integer):Integer; Assembler;

{ Input: EAX, EBX, ESI, EDI
 Output: EDI= 1° MAX(EAX,EBX,ESI,EDI).
         ESI= 2° MAX(EAX,EBX,ESI,EDI).
   Temp: ECX.

  EAX EDX ECX are 1°, 2° AND 3° PARAMETERs.
  Can freely modify the EAX, ECX, AND EDX REGISTERs. }

Asm

     Push   EBX
     Push   EDI
     Push   ESI

     Mov    EAX,A
     Mov    EBX,B
     Mov    ESI,C
     Mov    EDI,D

{ Input: EAX, EBX
 Output: EAX= MIN(EAX,EBX).
         EBX= MAX(EAX,EBX).
   Temp: ECX}

     Mov    ECX,EAX
     Cmp    EAX,EBX
     CMovGE EAX,EBX
     CMovGE EBX,ECX

{ Input: EBX, ESI
 Output: EBX= MIN(EBX,ESI).
         ESI= MAX(EBX,ESI).
   Temp: ECX}

     Mov    ECX,EBX
     Cmp    EBX,ESI
     CMovGE EBX,ESI
     CMovGE ESI,ECX

{ Input: ESI, EDI
 Output: ESI= MIN(ESI,EDI).
         EDI= MAX(ESI,EDI).
   Temp: ECX}

     Mov    ECX,ESI
     Cmp    ESI,EDI
     CMovGE ESI,EDI
     CMovGE EDI,ECX

{ Input: EAX, EBX
 Output: EAX= MIN(EAX,EBX).
         EBX= MAX(EAX,EBX).
   Temp: ECX}

     Mov    ECX,EAX
     Cmp    EAX,EBX
     CMovGE EAX,EBX
     CMovGE EBX,ECX

{ Input: EBX, ESI
 Output: EBX= MIN(EBX,ESI).
         ESI= MAX(EBX,ESI).
   Temp: ECX}

     Mov    ECX,EBX
     Cmp    EBX,ESI
     CMovGE EBX,ESI
     CMovGE ESI,ECX

{EDI contain the first maximum number;
 ESI contain the second maximum number}

     Mov   EAX,EDI

{------------------------}

     Pop   ESI
     Pop   EDI
     Pop   EBX

End;

Hi!

This would be a better answer if you explained how / why it works. It looks fairly efficient. Also note that `max(4 inputs)` wasn't the question. The question is to find the *2* largest inputs (and return their product). That's why my answer forms the 6 pairwise products and finds the max of that. — Peter Cordes, Sep 30 '17 at 23:15
Please to see the response 0 before this solution, thank you. — Paolo Fassin, Oct 01 '17 at 14:23
When 2 answers have the same score, it's random which order they sort in when you load the page. e.g. on my screen, your newer answer is below this one. (And if you reload, it probably changes for you, too.) So it's better to say "my other answer" or "@username's answer" instead of referring to them by number or position (or especially by vote-count, because that can obviously change.) — Peter Cordes, Oct 01 '17 at 14:35
I don't think it's useful to have 2 separate answers, since they're not *really* separate. Posting the explanation for one answer as a separate answer is not OK. [edit] them into one long answer. (I've posted separate answers when one is for AVX2, and the other is for AVX512 and uses instructions that aren't available in AVX2. But that's not what you're doing here.) — Peter Cordes, Oct 01 '17 at 14:39
I've created two new solution and I think that at least one of they is valid. Check it! I rewrote the article. — Paolo Fassin, Oct 03 '17 at 18:34
Yeah, that's an improvement. Thanks for tidying up your answers into one. Your final CMOV version is nice, using mov+cmp+2x cmov as a comparator like SSE `pmind` / `pmaxd` or `maxps` / `minps` (the same building block people use for branchless SIMD sorting networks). With a small change in register usage, you could arrange for the final result to be produced in `eax` and save a MOV. (And you forgot `imul eax, esi`.) Using `edx` instead of `ebx` would save a push/pop. — Peter Cordes, Oct 03 '17 at 18:59
I don't see any other savings from using `cmovnge` or any way to avoid the `mov` in each block. I didn't read it fully; if you're doing a full 4-element sorting network (https://en.wikipedia.org/wiki/Sorting_network), you can remove the CMOVs that produce the low 2 elements, and just let them be overwritten. Yeah, 5 comparators is enough for a full sort of 4 elements, according to wikipedia's sorting network article, so it's overkill here. — Peter Cordes, Oct 03 '17 at 19:02
Your non-cmov comparator is a bit clunky. You can [avoid a partial-register stall on Intel CPUs (pre-Ivybridge)](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and) by doing `xor ecx,ecx` / `Sub EAX,EBX` / `setge cl`, instead of `mov ecx,0` after the `sub`. — Peter Cordes, Oct 03 '17 at 19:06
I am very glad to have solved such an important problem (I assume both for matrix processing and graphic processing). I hope to be the sole holder of the rights to this software and to earn a little. Thanks for the vote. — Paolo Fassin, Oct 03 '17 at 19:30
I hope you're joking, but if not, this was already a solved problem. You just re-invented it on your own. See https://stackoverflow.com/a/7181993/224132 for example, and some other answers explicitly use `cmp`/`cmov` to implement a comparator. See also the other answers to [Fastest sort of fixed length 6 int array](https://stackoverflow.com/questions/2786899/fastest-sort-of-fixed-length-6-int-array) — Peter Cordes, Oct 03 '17 at 19:41
Yes, I'm joking, it's obvious. How can you invent something new today? However, it is interesting to have a good reputation on Stack Overflow. It can be useful. — Paolo Fassin, Oct 03 '17 at 20:18
Please see my routine for processing of 64 Bit integer number square root. Thank you! I've not found a specific question for my solution, so i have inserted it in an 80x86 question. See " x86 assembly (MASM) - Square root of a 64-bit integer?" ... — Paolo Fassin, Oct 03 '17 at 20:23

score 1 · Answer 3 · answered Oct 26 '16 at 18:02

A naive beginners way to find two max numbers (I hope this will get you unstuck on the reasoning, how to get second highest ... you simply search also for second highest, while searching for the highest):

    push    bp
    mov     bp,sp
    mov     ax,[bp+4]   ; temporary max1 = first argument
    mov     bx,8000h    ; temporary max2 = INT16_MIN
    ; max2 <= max1
    mov     dx,[bp+6]
    call    updateMax1Max2
    mov     dx,[bp+8]
    call    updateMax1Max2
    mov     dx,[bp+10]
    call    updateMax1Max2

    ; ax and bx contains here max1 and max2
    imul    bx            ; signed multiplication, all arguments are signed
    ; dx:ax = max1 * max2

    ; "mul" would produce wrong result for input data like -1, -2, -3, -4

    pop     bp
    ret

updateMax1Max2:
    ; dx is new number, [ax, bx] are current [max1, max2] (max2 <= max1)
    cmp     bx,dx       ; compare new value to lesser max2
    jge     updateMax1Max2_end
    mov     bx,dx       ; new max2
    cmp     ax,dx       ; compare new value to greater max1
    jge     updateMax1Max2_end  ; new max2 is already <= max1
    xchg    ax,bx       ; new value promoted to new max1, old max1 is now max2
updateMax1Max2_end:
    ret

It's keeping two temporary max values at the same time, for the price of a bit more complex update (testing new value not only against single max, but also against the second one).

Then it somewhat optimized by keeping the two temporaries in specified order, so when new value is lower than max2, it is discarded immediately, not testing against max1.

That complex "is the new value bigger than already kept max1/max2" code is put into separate sub-routine, so it can be reused several times.

And finally the initial state of [max1,max2] is set to [first_argument, INT16_MIN], so that sub-routine can be applied for the remaining three arguments in the simple way (getting the code complexity somewhat back by reusing the code a lot).

Peter's and Terje's suggestions provide great insight into advanced possibilities, but they also nicely demonstrate how performance asm coding can be tricky (as they both had to add errata to their original ideas).

When stuck or in doubt, try to do the most straightforward solution available (like you would solve it as human). Just try to keep number of instructions low (writing it in generic way, reusing any bigger part of code in sub-routines when possible), so it's easy to debug and comprehend.

Then feed that with several possible inputs, exercising also corner cases ([some example values], [INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN], [INT16_MAX, INT16_MAX, INT16_MAX, INT16_MAX], [-1, -2, -3, -4], [-2, -1, 0, INT16_MAX], etc...), and verify the results are correct (ideally in some code too, so you can rerun all the tests after next change to the routine).

This is the crucial step, which will save you from your original wrong assumptions, overlooking some corner case results. In ideal case don't even run your code directly, go straight into debugger and single step every of those test cases, to validate not only result, but also keep checking if the internal state during calculation is working as expected.

After that you may check for some "code golfing", how to exploit all the properties of the situation to lower the workload (simplifying the algorithm) and/or number of instructions and how to replace performance-hurting code with alternative faster approach.

I use "[pairs / quartets]" in text and comments with square brackets in mathematical sense, to work with the set (pair/quartet) of values. It has no relevance to the Assembly square brackets used in code to read memory content. — Ped7g, Oct 26 '16 at 18:08

x86 Assembly - 2 largest values out of 4 given numbers

3 Answers3

SSE4.1 PHMINPOSUW

Linked