-1

Let's say we have a given string of chars

DataString DB 'AGIJKSZ', 0FFH ; 

What would be the most time-effective procedure to find let's say J in it? By time-effective I mean least amount of clock ticks.

It's a x86 processor with these instruction sets:

MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX, AVX2, FMA3, TSX

Let's assume that both string and searched character can be changed but only via editing the code, and we're always looking for a single character. The string is ASCII. End of string is marked by FF

Answer should be just setting EAX to found 1 / not found 0.

This is what I can think of

FindChar_1 PROC
 MOV ESI, OFFSET DataString ; 
SI
 MOV AH, 'J' ; 
Check_End:
 CMP BYTE PTR [ESI], 0FFH ; 
 JE Not_Find ; 
 CMP AH, [ESI] ; 
'DataString'
 JE Got_Equal ;
 ADD ESI, 1 ; 
 JMP Check_End ;
Got_Equal:
 MOV DL, [ESI] ; 
 JMP Done
Not_Find:
 MOV EAX,0 ; 
 RET ;
Done:
 MOV EAX,1 ; 
 RET ; 
FindChar_1 ENDP

EDIT:

Now I realize that there I something else I should have mentioned. I'm using masm32 so instructions I can use are limited to the very basic ones.

Bo Persson
  • 90,663
  • 31
  • 146
  • 203
Jarek N
  • 55
  • 9
  • On what CPU? x86 with SSE2? AVX2? What microarchitecture? Is the search pattern a fixed constant? Or is the string a fixed constant (like you've shown here) and the character to find is variable? Or are both runtime variables? Is the string-length variable? If it's long, is it hot in cache or not? **Is the search pattern always a single byte?** Optimizing `strchr` is very different function from `strstr`. Is the string ASCII, or is it a variable-width encoding like UTF-8? – Peter Cordes Mar 12 '18 at 10:15
  • 2
    tl;dr this is *far* too broad, and it's not even clear what problem you're asking. I mean the most efficient thing in this case is `Jpos equ 4`, if both the pattern and string are constants, then the result is a constant. (That's the one case where the speed doesn't depend on the CPU :P) – Peter Cordes Mar 12 '18 at 10:19
  • Ok, so a CPU with AVX2. Ryzen, Excavator, Haswell, Broadwell, Skylake, or Knight's Landing? Different microarchitectures will take different numbers of clock cycles for different operations, and have different ability to do out-of-order execution of the search with surrounding code. (http://agner.org/optimize, and https://stackoverflow.com/tags/x86/info). Does the surrounding code bottleneck on the latency of the string search, or on total uop throughput? i.e. **are you optimizing for latency or throughput?** Total clock cycles isn't the sum of clocks for each part of your code. – Peter Cordes Mar 12 '18 at 10:50
  • 1
    How is the string's length determined? Is it NULL terminated? Does it have a fixed length? Are we allowed to read past the end of the string? Is it known in advance that the character will be present? – fuz Mar 12 '18 at 10:54
  • 3
    You said that both string and search pattern can be build-time constants, but that makes the problem stupid: hard-code the answer as `JPos equ 4`. So what assumptions can we make? Is the string aligned to 16 or 32 bytes? Or at least, can we assume that loading a vector won't segfault? What form do you need the result in? A pointer in an integer register? A bitmap in a vector? A true/false condition without needing to know where it is? Branch on the condition? – Peter Cordes Mar 12 '18 at 10:55
  • Can the string be longer than 32 bytes (one YMM register)? If not, other than hardcoding the answer, `vpbroadcastd ymm0, [4 bytes of JJJJ]` / `vpcmpeqb ymm0, ymm0, [DataString]` / `vpmovmskb eax, ymm0` / `tzcnt eax, eax` gives you the index of the first match in the string (or 32 for not fount). After loading the inputs from memory, that's 1 + 2 + 3 = 6 cycle latency and 3 uops on Haswell/Skylake. Is that what you're looking for, or do you want a function that loops over memory? – Peter Cordes Mar 12 '18 at 11:02
  • If you search the source string often, and it does change rarely, you may pre-calculate `letterFirstPosition['A'..'Z']` look up table, so you will then "search" by doing O(1) `return letterFirstPosition['J'];` - (the lookup table initialization is O(N) where N = source string length, plus you have to clear the table for letters which are not available at all) ... If the source string does change a lot then you can probably by SSE detect the letter on multiple positions at once, but not sure if it's worth it (only if the source string is long enough), otherwise your O(N) looks reasonable. – Ped7g Mar 12 '18 at 14:24
  • *"only via editing the code"* - wait, you mean like source code? Then write it in C++ as compile time `constexpr`, to avoid any calculation and just return 0/1 constants if the letter was in the source string. (i.e. the task doesn't make sense, if it's hardcoded in source, and you are free to spend enough time on compilation, and you have language which allows to write similar task to be resolved at compilation time, like C++ does. You can't beat the resulting `mov eax,1` `ret` ... – Ped7g Mar 12 '18 at 14:29
  • I know the task itself isn't very logical, but that's just for learning purposes. Also, it would be that different if I would make the word a variable, wouldn't it? – Jarek N Mar 12 '18 at 16:48
  • @Ped7g: It's totally worth it to use SSE2, especially if you can pad the string with zeros out to 16 bytes. See my comment before yours: 6 cycle latency (not including the load) with no branching to get the integer index of the first occurrence of `J` in a string up to 16 bytes long. (And Jarek, yes, my vectorized version is good for a variable string. With a fixed string but variable search character, you'd want what Ped7g suggested: a lookup table of the result instead of actually searching! If one string is searched many times, you can pre-compute all kinds of stuff about it.) – Peter Cordes Mar 12 '18 at 23:45
  • *I'm using masm32 so instructions I can use are limited to the very basic ones.* How basic? SSE2 (`pcmpeqb` / `pmovmskb`) is over 15 years old (and is baseline for x86-64), and MMX is about 20 years old. The first part of your question says your CPU has AVX2, so you'll need to use an assembler that doesn't suck (like NASM or YASM) for an answer to what you asked in the first part. – Peter Cordes Mar 12 '18 at 23:52
  • Or with just integer instructions, there are bit-hacks to do this with integer registers, e.g. check if any byte of a word is zero: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. That will let you write a function that runs faster than 1 byte per clock cycle. (Your current version isn't even an efficient implementation of 1-byte-at-a-time, though. Load into a register instead of using the same memory operand multiple times. And [avoid `AH`](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to)). – Peter Cordes Mar 12 '18 at 23:54
  • I get `missing operator in expression` on `vpbroadcastd ymm0, [4 bytes of JJJJ]` – Jarek N Mar 13 '18 at 05:38

1 Answers1

-1

When you need the code be fast, avoid memory access, jumps and complex instructions. I would scan the string twice: once to find the end marker, and then to find the searched character:

FindChar_2 PROC
MOV ESI, OFFSET DataString
XOR EAX,EAX
XOR ECX,ECX
XOR EDX,EDX
NOT EAX ; Let AL=EndOfString marker.
NOT ECX ; Let ECX=Max.integer.
MOV EDI,ESI
CLD
REPNE SCASB ; ECX -= String size (-9 in this example).
NOT ECX ; ECX= String size (8).
MOV AL,'J' ; The searched needle.
MOV EDI,ESI ; Restore the haystack pointer.
REPNE SCASB ; The actual search. 
  ; It returns CF if not found, because the needle is below 0FFH.
CMC ; Invert the logic.
MOV EAX,EDX ; Return false.
ADC EAX,EDX ; Return true if found.
RET
FindChar_2 ENDP
vitsoft
  • 5,515
  • 1
  • 18
  • 31
  • If you want it to be fast, you should use `mov eax,-1` instead of `xor`-zero/`not`. – Peter Cordes Mar 13 '18 at 07:50
  • 1
    This is nowhere near good in any kind of absolute sense, especially not on a CPU with MMX (let alone SSE2 or AVX2). But even without, it's pretty crap to scan the string slowly twice. `repne scasb` only checks one byte at a time on current CPUs; only `rep movs` and `rep stos` have "fast strings" optimizations. According to http://agner.org/optimize/, `rep scas` for `n` counts takes >= `2n` cycles, so your code for a string of length `n` would take on the order of `4n` cycles for medium to long strings. The OP's code has a lot of memory refs, but probably runs ~1.5 cycles per byte on Haswell. – Peter Cordes Mar 13 '18 at 07:58
  • For short strings, the question is whether a branch mispredict costs more than any setup / startup overhead `rep scas` does. if it's anything like `rep movs` (https://stackoverflow.com/q/33902068/), that's worse than a branch miss. Referencing the same memory operand twice inside the loop is dumb, but your code loops separately over the string, all the way to the end even if the target character is early! Those memory refs don't hurt much on a CPU that can do 2 loads per clock from L1d cache. (Both are too slow to bottleneck on memory even for large strings.) – Peter Cordes Mar 13 '18 at 08:02
  • Anyway yes the OP's code is crap, but I think this is worse. The OP's code might run more like one per 3 cycles, bottlenecked on branch throughput, on Sandybridge. Haswell added an extra execution port for not-taken branches, so I it could chew through those 5 fused-domain uops in 1.5 cycles per iteration. It could easily run 1 cycle per iteration if written like a `do{}while()`, but this answer has little room for improvement other than a rewrite from scratch. – Peter Cordes Mar 13 '18 at 08:05
  • Would my solution have less total clock cycles than this? – Jarek N Mar 13 '18 at 10:25