
I have an application that requires pluggable modules to implement arbitrary functions from one array of bytes to another. Some of the functions could be computationally intensive. One example of a function might involve summing many numbers - each of which, in JavaScript, would require a BigNum... I can imagine an implementation in assembly language making extensive use of ADC (add with carry).
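To make that concrete, here is a rough sketch (untested; the module layout and the name $add128 are just placeholders, and it assumes multi-value results are available) of how a two-limb 128-bit addition has to be expressed in WASM without a carry flag - the carry out of the low limb is recovered with an unsigned comparison rather than an ADC:

```
(module
  (func $add128 (param $a_lo i64) (param $a_hi i64)
                (param $b_lo i64) (param $b_hi i64)
                (result i64 i64)   ;; returns (lo, hi)
    (local $lo i64)
    ;; lo = a_lo + b_lo  (wraps modulo 2^64)
    (local.set $lo (i64.add (local.get $a_lo) (local.get $b_lo)))
    ;; push lo, then hi = a_hi + b_hi + carry, where carry = 1 iff lo < a_lo (unsigned)
    (local.get $lo)
    (i64.add
      (i64.add (local.get $a_hi) (local.get $b_hi))
      (i64.extend_i32_u (i64.lt_u (local.get $lo) (local.get $a_lo))))))
```

For sums over many limbs the same idea repeats (with the extra wrinkle of also folding in the incoming carry), which is where a single ADC per limb would seem attractive.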

I found this post and comments from 2017 - much of which seems to echo what I'm thinking today. I am not aware of any progress towards supporting carry/overflow flags in WASM (or WASM2). Please correct me if I've overlooked something.

I'm aware that WASM is usually generated by compiling higher-level languages like Rust or C... but I envision hand-coding WebAssembly Text (WAT) to implement my plugins - so I'm not especially interested in the optimisations that LLVM etc. perform when compiling to WASM. I would like to discover a model of the computational cost of executing each WASM instruction... in order to guide the design of maximally efficient WASM modules - each of which will implement a relatively simple, but perhaps computationally expensive, algorithm. I'm aware that there are complications arising from the distinctions between interpretation, JIT and AOT compilation - and I know that different target hardware will have different characteristics in each scenario. Despite this, I feel it would be extremely useful to have at least an estimate of the relative execution cost of different WebAssembly code fragments. For example: what are the likely relative costs of i32.add and v128.add; is it cheaper to multiply by 2 or to shift left by 1... etc.
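For concreteness, the kinds of fragment pairs I have in mind look something like this (a rough sketch; the function names are placeholders, and the SIMD variant assumes 128-bit SIMD support, where the lane-wise form of the add is spelled i32x4.add):

```
(module
  ;; Doubling: multiply by 2 versus shift left by 1 - same result, possibly
  ;; different cost depending on how the engine lowers each one.
  (func $double_mul (param $x i32) (result i32)
    (i32.mul (local.get $x) (i32.const 2)))
  (func $double_shl (param $x i32) (result i32)
    (i32.shl (local.get $x) (i32.const 1)))

  ;; Scalar add versus its 128-bit SIMD counterpart (four i32 lanes at once).
  (func $add_scalar (param $a i32) (param $b i32) (result i32)
    (i32.add (local.get $a) (local.get $b)))
  (func $add_simd (param $a v128) (param $b v128) (result v128)
    (i32x4.add (local.get $a) (local.get $b))))
```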

Is solid information currently available? Are there efforts to implement benchmarks with the ambition of providing helpful execution cost estimates for a range of target hardware? Are there any tools that can provide execution cost estimates when given a WASM code fragment?

aSteve
  • WASM is still input for a JIT optimizing compiler, so I'd assume multiply by 2 can still get optimized to `a += a` or `a <<= 1`, whichever is cheaper for the target ISA. – Peter Cordes Apr 24 '23 at 22:28
  • Agree - a good WASM AOT/JIT compiler should pick the best 'doubling' instruction for the target architecture. It would be good to find out if this happens in practice... and how far this JIT/AOT optimisation can be pushed / relied upon. For example: can I rely upon the JIT/AOT compilation to detect when to use the carry flag and ADC (which are, almost certainly, available on the target ISA) even if my WASM expresses the arithmetic without explicit use of any carry flag (because WASM doesn't afford such instructions)? Do some WASM idioms optimise more readily than others? – aSteve Apr 25 '23 at 00:18
  • Yeah, that part of the question is more interesting. For ADC with both carry-in *and* carry-out, clang9 and later recognize `tmp = b+carry_in; sum = a+tmp; carry_out = (tmp…` – Peter Cordes Apr 25 '23 at 00:26
  • It should be possible to look at the target-specific JIT asm for some machine, using some implementation like V8. A cost-model is going to somewhat depend on the target CPU microarchitecture, since cost is measured in throughput and latency. (Not a single number you can just add up across machine instructions in a function: [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) / [What considerations go into predicting latency for operations on modern superscalar processors?](https://stackoverflow.com/q/51607391/224132)) – Peter Cordes Apr 25 '23 at 00:30
  • When I posted this question, I hoped to find execution cost information for WASM instructions comparable to that commonly available for hardware CPU instructions. I would like to be able to anticipate the relative performance of WASM-implemented functions... before benchmarking them on a convincing range of hardware, JIT/AOT compilers and interpreters. It seems, to me, that the absence of any guidance about the relative costs of WASM instructions is a shortcoming in the documentation of WASM... and that this represents a barrier to WASM adoption for some use cases. – aSteve May 01 '23 at 11:15
  • Given that they're inputs for JIT optimizers, it's not remotely plausible that you could nail down the exact throughput vs. latency cost of a left-shift by 2. When JITing for x86-64 or AArch64, it might happen literally for free as part of an add using the shift result, with a single instruction like x86 `lea eax, [rdi + rsi*4]` or AArch64 `add w0, w0, w1, lsl 2` (https://godbolt.org/z/1e5nq1oY3). Or for RISC-V, `a + (b<<2)` would take two separate instructions. So there's no portable exact cost we can put on the latency or front-end bandwidth cost of the two WASM instructions for it. – Peter Cordes May 01 '23 at 12:58
  • It sounds like the cost of evaluating WASM operations could be documented with upper and lower cost bounds for each target architecture (perhaps on a per JIT compiler implementation basis). More important to me than exact execution times for each WASM instruction would be an (approximate) model of the expected relative costs of WASM instructions. Without any execution cost model for WASM instructions, it is hard to objectively compare relative efficiency for two distinct WASM implementations of a function. [P.S. ta for the godbolt.org link - interesting resource.] – aSteve May 01 '23 at 15:16
  • @aSteve You don't get such a model as instruction cost varies wildly between different architectures and WASM compilers/interpreters. – fuz May 11 '23 at 15:02
  • @fuz I'm aware that a satisfactory model will not be of the trivial kind we see for traditional CPU instruction sets... but it follows neither that there can be no model... nor that there can be no meaningful information about the relative cost of WASM instructions. I think my question about an execution cost model remains relevant. Without any such model, it becomes impossible to reason about the performance of anything that is written using, or is compiled to, WASM instructions. Absence of an execution cost model undermines WASM in many contexts. – aSteve May 21 '23 at 14:17
  • @aSteve Yes, it is largely impossible to reason about the performance of WASM code in the same way you would reason about conventional assembly code. Instructions that are fast on one architecture may be dead slow on another and vice versa. And it may even vary on the same system depending on the mood of the JIT. – fuz May 21 '23 at 15:00
  • @fuz I understand your perspective... I hope you also understand mine. Without any cost model for execution of WASM, it becomes completely impossible to reason about performance. In such circumstances, even algorithmic complexity would be rendered irrelevant. This would be silly... it seems obvious that some sort of model must be assumed by anyone wishing to use WASM seriously. I was hoping to discover documentation of one or more such models. If none are documented, that's a significant WASM (and/or WASM implementation) shortcoming. – aSteve May 21 '23 at 17:41
  • @aSteve There's no need to repeat your arguments, I have understood them the first time. But yes: there is no cost model and it is impossible to reason about performance. You can only use common knowledge to optimise for the general case. You know, the same way programmers of high-level languages do. How do you expect a cost model to be given when WASM is transpiled to various architectures, each of which has a different cost model and possibly optimised, changing instruction cost dynamically in essentially unpredictable ways? – fuz May 21 '23 at 17:48
  • I know that you want a cost model really badly, but WASM by design does not and cannot have one. In fact, I have never seen a byte code with any sort of performance model. It's just not possible. You presenting arguments why it would make sense to have one doesn't change the fact that it does not exist and cannot exist. – fuz May 21 '23 at 17:52
  • @fuz :-) High level languages don't offer the models you imagine. They do have models in the sense that capable programmers do have an understanding of the performance merits of one code fragment relative to another. Programmers in C, for example, are not completely oblivious to the performance implications of source code changes until they are benchmarked. Many decisions, in practice, are made without even knowing the target architectures to which the code may be compiled in future. You present the difficulty of establishing a perfect cost model as the impossibility of any model whatsoever. – aSteve May 21 '23 at 18:24
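As a WAT rendering of the shift-plus-add example from the comment thread above (a sketch; the name $scaled_add is a placeholder, and whether a given engine folds the two operations into one machine instruction is implementation-dependent):

```
(module
  ;; a + (b << 2): two WASM instructions. A JIT targeting x86-64 or AArch64 may
  ;; fold them into a single machine instruction (e.g. lea, or add with a
  ;; shifted operand), while a RISC-V backend would likely emit two instructions.
  (func $scaled_add (param $a i32) (param $b i32) (result i32)
    (i32.add (local.get $a)
             (i32.shl (local.get $b) (i32.const 2)))))
```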

0 Answers