It's normal for a line of terminal input to include the terminating newline. If RARS doesn't allow the user to "submit" input without a newline, you could just zero the last byte. But the RARS read-string `ecall` very inconveniently doesn't return a length, so searching for a `\0` is no better than just searching for a `\n`.
(A Unix `read` system call does return a length. RARS has that as `ecall` #63, `read`, which returns the length in `a0`, so you could use that to read input if it allows fd=0 for stdin.)
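To illustrate why a returned length helps, here is a minimal sketch in C of the same idea using the POSIX `read` call. The helper names `terminate_line` and `read_line` are hypothetical, and this says nothing about how RARS behaves; it only shows that once you know the byte count, stripping the newline needs no search at all.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Given n bytes already in buf (with room for at least n+1), drop one
   trailing '\n' if present, NUL-terminate, and return the new length. */
static ssize_t terminate_line(char *buf, ssize_t n)
{
    if (n > 0 && buf[n - 1] == '\n')
        n--;                      /* the newline is the last byte read */
    buf[n] = '\0';
    return n;
}

/* Read up to cap-1 bytes from fd (cap must be > 0) and terminate the line.
   Returns the line length, or -1 if read() failed. */
static ssize_t read_line(int fd, char *buf, size_t cap)
{
    ssize_t n = read(fd, buf, cap - 1);
    if (n < 0)
        return -1;
    return terminate_line(buf, n);
}
```

With fd 0 this reads from stdin; the only byte inspected is the last one.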
Loop efficiency
You're only doing one byte per loop iteration; the only thing you're saving is a byte-load every iteration (`lb`), at the expense of a lot more ALU work.
The simple way looks like this, and is probably faster on most real-world RISC-V machines. (Especially if they have any cache, which makes it cheap to do multiple nearby loads instead of one wider load.) Unrolling some to hide load latency might be a good idea for high-performance in-order machines, if you really care about optimizing this loop for potentially large inputs. (Which you shouldn't for this use-case since it only runs once per user-input, so just keep it compact for code-size.)
        li   t1, '\n'
    loop:                       # do {
        lbu  t0, (a0)
        addi a0, a0, 1
        bne  t0, t1, loop       # } while (*p++ != '\n');
    # assume the string will *always* contain a newline,
    # otherwise check for 0 as well
        sb   zero, -1(a0)       # overwrite the newline with a terminating 0
    # a0 now points one past the terminating 0, so if you want
    # the string length, it's (a0 - 1) - start_of_buffer
But there's more to say about the design choices of your word-at-a-time loop:
Since RISC-V has a byte-store instruction, you don't need to mask the word where you found a newline and store the whole word. Just do `sb x0, (position)` at the position where you found the newline, even if you find that position by incrementing a counter for every inner-loop shift count (which should also simplify that loop).
Also, storing a whole word is especially bad if your buffer isn't a whole number of aligned words: you don't want to do a non-atomic RMW of bytes past the end of your buffer. That's a very bad habit for thread-safety. (See also Erik's answer re: possible downsides of word-at-a-time in general, and Is it safe to read past the end of a buffer within the same page on x86 and x64?)
(If you were going to mask a word and store it, use `not` instead of `neg` / `addi -1` to invert the bits in your mask. `not` is a pseudo-instruction for `xori` with `-1`. In general, you can ask a compiler for stuff like that, e.g. https://godbolt.org/z/EPGYGosKd shows how clang implements `x & ~mask` for RISC-V.)
Fast word-at-a-time
To actually quickly check a whole word at a time for a newline byte, do `word ^ 0x0a0a0a0a` to map that byte value to 0, and other values to non-zero. Then use the bithack for finding if a word has a zero byte: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. (Like what glibc's portable-C fallback `strlen` does: Why does glibc's strlen need to be so complicated to run quickly?). The `haszero` form on that page is exact as a yes/no test, but cheaper variants that skip the `& ~v` step can give false positives, and even the exact form can flag bytes above the first match because of borrow propagation. So you'd want to quickly check a whole word, then loop over its bytes checking one at a time to find the match (or to make sure there is one). If none, go back into the word loop.
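The XOR-then-zero-byte test can be sketched in C like this, using the `haszero` expression from the bithacks page. The function name `word_has_newline` is hypothetical, and the constants assume 32-bit words (RV32); on RV64 you would widen them to 64 bits.

```c
#include <stdint.h>

/* Returns nonzero if any of the four bytes of w equals '\n' (0x0a). */
static int word_has_newline(uint32_t w)
{
    uint32_t x = w ^ 0x0a0a0a0aU;   /* newline bytes become 0x00 */
    /* haszero(x): the high bit of a byte survives only if that byte was
       0x00, or if a borrow from a lower zero byte reached it. */
    return ((x - 0x01010101U) & ~x & 0x80808080U) != 0;
}
```

This only answers whether some byte in the word matches; a short byte loop is still needed to find which one. In RISC-V terms it is one `xor`, one `add` of a negated constant, one `not`, and two `and`s per word, with the three constants held in registers outside the loop.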
Of course even better would be if you had some SIMD support for doing 4 or 8 (or 16) byte-compares in parallel, with RV32 P (packed-SIMD) or RV32 V (vector) extensions.
If you're doing this on a buffer you didn't allocate, you'd probably want to do one unaligned load (after checking it's not going to cross a page or maybe cache-line boundary), then get to an alignment boundary for aligned word loads. Or loop byte-at-a-time until a word boundary. (Or double-word on RV64).
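As a rough sketch of the "byte-at-a-time until a word boundary" option, here is one way the whole search could look in C. The name `find_newline` and the explicit length parameter `n` are assumptions for illustration; taking a length keeps every word load inside the buffer, so this version never reads past the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return a pointer to the first '\n' in buf[0..n), or NULL if none. */
static char *find_newline(char *buf, size_t n)
{
    char *p = buf, *end = buf + n;

    /* Byte-at-a-time until p is 4-byte aligned. */
    while (p < end && ((uintptr_t)p & 3) != 0) {
        if (*p == '\n')
            return p;
        p++;
    }
    /* Aligned word loads, only while a whole word fits in the buffer. */
    while ((size_t)(end - p) >= 4) {
        uint32_t w;
        memcpy(&w, p, 4);   /* avoids aliasing problems; typically one load */
        uint32_t x = w ^ 0x0a0a0a0aU;
        if ((x - 0x01010101U) & ~x & 0x80808080U)
            break;          /* some byte in this word is '\n' */
        p += 4;
    }
    /* Byte loop: locate the match in the flagged word, or scan the tail. */
    for (; p < end; p++)
        if (*p == '\n')
            return p;
    return NULL;
}
```

Once the pointer is found, a single byte store (`*nl = '\0'`, i.e. `sb`) replaces the newline, so no word is ever written back.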