1

How can I give nom's take_while an upper limit on the number of characters it should match? I want to take characters according to a certain condition, but at most N.

Since this will be a very performance critical part of the parser, I'd like to avoid using a bunch of individual take(1usize). Partly, because it feels awkward having to deal with single element slices one-by-one, but also because the compiler probably cannot know at compile time that they must have a size of 1, i.e., it will likely have to generate bound checks or size assertions, right?

Conceptually, a take_while with an upper limit would feel more appropriate. I was hoping I can do the counting with a mutable loop variable myself like so:

let mut i = 0;

let (input, _) = take_while(|c| {
    i += 1;
    c >= 0x80 && i < 9
})(input)?;

In fact, I could even extract the necessary information in the closure, which most like should result in a very efficient code generation. (The goal is to implement a certain kind of VarInt LSB encoding, i.e., I could update a local mutable variable x = x + (if last_byte { c } else { c & 0x7f }) << 7.)

Unfortunately take_while seems to allow only Fn and not FnMut which probably means this is impossible (why the limitation?). What else could I do to implement that nicely?

bluenote10
  • 23,414
  • 14
  • 122
  • 178
  • Maybe the solution is to encode the position of the token you want to apply your condition on directly in the token, with an `enumerate`-like transformation of your input? – jthulhu Apr 16 '22 at 14:43
  • Wouldn't [`take_while_m_n`](https://docs.rs/nom/latest/nom/bytes/complete/fn.take_while_m_n.html) fit your use case? – SirDarius Apr 16 '22 at 14:44
  • @SirDarius Indeed, I missed `take_while_m_n`, but I don't think it would solve the use case, because the condition in the last byte actually doesn't have to be true! Apparently `take_while_m_n` requires the condition to be satisfied over `m <= len <= n`. Also since I have to do the same bit masking again to compute the actual varint, I would think doing it all in one loop is faster than having to loop over the returned buffer again, doing similar work twice. But that requires benchmarking of course. – bluenote10 Apr 16 '22 at 14:52
  • Well actually I could consume the last byte manually separately, but its very awkward compared to `FnMut` closure solution. – bluenote10 Apr 16 '22 at 14:56

2 Answers2

3

Use a Cell to make your closure have interior mutability. Then it can have mutable state, but still implement Fn:

let i = Cell::new(0);

let (input, _) = take_while(|c| {
    i.set(i.get() + 1);
    c > 0x80 && i.get() < 9
})(input)?;
cdhowie
  • 158,093
  • 24
  • 286
  • 300
  • Thanks, didn't know that trick ([related question](https://stackoverflow.com/a/38677845/1804173)), this seems to do the job well. I take it, `Cell` is a zero-cost abstraction (at least this [comment](https://users.rust-lang.org/t/is-refcell-a-zero-cost-abstraction/61888/9) says so), and I don't have to worry about bypassing `Fn` here? For the curious: [resulting varint parser implementation](https://github.com/bluenote10/RustExperiments/blob/4e0d9dfa3a3922894fa169cf32cdcab05c1bb982/SerdeChecks/src/custom_file_format/deserialize.rs#L8-L31). – bluenote10 Apr 16 '22 at 16:19
  • 2
    @bluenote10 `Cell` can inhibit certain optimizations by permitting aliased modification, but that won't matter most of the time, and it shouldn't have any effects other than that. Also, you're not bypassing `Fn` here- you're still fulfilling the required contract, even if it feels weird. – Aiden4 Apr 16 '22 at 17:38
1

There is a nom-function take_while_m_n:

const N: usize = ...

fn take_native(input: &[u8]) -> IResult<&[u8], &[u8]> {
    take_while_m_n(0, N, |c| c > 0x80)(input)
}

However, with quick benchmark it seems to be remarkably slower than the Cell -answer (or benchmark optimizes wrongly, since the Cell -version takes only 1ns/iter compared to 13ns/iter for the take_while_m_n).

#![feature(test)]
extern crate test;

use std::cell::Cell;

use nom::{
    bytes::complete::{take_while, take_while_m_n},
    IResult,
};

fn take_cell(input: &[u8]) -> IResult<&[u8], &[u8]> {
    let i = Cell::new(0);

    let (input, output) = take_while(|c| {
        i.set(i.get() + 1);
        c > 0x80 && i.get() < 5
    })(input)?;

    Ok((input, output))
}

fn take_native(input: &[u8]) -> IResult<&[u8], &[u8]> {
    take_while_m_n(0, 4, |c| c > 0x80)(input)
}

#[cfg(test)]
mod tests {
    use super::*;

    const INPUT: &[u8] = &[
        0x81, 0x82, 0x83, 0x84, 0x81, 0x82, 0x83, 0x84, 0x81, 0x82, 0x83, 0x84, 0x81, 0x82, 0x83,
        0x84, 0x81, 0x82, 0x83, 0x84, 0x81, 0x82, 0x83, 0x84, 0x81, 0x82, 0x83, 0x84,
    ];

    #[bench]
    fn bench_cell(b: &mut test::Bencher) {
        assert_eq!(take_cell(INPUT).unwrap().1, &[0x81, 0x82, 0x83, 0x84]);

        b.iter(|| take_cell(INPUT).unwrap());
    }

    #[bench]
    fn bench_native(b: &mut test::Bencher) {
        assert_eq!(take_native(INPUT).unwrap().1, &[0x81, 0x82, 0x83, 0x84]);

        b.iter(|| take_native(INPUT).unwrap());
    }
}
Mika Vatanen
  • 3,782
  • 1
  • 28
  • 33