How to parse slightly ambiguous data using nom?

Question

In RFC1738, the BNF for domainlabel is the following:

domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit

That is, it's either an alphadigit, or it's a string where the first/last characters have to be an alphadigit but the intermediate characters can be an alphadigit or a dash.

How do I implement this with nom? Ignoring the single character scenario to simplify the case, my final attempt is:

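// Imports (nom 5.x).
use nom::bytes::complete::{take_while, take_while_m_n};
use nom::character::is_alphanumeric;
use nom::sequence::tuple;
use nom::IResult;

fn domain_label(s: &[u8]) -> IResult<&[u8], (&[u8], &[u8], &[u8])> {
    let left = take_while_m_n(1, 1, is_alphanumeric);
    let middle = take_while(|c| is_alphanumeric(c) || c == b'-');
    let right = take_while_m_n(1, 1, is_alphanumeric);
    let whole = tuple((left, middle, right));
    whole(s)
}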

The problem with this is that middle is greedy: it consumes every remaining alphanumeric character, including the last one, so right fails because there is nothing left to consume.

println!("{:?}", domain_label(b"abcde"));
Err(Error(([], TakeWhileMN)))

Parsers should be able to attempt all possible consumption paths, but how do I do this with nom?

edwardw · Accepted Answer · 2019-09-20T07:09:54.317

domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit

Read another way, this is a series of alphanumeric sequences delimited by any number of "-" characters. So here is one way to do it:

use nom::bytes::complete::{tag, take_while1};
use nom::character::is_alphanumeric;
use nom::combinator::recognize;
use nom::multi::{many1, separated_list};
use nom::IResult;

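fn domain_label(input: &[u8]) -> IResult<&[u8], &[u8]> {
    // One or more alphanumeric bytes.
    let alphadigits = take_while1(is_alphanumeric);
    // One or more consecutive dashes between alphanumeric runs.
    let delimiter = many1(tag(b"-"));
    let parser = separated_list(delimiter, alphadigits);

    // recognize discards the list of parts and returns the whole matched slice.
    recognize(parser)(input)
}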

fn main() {
    let (_, res) = domain_label(b"abcde").unwrap();
    assert_eq!(res, b"abcde");
    let (_, res) = domain_label(b"abcde-123-xyz-").unwrap();
    assert_eq!(res, b"abcde-123-xyz");
    let (_, res) = domain_label(b"rust-lang--1---37---0.org").unwrap();
    assert_eq!(res, b"rust-lang--1---37---0");
}

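Notice that you don't need the individual parts of a successful parse. The result is just the longest prefix of the input that conforms to the domain label BNF. That's where the recognize combinator comes in: it returns the slice of input consumed by the inner parser rather than the inner parser's own output.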
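To make the "longest conforming prefix" idea concrete, here is a minimal hand-rolled sketch that uses only the standard library, not nom. The name longest_domain_label is hypothetical and not part of any crate. It scans greedily over alphadigits and dashes, then gives back any trailing dashes so the label ends on an alphadigit. Unlike the separated_list version above, which as far as I can tell also succeeds with an empty match, this sketch requires at least one alphadigit, as the BNF does.

```rust
/// Hypothetical helper (not a nom API): split `input` into (rest, label),
/// where `label` is the longest prefix that starts and ends with an
/// alphadigit and contains only alphadigits and dashes in between.
fn longest_domain_label(input: &[u8]) -> Option<(&[u8], &[u8])> {
    // The label must start with an alphadigit.
    if !input.first()?.is_ascii_alphanumeric() {
        return None;
    }
    // Greedily scan over alphadigits and dashes...
    let run = input
        .iter()
        .take_while(|&&c| c.is_ascii_alphanumeric() || c == b'-')
        .count();
    // ...then give back trailing dashes so the label ends with an alphadigit.
    let end = input[..run]
        .iter()
        .rposition(|c| c.is_ascii_alphanumeric())?
        + 1;
    Some((&input[end..], &input[..end]))
}
```

Called on b"rust-lang--1---37---0.org", this should return the label b"rust-lang--1---37---0" with b".org" left over, mirroring the third test case above.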

edited Sep 20 '19 at 07:09

answered Sep 13 '19 at 14:48

Thanks. That solves the immediate problem, but it sidesteps the issue I noted, which is that ambiguous parsing could require backtracking. Does nom do this, or must one rely on converting the grammar into a form that doesn't require backtracking? – Listerone Sep 13 '19 at 15:16
@Listerone, in this specific case there's no ambiguity. You may also notice my third test case, which doesn't consume all of the input. You can continue parsing whatever is left with any parser you wish. That's also the strategy for dealing with failure or ambiguity: you try other options on the remaining input. Only when there are no options left, or all the input has been consumed without success, is it a hard failure. – edwardw Sep 13 '19 at 15:26
@Listerone there's also [`peek`](https://docs.rs/nom/5.0.1/nom/combinator/fn.peek.html) combinator that is specifically designed to **not** consume the input. – edwardw Sep 13 '19 at 15:34
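
On the backtracking question raised in these comments: as the failing example in the question shows, tuple does not go back and retry middle with a shorter match once right fails. If the three-part (left, middle, right) shape is really wanted, the backtracking has to be written by hand. Below is a minimal sketch of that idea using only the standard library, not nom; domain_label_parts and is_alphadigit are hypothetical names. Like the question, it ignores the single-character case.

```rust
// Hypothetical helper: the "alphadigit" class from the BNF.
fn is_alphadigit(c: u8) -> bool {
    c.is_ascii_alphanumeric()
}

/// Hypothetical sketch of explicit backtracking (not a nom API).
/// Returns (rest, (left, middle, right)) or None if no split works.
fn domain_label_parts(s: &[u8]) -> Option<(&[u8], (&[u8], &[u8], &[u8]))> {
    // left: exactly one alphadigit.
    let (&first, _) = s.split_first()?;
    if !is_alphadigit(first) {
        return None;
    }
    // Greedy middle: the longest run of alphadigits and dashes after `left`.
    let greedy = s[1..]
        .iter()
        .take_while(|&&c| is_alphadigit(c) || c == b'-')
        .count();
    // Backtrack: shrink `middle` one byte at a time until the byte right
    // after it is an alphadigit that `right` can consume.
    for mid_len in (0..greedy).rev() {
        let right_at = 1 + mid_len;
        if is_alphadigit(s[right_at]) {
            let parts = (&s[..1], &s[1..right_at], &s[right_at..right_at + 1]);
            return Some((&s[right_at + 1..], parts));
        }
    }
    None
}
```

For b"abcde" this yields left b"a", middle b"bcd" and right b"e", which is the split the original tuple-based attempt could not find.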
