4

I'm working on a java.util.Scanner-like input reader for Rust, both as a learning project and because I haven't seen a powerful text input handler for Rust that I'm happy with.

The problem with regular language delimiters, outlined in issue #2 on my repository, is that they may be arbitrarily long, and so I'm not sure how to handle the case where the input buffer on BufRead does not fully contain either the start or the end delimiter.

In the OpenJDK 1.7 implementation, they take advantage of the fact that the regex engine supports partial matching, i.e., they are able to ask, "Is the input string a prefix to a member of the RegEx's language?". If this is the case, they wait for more input.

It seems to me that I cannot solve this problem without prefix matching, because otherwise I have to read the entire file into the buffer in order to prove that there is not a match to the regex. Specifically, this problem is when searching for the last prefixed delimiter: it has no memory impact whatsoever when searching for the terminating delimiter.

Note that because I am accepting arbitrary regexes from the API's users, I am not aware of a way to construct a regex that matches prefixes to words in the given regex's language. If someone knows how to do this algorithmically, I would accept that as the solution.

If there is a solution without partial matching of regexes, it is welcome as well.

For example:

Edit: My latest commit passes the first of these tests, but still requires partial matching to tackle the other two under my current line of thought.

/// This test will fail if we cannot read past the length of the buffer.
/// The buffer size is four characters, so it will read "hell". If we do
/// not continue past the buffer, then it is interpreted as if we have
/// reached EOF. This affects searching for the terminating delimiter.
#[test]
fn buffer_ends_before_delim() {
    let string: &[u8] = b"hello world";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);

    assert_eq!(test.next(), Some(String::from("hello")));
}

/// This test will fail if we do not solve the above problem in a way that
/// preserves the tail of the original buffer, because in this test case the
/// terminating delimiter begins within the first buffer size and
/// ends within the second.
#[test]
fn buffer_ends_within_delim() {
    let string: &[u8] = b"foo  bar";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim_str("  ");

    assert_eq!(test.next(), Some(String::from("foo")));
}

/// This test will fail if we cannot detect partial matches of the delimiter
/// when skipping over prefixed delimiters. Because the buffer size is 4, it
/// will read "aaaa", which is not in the language of /a+b/, however the
/// automaton is not in a dead state either: reading a "b" would put us in
/// an accepting state, thus we must read more input to know if the regex is
/// satisfied. Reading an additional character will result in "aaaab", which
/// is a valid delimiter in this language and should therefore be skipped.
#[test]
fn buffer_ends_within_start_delim() {
    let string: &[u8] = b"aaaabfoo";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim(Regex::new(r"a+b").unwrap());

    assert_eq!(test.next(), Some(String::from("foo")));
}
hxtk
  • 325
  • 1
  • 9
  • Some examples would be helpful. Are you talking about recursion, some sort of balance ? Like .. end> ? –  Feb 02 '18 at 18:09
  • `otherwise I have to read the entire file into the buffer in order to prove that there is not a match to the regex` Well, even if you read in discrete blocks, you'd still have to eventually read the entire file to find that you have no match. –  Feb 02 '18 at 18:11
  • 4
    It's hard to tell without examples, but this sounds like a use case for searching on text streams: https://github.com/rust-lang/regex/issues/425 --- it's something we want, but it will be quite some time before it's in the `regex` crate proper. – BurntSushi5 Feb 02 '18 at 18:14
  • I've added the failing test cases to my post. With some thinking I can probably figure out the first two on my own, but I'm utterly lost on the third one. – hxtk Feb 02 '18 at 18:22
  • @sin that only applies to the terminating delimiter, and in that case I'm okay with it because then the entire file is the correct return value. With the starting delimiter, if the entire buffer is not a prefix to a word in the delimiting language then I have shown that the prefixing delimiter(s) terminate within the buffer. – hxtk Feb 02 '18 at 18:30
  • 1
    I'm not quite sure I understand your question. You seem to be aware of what partial matching is, and that some regex engines support this feature (PCRE, Boost, Java). Are you looking for a solution using the Rust regex library, or are you building your own regex engine? If that helps, I wrote a [messy hack for JavaScript](https://stackoverflow.com/a/41580048/3764814) which converts a regex pattern into a partially-matching one, but it only works because JS has a stupidly simple regex engine, and fails to split backreferences. To get the real deal, you need an option in your regex engine. – Lucas Trzesniewski Feb 04 '18 at 01:11
  • @Lucas, I'm not terribly concerned with how the solution is acheived. Simpler is better, but I'd like to avoid introducing poorly-maintained dependencies and staying in pure rust is a requirement. I was hoping to avoid writing my own regex engine because I suspect that will more than double the size of my codebase, but if someone makes a convincing argument that it's the easiest way then I will. I think you may have enabled me to avoid that with your JS hack, though. Thanks! – hxtk Feb 04 '18 at 03:11
  • 1
    I don't know Rust so I won't be able to help much, but the simplest solution (if Rust doesn't support it out of the box) would be to use PCRE, which supports partial matching (I don't know if using external libraries still qualifies as "pure Rust" for you). This will obviously require you to switch the regex engine to PCRE, but fortunately it's among the best. – Lucas Trzesniewski Feb 04 '18 at 11:39

0 Answers0