I'm working on a java.util.Scanner-like input reader for Rust, both as a learning project and because I haven't seen a powerful text input handler for Rust that I'm happy with.
The problem with regular-language delimiters, outlined in issue #2 on my repository, is that they may be arbitrarily long, so I'm not sure how to handle the case where the BufRead input buffer does not fully contain either the start or the end delimiter.
The OpenJDK 1.7 implementation takes advantage of the fact that its regex engine supports partial matching, i.e., it can ask, "Is the input string a prefix of a member of the regex's language?". If so, it waits for more input.
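To make the three possible answers concrete, here is a minimal sketch of what I mean by partial matching. It uses a hand-written automaton for the single regex a+b rather than a real regex engine; the names Status, State, step and classify are made up for illustration and are not part of any library or of my Scanner.

```rust
/// Result of asking "how does this input relate to the language of a+b?".
#[derive(Debug, PartialEq, Clone, Copy)]
enum Status {
    /// The whole input is a word in the language.
    Match,
    /// Not a word yet, but some continuation of the input would be.
    Partial,
    /// No continuation of the input can ever be a word.
    Dead,
}

/// States of a hand-built DFA for a+b (illustrative only).
#[derive(Debug, PartialEq, Clone, Copy)]
enum State {
    Start,
    SeenA,
    Accept,
    Dead,
}

/// One transition of the automaton.
fn step(state: State, byte: u8) -> State {
    match (state, byte) {
        (State::Start, b'a') | (State::SeenA, b'a') => State::SeenA,
        (State::SeenA, b'b') => State::Accept,
        // Anything else, including input after the accepting state, is dead.
        _ => State::Dead,
    }
}

/// Runs the automaton over the whole input and classifies the end state.
fn classify(input: &[u8]) -> Status {
    let end = input.iter().fold(State::Start, |state, &byte| step(state, byte));
    match end {
        State::Accept => Status::Match,
        State::Dead => Status::Dead,
        State::Start | State::SeenA => Status::Partial,
    }
}
```

With a four-byte buffer holding "aaaa", classify reports Partial, which is the signal to wait for more input; "aaaab" is a Match, and "foo" is Dead, so reading further could never help.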
It seems to me that I cannot solve this problem without prefix matching, because otherwise I would have to read the entire file into the buffer to prove that there is no match for the regex. Specifically, the problem arises when searching for the last of the delimiters that prefix a token; searching for the terminating delimiter has no memory impact whatsoever.
Note that because I am accepting arbitrary regexes from the API's users, I am not aware of a way to construct a regex that matches prefixes of words in the given regex's language. If someone knows how to do this algorithmically, I would accept that as the solution.
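For reference, at the automaton level (as opposed to the regex level) the prefix language is easy to describe: a string is a prefix of some word exactly when it leads the DFA into a state from which an accepting state is still reachable. The sketch below assumes I had an explicit transition table, which is an assumption and not something I currently get from a compiled regex. The Dfa type and all of its methods are hypothetical names for illustration only.

```rust
use std::collections::HashMap;

/// A hypothetical explicit DFA over bytes; a missing edge means "dead".
struct Dfa {
    start: usize,
    accepting: Vec<bool>,
    transitions: Vec<HashMap<u8, usize>>,
}

impl Dfa {
    /// Marks every state from which some accepting state is reachable.
    /// Computed as a simple fixed point over the transition table.
    fn live_states(&self) -> Vec<bool> {
        let mut live = self.accepting.clone();
        let mut changed = true;
        while changed {
            changed = false;
            for (state, edges) in self.transitions.iter().enumerate() {
                if !live[state] && edges.values().any(|&target| live[target]) {
                    live[state] = true;
                    changed = true;
                }
            }
        }
        live
    }

    /// Follows the input from the start state; None if an edge is missing.
    fn run(&self, input: &[u8]) -> Option<usize> {
        let mut state = self.start;
        for byte in input {
            state = *self.transitions[state].get(byte)?;
        }
        Some(state)
    }

    /// True if the input is a word in the language.
    fn accepts(&self, input: &[u8]) -> bool {
        self.run(input).map_or(false, |state| self.accepting[state])
    }

    /// True if the input is a prefix of some word in the language.
    fn accepts_prefix(&self, input: &[u8]) -> bool {
        let live = self.live_states();
        self.run(input).map_or(false, |state| live[state])
    }
}

/// Hand-built DFA for a+b: 0 = start, 1 = seen at least one 'a', 2 = accept.
fn a_plus_b() -> Dfa {
    Dfa {
        start: 0,
        accepting: vec![false, false, true],
        transitions: vec![
            HashMap::from([(b'a', 1)]),
            HashMap::from([(b'a', 1), (b'b', 2)]),
            HashMap::new(),
        ],
    }
}
```

This does not answer the question as asked, since it works on an automaton rather than producing a new regex, but it shows the property I would need the engine to expose.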
If there is a solution without partial matching of regexes, it is welcome as well.
Edit: My latest commit passes the first of the tests below, but under my current line of thought the other two still require partial matching.
For example:
/// This test will fail if we cannot read past the length of the buffer.
/// The buffer size is four characters, so it will read "hell". If we do
/// not continue past the buffer, then it is interpreted as if we have
/// reached EOF. This affects searching for the terminating delimiter.
#[test]
fn buffer_ends_before_delim() {
    let string: &[u8] = b"hello world";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    assert_eq!(test.next(), Some(String::from("hello")));
}
/// This test will fail if we do not solve the above problem in a way that
/// preserves the tail of the original buffer, because in this test case the
/// terminating delimiter begins within the first buffer size and
/// ends within the second.
#[test]
fn buffer_ends_within_delim() {
    let string: &[u8] = b"foo bar";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim_str(" ");
    assert_eq!(test.next(), Some(String::from("foo")));
}
/// This test will fail if we cannot detect partial matches of the delimiter
/// when skipping over prefixed delimiters. Because the buffer size is 4, it
/// will read "aaaa", which is not in the language of /a+b/. However, the
/// automaton is not in a dead state either: reading a "b" would put us in
/// an accepting state, so we must read more input to know whether the regex
/// is satisfied. Reading an additional character will result in "aaaab", which
/// is a valid delimiter in this language and should therefore be skipped.
#[test]
fn buffer_ends_within_start_delim() {
    let string: &[u8] = b"aaaabfoo";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim(Regex::new(r"a+b").unwrap());
    assert_eq!(test.next(), Some(String::from("foo")));
}