48

I have a &[u8] slice over a binary buffer. I need to parse it, but a lot of the methods that I would like to use (such as str::find) don't seem to be available on slices.

I've seen that I can covert both by buffer slice and my pattern to str by using from_utf8_unchecked() but that seems a little dangerous (and also really hacky).

How can I find a subsequence in this slice? I actually need the index of the pattern, not just a slice view of the parts, so I don't think split will work.

JasonN
  • 1,339
  • 1
  • 15
  • 27
  • 4
    There is interest is expanding the concept of `Pattern` to arbitrary slices: [comment](https://github.com/rust-lang/rust/issues/27721#issuecomment-185405392), [RFC](https://github.com/rust-lang/rfcs/issues/984). – Shepmaster Mar 09 '16 at 20:41

4 Answers4

45

Here's a simple implementation based on the windows iterator.

fn find_subsequence(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|window| window == needle)
}

fn main() {
    assert_eq!(find_subsequence(b"qwertyuiop", b"tyu"), Some(4));
    assert_eq!(find_subsequence(b"qwertyuiop", b"asd"), None);
}

The find_subsequence function can also be made generic:

fn find_subsequence<T>(haystack: &[T], needle: &[T]) -> Option<usize>
    where for<'a> &'a [T]: PartialEq
{
    haystack.windows(needle.len()).position(|window| window == needle)
}
Francis Gagné
  • 60,274
  • 7
  • 180
  • 155
  • Very nice. I think I basically did it by hand with two nested for loops. The subarrays I'm looking for are all very small, so doing something more complex like KMP would be useless for my issues. – JasonN Mar 10 '16 at 04:31
  • 6
    While this is a short and nice solution, please note that the algorithm runs in O(|haystack| * |needle|). This won't matter in most cases, but for more advanced and (asymptotically) faster algorithms, see [String searching algorithm (Wikipedia)](https://en.wikipedia.org/wiki/String_searching_algorithm). – Lukas Kalbertodt Mar 10 '16 at 10:56
  • 1
    This winds up being unacceptably slow. windows().position() is 100x slower than two nested loops. – JasonN Mar 10 '16 at 15:08
  • 2
    @JasonN 100x sounds extreme. Are you sure you're compiling with optimizations? – Veedrac Mar 11 '16 at 13:21
  • 2
    @JasonN how did you do with two nested loops? This solution should compile into the same thing as one loop. – SOFe Apr 26 '20 at 08:52
  • I would suggest adding an explicit check for `needle.is_empty()`, since `haystack.windows(0)` would panic, which is probably not the wanted behavior. – Frxstrem Dec 10 '22 at 22:54
5

I don't think the standard library contains a function for this. Some libcs have memmem, but at the moment the libc crate does not wrap this. You can use the twoway crate however. rust-bio implements some pattern matching algorithms, too. All of those should be faster than using haystack.windows(..).position(..)

maxschlepzig
  • 35,645
  • 14
  • 145
  • 182
aseyboldt
  • 1,090
  • 8
  • 14
3

I found the memmem crate useful for this task:

use memmem::{Searcher, TwoWaySearcher};

let search = TwoWaySearcher::new("dog".as_bytes());
assert_eq!(
    search.search_in("The quick brown fox jumped over the lazy dog.".as_bytes()),
    Some(41)
);
Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
Wez Furlong
  • 4,727
  • 1
  • 29
  • 34
2

How about Regex on bytes? That looks very powerful. See this Rust playground demo.

extern crate regex;

use regex::bytes::Regex;

fn main() {
    //see https://doc.rust-lang.org/regex/regex/bytes/

    let re = Regex::new(r"say [^,]*").unwrap();

    let text = b"say foo, say bar, say baz";

    // Extract all of the strings without the null terminator from each match.
    // The unwrap is OK here since a match requires the `cstr` capture to match.
    let cstrs: Vec<usize> =
        re.captures_iter(text)
          .map(|c| c.get(0).unwrap().start())
          .collect();

    assert_eq!(cstrs, vec![0, 9, 18]);
}
Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
Robert Cutajar
  • 3,181
  • 1
  • 30
  • 42