1

I'm trying to port this Python function that returns true if each character in the pattern appears in the test string in order.

def substr_match(pattern, document):
    p_idx, d_idx, p_len, d_len = 0, 0, len(pattern), len(document)
    while (p_idx != p_len) and (d_idx != d_len):
        if pattern[p_idx].lower() == document[d_idx].lower():
            p_idx += 1
        d_idx += 1
    return p_len != 0 and d_len != 0 and p_idx == p_len

This is what I have at the moment.

fn substr_match(pattern: &str, document: &str) -> bool {
    let mut pattern_idx = 0;
    let mut document_idx = 0;
    let pattern_len = pattern.len();
    let document_len = document.len();

    while (pattern_idx != pattern_len) && (document_idx != document_len) {
       let pat: Vec<_> = pattern.chars().nth(pattern_idx).unwrap().to_lowercase().collect();
       let doc: Vec<_> = document.chars().nth(document_idx).unwrap().to_lowercase().collect();

        if pat == doc {
            pattern_idx += 1;
        }
        document_idx += 1;
    }

    return pattern_len != 0 && document_len != 0 && pattern_idx == pattern_len;
}

I tried s.chars().nth(n) since Rust doesn't seem to allow string indexing, but I feel there is a more idiomatic way of doing it. What would be the preferred way of writing this in Rust?

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
sfactor
  • 12,592
  • 32
  • 102
  • 152

3 Answers3

3

Here is mine:

fn substr_match(pattern: &str, document: &str) -> bool {
    let pattern_chars = pattern.chars().flat_map(char::to_lowercase);
    let mut doc_chars = document.chars().flat_map(char::to_lowercase);

    'outer: for p in pattern_chars {
        for d in &mut doc_chars {
            if d == p {
                continue 'outer;
            }
        }
        return false;
    }    
    true
}
hwiechers
  • 14,583
  • 8
  • 53
  • 62
  • Fixed it, I guess. [playground link](https://play.rust-lang.org/?gist=d2364b0ce4ba56c2a482bfe97275c0c9&version=stable) – hwiechers Nov 05 '17 at 06:54
3

The other answers mimic the behavior of the Python function you started with, but it may be worth trying to make it better. I thought of two test cases where the original function may have surprising behavior:

>>> substr_match("ñ", "in São Paulo")
True
>>> substr_match("", "")
True

Hmm.

(The first example may depend on your input method; try copying and pasting. Also, if you can't see them, the special characters in the second example are flag emoji for the United States, Ukraine, and Slovakia.)

Without getting into why these tests fail or all the other things that could potentially be undesired, if you want to correctly handle Unicode text, you need to, at minimum, operate on graphemes instead of code points (this question describes the difference). Rust doesn't provide this feature in the standard library, so you need the unicode-segmentation crate, which provides a graphemes method on str.

extern crate unicode_segmentation;

use unicode_segmentation::UnicodeSegmentation;

fn substr_match(pattern: &str, document: &str) -> bool {
    let mut haystack = document.graphemes(true);
    pattern.len() > 0 && pattern.graphemes(true).all(|needle| {
        haystack
            .find(|grapheme| {
                grapheme
                    .chars()
                    .flat_map(char::to_lowercase)
                    .eq(needle.chars().flat_map(char::to_lowercase))
            })
            .is_some()
    })
}

Playground, test cases provided.

This algorithm takes advantage of several convenience methods on Iterator. all iterates over the pattern. find short-circuits, so whenever it finds the next needle in haystack, the next call to haystack.find will start at the following element.

(I thought this approach was somewhat clever, but honestly, a nested for loop is probably easier to read, so you might prefer that.)

The last "tricky" bit is case-insensitive string comparison, which is inherently language-dependent, but if you're willing to accept only unconditional mappings (those that apply in any language), char::to_lowercase does the trick. Rather than collect the result into a String, though, you can use Iterator::eq to compare the sequences of (lowercased) characters.


One other thing you may want to consider is Unicode normalization -- this question is a good place for the broad strokes. Fortunately, Rust has a unicode-normalization crate, too! And it looks quite easy to use. (You wouldn't necessarily want to use it in this function, though; instead, you might normalize all text on input so that you're dealing with the same normalization form everywhere in your program.)

Community
  • 1
  • 1
trent
  • 25,033
  • 7
  • 51
  • 90
  • Mutability is tricky; I find the fact that `haystack` is borrowed mutably, and therefore the subsequent invocations of the lambda in `.all(...)` do NOT operate on the same data, rather non-obvious. I did not realize it immediately and thought that it was starting from the beginning of `haystack` each time at first glance :( – Matthieu M. Nov 06 '17 at 10:44
  • @MatthieuM. Yeah, I honestly feel a little guilty for writing it because I usually go for obvious over clever :/ On the other hand, if you can't be clever on Stack Overflow, where *can* you be? – trent Nov 06 '17 at 13:08
  • To be honest, I had the same double-take about @hwiechers' solution with for loops, as it also relies on keeping the iterators in between iterations of the inner loop, so it's not specific to your solution. Although, in the explicit loop, there is this weird `&mut` sticking out in front to catch your attention. – Matthieu M. Nov 06 '17 at 13:24
2

str::chars() returns an iterator. Iterators return elements from a sequence one at a time. Specifically, str::chars() returns characters from a string one at a time. It's much more efficient to use a single iterator to iterate over a string than to create a new iterator each time you want to look up a character, because s.chars().nth(n) needs to perform a linear scan in order to find the nth character in the UTF-8 encoded string.

fn substr_match(pattern: &str, document: &str) -> bool {
    let mut pattern_iter = pattern.chars();
    let mut pattern_ch_lower: String = match pattern_iter.next() {
        Some(ch) => ch,
        None => return false,
    }.to_lowercase().collect();

    for document_ch in document.chars() {
        let document_ch_lower: String = document_ch.to_lowercase().collect();
        if pattern_ch_lower == document_ch_lower {
            pattern_ch_lower = match pattern_iter.next() {
                Some(ch) => ch,
                None => return true,
            }.to_lowercase().collect();
        }
    }

    return false;
}

Here, I'm demonstrating two ways of using iterators:

  1. To iterate over the pattern, I'm using the next method manually. next returns an Option: Some(value) if the iterator hasn't finished, or None if it has.
  2. To iterate over the document, I'm using a for loop. The for loop does the work of calling next and unwrapping the result until next returns None.

One thing to notice is that I'm using a return expression inside a match expression (twice). Since a return expression doesn't produce a value, the compiler knows that its type doesn't matter. In this case, on the Some arm, the result is a char, so the whole match evaluates to a char.


We could also do this with two nested for loops:

fn substr_match(pattern: &str, document: &str) -> bool {
    if pattern.len() == 0 {
        return false;
    }

    let mut document_iter = document.chars();
    for pattern_ch in pattern.chars() {
        let pattern_ch_lower: String = pattern_ch.to_lowercase().collect();
        for document_ch in &mut document_iter {
            let document_ch_lower: String = document_ch.to_lowercase().collect();
            if pattern_ch_lower == document_ch_lower {
                break;
            }
        }

        return false;
    }

    return true;
}

There are two things to notice here:

  1. We need to handle the case where the pattern is empty without using the iterator.
  2. In the inner loop, we don't want to restart from the start of the document when we move to the next pattern character, so we need to reuse the same iterator over the document. When we write for x in iter, the for loop takes ownership of iter; to avoid that, we must write &mut iter instead. Mutable references to iterators are iterators themselves, thanks to the blanket implementation impl<'a, I> Iterator for &'a mut I where I: Iterator + ?Sized in the standard library.
Francis Gagné
  • 60,274
  • 7
  • 180
  • 155