
I tried to implement this regular expression for checking if a string ("username") has a length between 3 and 30, contains only letters (a-z), numbers (0-9), and periods (.) (not consecutive):

use regex::Regex; // 1.3.5

fn main() {
    Regex::new(r"^(?=.{3,30}$)(?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$").unwrap();
}

When trying to compile the regex, I get this error:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
   r"^(?=.{3,30}$)(?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$").unwrap();
     ^^^
error: look-around, including look-ahead and look-behind, is not supported
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Is there an alternative regex or ways to validate strings with these requirements?

I could remove the length check {3,30} and get the string length as suggested, but what about the second part, (?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$ (which prevents leading, trailing, and consecutive dots)?

nbari
  • 3
    *has a length between 3 and 30* — [Get the String length in characters in Rust](https://stackoverflow.com/q/46290655/155423) – Shepmaster Apr 28 '20 at 17:21
  • Also, if you really want to reject the multiple dots, and use the length with just regular expressions, you could use two of them, the first `^[a-z0-9\.]{3,30}$` and a second regex `\.\.` and check that the first matches, and that the second doesn't match. But I think the solution below in my answer, along with a length check would be slightly faster. But, this might be a closer match to your model of the problem. – David Brown Apr 28 '20 at 19:53

1 Answer


The issue at hand is what is meant by "regular expression". Wikipedia has good information on this, but a simple summary is that a regular language is one defined with a few simple operations, including literal matches, alternation, and the Kleene star (match zero or more). Regex libraries have added features that don't extend this language, but make it easier to use (such as being able to say [a-z] instead of (a|b|c|d|e|f...|z)).

Then along came Perl, which implemented support for regular expressions. However, instead of the commonly used NFA/DFA implementation, it used backtracking. This had two consequences: first, it allowed things beyond regular languages to be added, such as backreferences, along with features like look-around that are easy to support in a backtracker; second, matching can be really, really slow.

Many languages adopted these backtracking implementations, but there has been a somewhat recent resurgence of engines that drop the features that are difficult to implement efficiently, specifically backreferences and look-around. Go has done this, and the RE2 library is a C++ implementation of the same idea. And, as you've discovered, the regex crate also works this way. The advantage is that matching always takes time linear in the length of the input.

For your particular example, what you are trying to match is indeed still a regular language; it just has to be expressed differently. Let's start with the easy part: matching the characters, but not allowing consecutive dots. Instead of thinking of it as forbidding consecutive dots, think of it as matching possibly a dot between the characters, where the characters themselves aren't optional. In other words, we can match with: [a-z0-9](\.?[a-z0-9])*. We first match a single non-dot character. If you want to allow the name to start with a dot, you could remove this leading part. Then we need zero or more occurrences of an optional dot followed by a single non-dot character. You could append a \.? if you want to allow a dot at the end.
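To see that this pattern needs no look-around or backtracking, here is a minimal hand-written sketch of the same check as a two-state scan over the input. It uses only the standard library, and `matches_dotted_name` is a hypothetical helper name chosen for illustration, not something from the regex crate. It assumes the pattern is anchored at both ends.

```rust
/// Hand-written equivalent of the anchored pattern `^[a-z0-9](\.?[a-z0-9])*$`.
/// `matches_dotted_name` is a hypothetical name used only for this sketch.
fn matches_dotted_name(s: &str) -> bool {
    // True when the previous character was a letter or digit, which is the
    // only situation where a dot may follow or the string may end.
    let mut prev_was_alnum = false;
    for c in s.chars() {
        match c {
            'a'..='z' | '0'..='9' => prev_was_alnum = true,
            // A dot is allowed only directly after a letter or digit.
            '.' if prev_was_alnum => prev_was_alnum = false,
            // Leading dot, consecutive dots, or any other character.
            _ => return false,
        }
    }
    // Rejects the empty string and names ending in a dot.
    prev_was_alnum
}

fn main() {
    for name in ["valid123", "va.li.d.12.3", ".invalid", "invalid.", "double..dot"] {
        println!("{:?}: {}", name, matches_dotted_name(name));
    }
}
```

Each character is examined once and the only state is a single boolean, which is the kind of linear-time matching an automaton-based engine performs for this pattern.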

The second requirement, of 3-30 characters, would make this regex rather complicated, because our repeated sequence is of 1 or 2 characters. I would suggest, instead, just checking the length programmatically in addition to checking the regex. You could also make a second regex that checks the length, and check that both match (typical regex syntax has no "and" operator for combining two patterns, even though regular languages are closed under intersection).
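As a rough sketch of the programmatic length check, the snippet below counts characters rather than bytes. `length_ok` is a hypothetical helper name, and the 3 to 30 bounds come from the question.

```rust
/// Hypothetical helper: true when `name` has between 3 and 30 characters.
fn length_ok(name: &str) -> bool {
    // `chars().count()` counts Unicode scalar values; `len()` counts bytes.
    (3..=30).contains(&name.chars().count())
}

fn main() {
    println!("{}", length_ok("ss")); // too short
    println!("{}", length_ok("valid123"));

    // Three characters that each take two bytes in UTF-8, so the counts differ.
    let s = "\u{e4}\u{f6}\u{fc}";
    println!("{} chars, {} bytes", s.chars().count(), s.len());
}
```

For names that already match the pattern, every character is ASCII, so the byte length and the character count agree and a plain `len()` check gives the same result.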

Depending on how you are matching, you may also have to anchor the match (putting a ^ at the start and a $ at the end).

A solution to the full problem:

use regex::Regex; // 1.3.5

fn main() {
    let pat = Regex::new(r"^[a-z0-9](\.?[a-z0-9])*$").unwrap();
    let names = &[
        "valid123",
        "va.li.d.12.3",
        ".invalid",
        "invalid.",
        "double..dot",
        "ss",
        "really.long.name.that.is.too.long",
    ];
    for name in names {
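        // `len()` is the length in bytes; names that match `pat` are ASCII,
        // so for them it equals the number of characters.
        let len = name.len();
        let valid = pat.is_match(name) && len >= 3 && len <= 30;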
        println!("{:?}: {:?}", name, valid);
    }
}
Shepmaster
David Brown
  • 2
    Great answer! One small note: "and" (intersection) and complement are closed operations over regular languages, so a true "regular" expression engine can have an "and" operator. It's just hard to implement efficiently. (And also tend to be tricky to reason about.) – BurntSushi5 Apr 28 '20 at 22:54