How do I match non-letters and non-numbers after a bunch of numbers?

Question

I'm using Ruby 2.4. I want to match a bunch of non-letters and non-numbers, followed by one or more numbers, followed by an arbitrary number of non-letters and non-numbers. However, this string

2.4.0 :001 > token = "17 Milton,GA"
 => "17 Milton,GA"
...
2.4.0 :004 > Regexp.new("\\A([[:space:]]|[^\p{L}^0-9])*\\d+[^\p{L}^0-9]*\\z").match?(token.downcase)
 => true

is matching my regular expression and I don't want it to, since there are letters that follow the number. What do I need to adjust in my regexp so that the only thing I can match after the numbers will be non-letters and non-numbers?

Non letters, non-numbers after a bunch of numbers `(?<=\d)[\W_]+` — , May 20 '17 at 19:34
What did you mean to match with `[^\p{L}^0-9]`? Any char but letter and digit? Try `/\A[^[:alnum:]]*\d+[^[:alnum:]]*\z/`. BTW, I think your regex might work if you add a backslash to `\p` => `\\p` since you are using a double quoted string literal in a `Regexp.new` constructor rather than a regex literal. — Wiktor Stribiżew, May 20 '17 at 19:54

score 3 · Accepted Answer · answered May 20 '17 at 19:59

There are a few issues with the regex.

1) When you are using a double quoted string literal in a Regexp.new constructor, to declare a literal backslash you need to double it (\p => \\p)

2) [^\p{L}^0-9] is the wrong construct for "any char but a letter or digit" because the second ^ is treated as a literal ^ symbol. You need to remove the second ^ at least. You may also use [^[:alnum:]] to match any non-alphanumeric symbol.

3) The pattern above matches whitespace, too, so you do not need to alternate it with [[:space:]]. ([[:space:]]|[^\p{L}^0-9])* -> [^\p{L}0-9]*.
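To see the first two issues in action, here is a small sketch. The constant names (BROKEN_SOURCE, FIXED_SOURCE, TOKEN) are made up for illustration; the patterns and the sample string are the ones from the question and from this answer.

```ruby
# The asker's pattern exactly as typed. "\p" is not a recognised escape in a
# double-quoted string, so Ruby drops the backslash and keeps a plain "p".
BROKEN_SOURCE = "\\A([[:space:]]|[^\p{L}^0-9])*\\d+[^\p{L}^0-9]*\\z"
# The same pattern with the backslash doubled and the stray ^ removed.
FIXED_SOURCE = "\\A[^\\p{L}0-9]*\\d+[^\\p{L}0-9]*\\z"
TOKEN = "17 Milton,GA"

# What the regex engine actually receives in each case.
puts BROKEN_SOURCE  # \A([[:space:]]|[^p{L}^0-9])*\d+[^p{L}^0-9]*\z
puts FIXED_SOURCE   # \A[^\p{L}0-9]*\d+[^\p{L}0-9]*\z

# [^p{L}^0-9] only excludes p, {, L, }, ^ and ASCII digits,
# so " milton,ga" is happily consumed after the digits.
puts Regexp.new(BROKEN_SOURCE).match?(TOKEN.downcase)  # true
puts Regexp.new(FIXED_SOURCE).match?(TOKEN.downcase)   # false

# Issue 2: a second ^ inside the brackets is a literal caret,
# which then becomes one of the excluded characters.
puts(/[^\p{L}^0-9]/.match?("^"))  # false
puts(/[^\p{L}0-9]/.match?("^"))   # true
```

Printing the string before handing it to Regexp.new is a quick way to check which escapes survived.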

So, you may use your fixed Regexp.new("\\A[^\\p{L}0-9]*\\d+[^\\p{L}0-9]*\\z") regexp, or use

/\A[^[:alnum:]]*\d+[^[:alnum:]]*\z/.match?(token.downcase)

See the Rubular demo where your sample string is not matched with the regex.

Details:

\A - start of a string
[^[:alnum:]]* - 0+ non-alphanumeric chars
\d+ - 1+ digits
[^[:alnum:]]* - 0+ non-alphanumeric chars
\z - end of string.
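As a quick sanity check, the literal form can be run against a few inputs. This is only a sketch: the constant names and every sample except the asker's token are invented here.

```ruby
NON_ALNUM_AROUND_DIGITS = /\A[^[:alnum:]]*\d+[^[:alnum:]]*\z/

SAMPLES = [
  "17 milton,ga",  # letters after the digits
  "  17 ,.",       # only spaces and punctuation around the digits
  "17",            # digits alone
  "$9^*123@-",     # two separate digit runs
  "\u00e917"       # accented letter before the digits
]

SAMPLES.each do |sample|
  puts "#{sample.inspect} => #{NON_ALNUM_AROUND_DIGITS.match?(sample)}"
end
```

Only the second and third samples should match. The last one is rejected because `[[:alnum:]]` is Unicode-aware in Ruby and treats "é" as a letter.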

Wiktor Stribiżew

Are you just going to let in other numbers? http://rubular.com/r/lk9zOsGJ1M – May 20 '17 at 21:15
Perhaps, that is OP intention since the original pattern contains the ASCII digit range. Surely, using [`/\A[^\p{L}\p{N}]*\d+[^\p{L}\p{N}]*\z/`](http://rubular.com/r/baBlzSW9RD) will handle all Unicode digits, but I am just not sure it is expected. – Wiktor Stribiżew May 20 '17 at 21:29
I wouldn't be so quick to interpret Unicode and how engines implement it. And of course, this is something you can check yourself: (_per unicode 9_) `[[:alnum:]]` (117,347) and `[\p{L}\p{N}]` (118,258) do not match identical items. It might be better to use `\w` whenever possible, but as well, `[^\W_]` leaves in all those `_` like characters. – May 20 '17 at 21:40
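The difference between the ASCII digit range and `\p{N}` discussed in these comments can be illustrated with a non-ASCII digit. This is a sketch with an invented sample string (U+0663, ARABIC-INDIC DIGIT THREE, followed by " 17") and made-up constant names; it is not taken from either Rubular demo.

```ruby
ASCII_DIGITS_ONLY  = /\A[^\p{L}0-9]*\d+[^\p{L}0-9]*\z/
ANY_UNICODE_NUMBER = /\A[^\p{L}\p{N}]*\d+[^\p{L}\p{N}]*\z/

SAMPLE = "\u0663 17"

# U+0663 is neither a letter nor in 0-9, so the first pattern
# accepts it as leading "non-letter, non-number" filler.
puts ASCII_DIGITS_ONLY.match?(SAMPLE)   # true

# \p{N} covers U+0663, so it cannot be filler here, and Ruby's \d
# only matches ASCII digits, so it cannot start the digit run either.
puts ANY_UNICODE_NUMBER.match?(SAMPLE)  # false
```

Which behaviour is wanted depends on whether such characters should count as "numbers" in the input.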

Cary Swoveland · Answer 2 · 2017-05-21T00:55:20.020

Here are three ways to do that.

#1 Use a regular expression with a capture group

r = /
    \A                    # match beginning of string
    [^[[:alnum:]]]*       # match 0+ chars other than letters and digits
    (\d+)                 # match 1+ digits in capture group 1
    [^[[:alnum:]]]*       # match 0+ chars other than letters and digits
    \z                    # match end of string
    /x                    # free-spacing regex definition mode

"$ ^*123@-"[r, 1]         #=> '123'
"$ ^*123@-a?"[r, 1]       #=> nil
"$9^*123@-"[r, 1]         #=> nil

#2 Use a regular expression with \K and a positive lookahead

r = /
    \A                    # match beginning of string
    [^[[:alnum:]]]*       # match 0+ chars other than letters and digits
    \K                    # discard all matched so far
    \d+                   # match 1+ digits
    (?=[^[[:alnum:]]]*\z) # match 0+ chars other than letters and digits
                          # in a positive lookahead
    /x                    # free-spacing mode

"$ ^*123@-"[r]            #=> '123'
"$ ^*123@-a?"[r]          #=> nil
"$9^*123@-"[r]            #=> nil

Note that we cannot have a positive lookbehind in place of \K as Ruby does not support variable-length lookbehinds.
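A small sketch of that limitation, assuming the Onigmo-based regex engine that Ruby has shipped since 2.0. The constant names are invented, and the lookbehind pattern is built with Regexp.new so that the failure is a rescuable RegexpError instead of a parse-time error.

```ruby
# The \K version from above, written on one line.
KEEP_OUT = /\A[^[:alnum:]]*\K\d+(?=[^[:alnum:]]*\z)/
puts "$ ^*123@-"[KEEP_OUT].inspect  # "123"

# The same idea with a lookbehind in place of \K. The lookbehind body
# contains a *, so its length is not fixed and the engine rejects it.
LOOKBEHIND_ERROR =
  begin
    Regexp.new('(?<=\A[^[:alnum:]]*)\d+(?=[^[:alnum:]]*\z)')
    nil
  rescue RegexpError => e
    e
  end

puts LOOKBEHIND_ERROR.inspect
```

On the Ruby versions I am aware of, the second `puts` shows a RegexpError complaining about the look-behind pattern.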

#3 Use simpler regular expressions together with String methods

def extract(str)
  return nil if str =~ /[[:alpha:]]/
  a = str.scan(/\d+/)
  a.size == 1 ? a.first : nil
end

extract("$ ^*123@-")      #=> '123'
extract("$ ^*123@-a?")    #=> nil
extract("$9^*123@-")      #=> nil

Where I have `[^[[:alnum:]]]*`, which uses the POSIX expression `[[:alnum:]]`, I formerly had `[^0-9a-z]*` (and `/ix`). I noticed @WiktorStribiżew used `[[:alnum:]]` instead. That's a better choice, so I adopted it. Not only is it simpler, but it recognizes non-ASCII (e.g., accented) characters. — Cary Swoveland, May 20 '17 at 21:49
