3

I'm using Ruby 2.4. I want to match zero or more characters that are neither letters nor digits, followed by one or more digits, followed by any number of characters that are neither letters nor digits. However, this string

2.4.0 :001 > token = "17 Milton,GA"
 => "17 Milton,GA"
...
2.4.0 :004 > Regexp.new("\\A([[:space:]]|[^\p{L}^0-9])*\\d+[^\p{L}^0-9]*\\z").match?(token.downcase)
 => true

is matching my regular expression, and I don't want it to, since letters follow the number. What do I need to adjust in my regexp so that the only thing that can match after the digits is non-letters and non-digits?

Dave
  • Non letters, non-numbers after a bunch of numbers `(?<=\d)[\W_]+` –  May 20 '17 at 19:34
  • What did you mean to match with `[^\p{L}^0-9]`? Any char but letter and digit? Try `/\A[^[:alnum:]]*\d+[^[:alnum:]]*\z/`. BTW, I think your regex might work if you add a backslash to `\p` => `\\p` since you are using a double quoted string literal in a `Regexp.new` constructor rather than a regex literal. – Wiktor Stribiżew May 20 '17 at 19:54

2 Answers

3

There are a couple of issues with the regex.

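1) When you use a double-quoted string literal in a Regexp.new constructor, you need to double each backslash to get a literal one (\p => \\p). Otherwise Ruby drops the backslash of the unrecognized escape \p, and the regex engine sees a plain p.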

2) [^\p{L}^0-9] is the wrong construct for "any char but a letter or digit" because the second ^ is treated as a literal ^ symbol. You need to remove the second ^ at least. You may also use [^[:alnum:]] to match any non-alphanumeric symbol.

3) The pattern above matches whitespace, too, so you do not need to alternate it with [[:space:]]: ([[:space:]]|[^\p{L}^0-9])* -> [^\p{L}0-9]*.
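
The three points can be checked in a few lines. This is only a minimal sketch: the constant names are made up for illustration, and the [[:space:]] alternation from the question is left out because the negated class already covers whitespace.

```ruby
# Point 1: in a double-quoted string Ruby drops the backslash of the unknown
# escape \p, so the regex engine receives p{L} as four ordinary characters.
UNESCAPED = "[^\p{L}^0-9]"     # as written in the question
ESCAPED   = "[^\\p{L}^0-9]"    # backslash doubled, nothing else changed

puts UNESCAPED   # prints [^p{L}^0-9]
puts ESCAPED     # prints [^\p{L}^0-9]

TOKEN  = "17 milton,ga"
BROKEN = Regexp.new("\\A#{UNESCAPED}*\\d+#{UNESCAPED}*\\z")
FIXED  = Regexp.new("\\A#{ESCAPED}*\\d+#{ESCAPED}*\\z")
puts BROKEN.match?(TOKEN)   # true: nothing in " milton,ga" is p, {, L, }, ^ or a digit
puts FIXED.match?(TOKEN)    # false: the letters are now rejected

# Point 2: a ^ that is not the first character of the class is a literal caret.
puts Regexp.new("\\A#{ESCAPED}\\z").match?("^")      # false: ^ is excluded as well
puts Regexp.new("\\A[^\\p{L}0-9]\\z").match?("^")    # true

# Point 3: the negated class already accepts whitespace.
puts Regexp.new("\\A[^\\p{L}0-9]\\z").match?(" ")    # true
```

Doubling the backslash alone is enough to reject the sample string; removing the second ^ only changes whether a literal caret is allowed around the digits.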

So, you may use your fixed Regexp.new("\\A[^\\p{L}0-9]*\\d+[^\\p{L}0-9]*\\z") regexp, or use

/\A[^[:alnum:]]*\d+[^[:alnum:]]*\z/.match?(token.downcase)

See the Rubular demo where your sample string is not matched with the regex.

Details:

  • \A - start of a string
  • [^[:alnum:]]* - 0+ non-alphanumeric chars
  • \d+ - 1+ digits
  • [^[:alnum:]]* - 0+ non-alphanumeric chars
  • \z - end of string.
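
To see how the pattern behaves on a few inputs, here is a small sketch. NUMBER_ONLY and SAMPLES are illustrative names, and every sample other than the question's token is made up.

```ruby
NUMBER_ONLY = /\A[^[:alnum:]]*\d+[^[:alnum:]]*\z/

SAMPLES = ["17 Milton,GA", "  17 -- ", "$ ^*123@-", "$9^*123@-", "--"]

SAMPLES.each do |s|
  puts "#{s.inspect} => #{NUMBER_ONLY.match?(s.downcase)}"
end
# "17 Milton,GA" => false   (letters follow the digits)
# "  17 -- "     => true
# "$ ^*123@-"    => true
# "$9^*123@-"    => false   (two separate digit runs)
# "--"           => false   (no digits at all)
```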
Wiktor Stribiżew
  • You just going to let in other numbers ? http://rubular.com/r/lk9zOsGJ1M –  May 20 '17 at 21:15
  • Perhaps, that is OP intention since the original pattern contains the ASCII digit range. Surely, using [`/\A[^\p{L}\p{N}]*\d+[^\p{L}\p{N}]*\z/`](http://rubular.com/r/baBlzSW9RD) will handle all Unicode digits, but I am just not sure it is expected. – Wiktor Stribiżew May 20 '17 at 21:29
  • I wouldn't be so quick to interpret Unicode and how engines implement it. And of course, something you can check yourself: (_per unicode 9_) `[[:alnum:]]` (117,347) and `[\p{L}\p{N}]` (118,258) do not match identical items. It might be better to use `\w` whenever possible, but as well, `[^\W_]` leaves in all those `_` like characters. –  May 20 '17 at 21:40
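
The difference discussed in these comments can be sketched with a single non-ASCII digit. This assumes UTF-8 strings and relies on Ruby's `\d` matching only ASCII 0-9; the constant names are illustrative.

```ruby
ARABIC_THREE = "\u0663"   # ARABIC-INDIC DIGIT THREE, Unicode category Nd

ASCII_DIGITS   = /\A[^\p{L}0-9]*\d+[^\p{L}0-9]*\z/      # excludes only 0-9 around the number
UNICODE_DIGITS = /\A[^\p{L}\p{N}]*\d+[^\p{L}\p{N}]*\z/  # excludes every Unicode number

sample = "#{ARABIC_THREE} 17"
puts ASCII_DIGITS.match?(sample)     # true: the Arabic-Indic digit passes as "not a letter, not 0-9"
puts UNICODE_DIGITS.match?(sample)   # false: \p{N} rejects it, and \d does not consume it
```

Which of the two is wanted depends on whether non-ASCII digits should count as numbers in the input.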
1

Here are three ways to do that.

#1 Use a regular expression with a capture group

r = /
    \A                    # match beginning of string
    [^[[:alnum:]]]*       # match 0+ chars other than letters and digits
    (\d+)                 # match 1+ digits in capture group 1
    [^[[:alnum:]]]*       # match 0+ chars other than letters and digits
    \z                    # match end of string
    /x                    # free-spacing regex definition mode

"$ ^*123@-"[r, 1]         #=> '123'
"$ ^*123@-a?"[r, 1]       #=> nil
"$9^*123@-"[r, 1]         #=> nil

#2 Use a regular expression with \K and a positive lookahead

r = /
    \A                    # match beginning of string
    [^[[:alnum:]]]*       # match 0+ chars other than letters and digits
    \K                    # discard all matched so far
    \d+                   # match 1+ digits
    (?=[^[[:alnum:]]]*\z) # require 0+ chars other than letters and digits,
                          # then end of string, in a positive lookahead
    /x                    # free-spacing mode

"$ ^*123@-"[r]            #=> '123'
"$ ^*123@-a?"[r]          #=> nil
"$9^*123@-"[r]            #=> nil

Note that we cannot have a positive lookbehind in place of \K as Ruby does not support variable-length lookbehinds.
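
A rough way to confirm this, as far as I know on any Ruby version that uses the Onigmo engine: compiling the lookbehind form raises a RegexpError, while the \K form compiles and returns just the digits. `compile_error` is a hypothetical helper written for this sketch, not part of Ruby.

```ruby
# Returns the RegexpError message if the pattern fails to compile, else nil.
# (compile_error is a made-up helper for this illustration.)
def compile_error(source)
  Regexp.new(source)
  nil
rescue RegexpError => e
  e.message
end

LOOKBEHIND = '(?<=\A[^[:alnum:]]*)\d+'   # variable-length lookbehind
WITH_K     = '\A[^[:alnum:]]*\K\d+'      # same idea expressed with \K

puts compile_error(LOOKBEHIND).inspect   # an error message string
puts compile_error(WITH_K).inspect       # nil
puts '$ ^*123@-'[Regexp.new(WITH_K)]     # 123
```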

#3 Use simpler regular expressions together with String methods

def extract(str)
  return nil if str =~ /[[:alpha:]]/
  a = str.scan(/\d+/)
  a.size == 1 ? a.first : nil
end

extract("$ ^*123@-")      #=> '123'
extract("$ ^*123@-a?")    #=> nil
extract("$9^*123@-")      #=> nil
Cary Swoveland
  • Where I have `[^[[:alnum:]]]*`, which uses the POSIX expression `[[:alnum:]]`, I formerly had `[^0-9a-z]*` (and `/ix`). I noticed @WiktorStribiżew used `[[:alnum:]]` instead. That's a better choice, so I adopted it. Not only is it simpler, but it recognizes non-ASCII (e.g., accented) characters. – Cary Swoveland May 20 '17 at 21:49