I am trying to create a regex in Ruby that matches strings containing at least 10 characters that are not special characters, i.e. characters that \w would match. So far I have come up with /\w{10,}/, but it only matches a consecutive run of word characters. I want to match any string that contains at least 10 "word" characters in total. Is this possible? I am fairly new to regex as a whole, so any help would be appreciated.

perrywinkle
  • Can you include some example strings and the part(s) that you want / don't want to match? – Stefan May 17 '21 at 16:31
  • Your question is not clear. You are given a string `str`. Do you merely wish to determine if `str` contains at least 10 word characters? If so, `str.scan(/\w/).size >= 10` would suffice. If you wish to extract all strings that contain 10 or more word characters, you need to clarify whether `"12345678901234567890"` contains one such string, two strings (`"1234567890"` and `"1234567890"`) or 11 (possibly overlapping) strings (`"1234567890"`, `"2345678901"`, etc.). Please edit to clarify your question. – Cary Swoveland May 17 '21 at 18:38

2 Answers

4

If I understood correctly, this should work:

/(?:\w[^\w]*){9,}\w/

Explanation:

We start with a single

\w

We want to capture all the other characters until another \w, hence:

\w[^\w]*

[^<list of chars>] matches any character other than listed in the brackets, so [^\w] means any character that is not a word character. * denotes 0 or more. The above will match "a-- ", "b" and "c!" in "a-- bc!" string.
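To see those three chunks for yourself, here is a minimal sketch using String#scan on the example string from the paragraph above. The constant names are made up for illustration.

```ruby
# One word character followed by any run of non-word characters.
CHUNK = /\w[^\w]*/

# scan returns every non-overlapping match, left to right.
CHUNKS = "a-- bc!".scan(CHUNK)
p CHUNKS  # => ["a-- ", "b", "c!"]
```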

Since we need 10 \w, we will match 9 (or more) groups like that, followed by a single \w

(\w[^\w]*){9,}\w

We don't really care about captures here (a repeated group only keeps the capture from its last iteration anyway), so we make the group non-capturing:

(?:\w[^\w]*){9,}\w
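A quick check of the finished pattern, as a sketch. The sample strings and the constant name are my own, not from the question, and Regexp#match? needs Ruby 2.4 or later.

```ruby
# Nine or more "word character + optional junk" groups, then a final word character.
AT_LEAST_TEN = /(?:\w[^\w]*){9,}\w/

p AT_LEAST_TEN.match?("a-b-c-d-e-f-g-h-i-j")  # 10 word characters, separated => true
p AT_LEAST_TEN.match?("a-b-c-d-e-f-g-h-i")    # only 9 word characters       => false
p AT_LEAST_TEN.match?("abcdefghij")           # 10 consecutive ones          => true
```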

Alternatively we could just use simpler regex:

(?:\w[^\w]*){10,}

But its match will also include any non-word characters that follow the last word character - not sure if this matters here.
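The difference only shows up in the matched text, not in whether a match is found. A small sketch, with a made-up sample string that has exactly 10 word characters followed by "!!":

```ruby
SAMPLE = "ab cd ef gh ij!!"

ENDS_ON_WORD_CHAR = /(?:\w[^\w]*){9,}\w/
INCLUDES_TRAILING = /(?:\w[^\w]*){10,}/

# String#[] with a regex returns the matched substring (or nil).
p SAMPLE[ENDS_ON_WORD_CHAR]  # => "ab cd ef gh ij"
p SAMPLE[INCLUDES_TRAILING]  # => "ab cd ef gh ij!!"
```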

BroiSatse
  • Thanks, it seems to do what I want. Could you briefly explain how this works though? Struggling to understand it. – perrywinkle May 17 '21 at 15:57
  • Thanks for the explanation, I understand the logic now. One last thing: what if we wanted to put a maximum number on this? For example, if we wanted to put a limit of 20 "word" characters, would we have to just put {10,20} ? – perrywinkle May 17 '21 at 16:02
  • It depends on how you want to use that limit. If this is for validation, you'd additionally need to wrap it between `\A` and `\z`. If you use it for scanning, just adding the limit to the range would work. – BroiSatse May 17 '21 at 16:06
  • Can you not use `\W` in place of `[^\w]`? btw, that pianist looks quite a bit younger than you. – Cary Swoveland May 17 '21 at 17:15
  • @CarySwoveland - Good point. I always have doubts whether \W is a simple negation of \w or not! But it seems it is, I'll update the answer. :) And yes, that photo was taken almost 10 years ago now. I think it is time to update... – BroiSatse May 18 '21 at 11:09
1

Match anywhere in the string:

/\w(?:\W*\w){9,19}/
/(?:\W*\w){10,20}/
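Both patterns find the same strings, but the matched text differs: the first always starts on a word character, while the second may pick up leading non-word characters. A sketch with an invented sample string:

```ruby
PADDED = "--abcdefghij--"

STARTS_ON_WORD_CHAR   = /\w(?:\W*\w){9,19}/
MAY_START_ON_NON_WORD = /(?:\W*\w){10,20}/

p PADDED[STARTS_ON_WORD_CHAR]    # => "abcdefghij"
p PADDED[MAY_START_ON_NON_WORD]  # => "--abcdefghij"
```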

Validate that a string contains 10 to 20 word characters:

/\A(?:\W*\w){10,20}\W*\z/
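As a sketch of how this could be used for validation: the helper name word_count_ok? and the sample inputs are hypothetical, chosen only to exercise both bounds.

```ruby
TEN_TO_TWENTY = /\A(?:\W*\w){10,20}\W*\z/

# Hypothetical helper: true when the whole string holds 10..20 word characters.
def word_count_ok?(str)
  TEN_TO_TWENTY.match?(str)
end

p word_count_ok?("a-b-c-d-e-f-g-h-i-j")  # 10 word characters => true
p word_count_ok?("--abc--")              # 3 word characters  => false
p word_count_ok?("a" * 20)               # 20 word characters => true
p word_count_ok?("a" * 21)               # 21 word characters => false
```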

Prefer non-capturing groups, particularly when extracting found matches.

Watch out for ^ and $: in Ruby's regex they match at the start and end of each line, not of the whole string, so use \A and \z for validation.
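A small sketch of why the anchors matter. The input below is invented: its first line has exactly 10 word characters, but the string as a whole has 40, so it should fail a 10 to 20 check.

```ruby
MULTILINE_INPUT = "abcdefghij\n" + "x" * 30

LINE_ANCHORED   = /^(?:\W*\w){10,20}\W*$/
STRING_ANCHORED = /\A(?:\W*\w){10,20}\W*\z/

# ^ and $ are satisfied by the first line alone, so the check wrongly passes.
p LINE_ANCHORED.match?(MULTILINE_INPUT)    # => true
# \A and \z force the pattern to cover the entire string.
p STRING_ANCHORED.match?(MULTILINE_INPUT)  # => false
```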

EXPLANATION

--------------------------------------------------------------------------------
  \A                       the beginning of the string
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (between 10 and
                           20 times, matching as many times as
                           possible):
--------------------------------------------------------------------------------
    \W*                      non-word characters (all but a-z, A-Z,
                             0-9, _), 0 or more times, matching as
                             many as possible
--------------------------------------------------------------------------------
    \w                       a word character (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
  ){10,20}                 end of grouping
--------------------------------------------------------------------------------
  \W*                      non-word characters (all but a-z, A-Z,
                           0-9, _), 0 or more times, matching as
                           many as possible
--------------------------------------------------------------------------------
  \z                       the end of the string
Ryszard Czech