651

Is there a quick way to find every match of a regular expression in Ruby? I've looked through the Regexp class in the Ruby standard library and searched on Google to no avail.

warren
Chris Bunch

6 Answers

894

Using scan should do the trick:

string.scan(/regex/)
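
As a minimal sketch (the sample text and variable names are made up for illustration), scan returns whole matches when the pattern has no groups, arrays of captures when it does, and it also accepts a block:

```ruby
text = "cat bat rat"

# No capture groups: each element is the whole match.
all = text.scan(/[a-z]at/)
#=> ["cat", "bat", "rat"]

# With a capture group: each element is an array of that match's captures.
initials = text.scan(/([a-z])at/)
#=> [["c"], ["b"], ["r"]]

# With a block: each match is yielded instead of being collected.
upcased = []
text.scan(/[a-z]at/) { |word| upcased << word.upcase }
#=> upcased is now ["CAT", "BAT", "RAT"]
```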
Andrew Marshall
Jean
  • 10
    But what about this case? "match me!".scan(/.../) = [ "mat", "ch ", "me!" ], but all occurrences of /.../ would be [ "mat", "atc", "tch", "ch ", ... ] – Michael Dickens Dec 25 '11 at 23:22
  • 14
    No, it wouldn't be. /.../ is a normal greedy regexp. It won't backtrack on matched content. You could try to use a lazy regexp, but even that probably won't be enough. Have a look at the regexp doc http://www.ruby-doc.org/core-1.9.3/Regexp.html to correctly express your regexp :) – Jean Jan 03 '12 at 15:31
  • @MichaelDickens There are ways of making Perl regexes do that, such that you can pull out all the overlapping matches, too, but insofar as I am aware, only Perl itself and PCRE support that sort of match operation. – tchrist Mar 24 '12 at 15:29
  • 59
    this seems like a Ruby WTF... why is this on String instead of Regexp with the other regexp stuff? It isn't even mentioned anywhere on the docs for Regexp – Anentropic Mar 12 '13 at 11:36
  • 11
    I guess it's because it's defined and called on String, not on Regexp ... But it does actually make sense. You can write a regular expression to capture all matches using Regexp#match and iterate over captured groups. Here you write a partial match function and want it applied multiple times on a given string; this is not the responsibility of Regexp. I suggest you check the implementation of scan for a better understanding: http://ruby-doc.org/core-1.9.3/String.html#method-i-scan – Jean Mar 12 '13 at 12:29
  • @Anentropic You could just make a method on Regexp yourself if you wanted to :) `class Regexp \n def scan(string) \n string.scan(self) \n end \n end` – Automatico May 20 '14 at 10:51
  • 10
    @MichaelDickens: In this case, you can use `/(?=(...))/`. – Konrad Borowski Oct 25 '14 at 13:50
  • Seems like `scan` does not support back-referencing in the regex (unlike `match`) – hek2mgl Dec 29 '14 at 13:43
  • Is there something like scan that returns indices instead of values? – Justin Jul 07 '15 at 18:35
  • @justin, not that I know of – Jean Jul 08 '15 at 07:58
  • 2
    Thanks @xfix, to get the result in a flat array: `scan(/(?=(...))/).flatten` – ryan2johnson9 Oct 21 '15 at 03:58
  • thanks @xfix, this works perfectly for me, but do you mind explain why using a positive lookahead and capture group will do the trick here? Thanks! – Delong Gao May 10 '18 at 00:20
  • 2
    @DelongGao it makes the regex engine think that the match ending position is the starting position. Normally, matches cannot overlap, and to avoid this issue regex engine starts searching from the ending position of previous match. – Konrad Borowski May 10 '18 at 06:16
  • @Wiktor Stribiżew how about using scan without removing the delimiters ? for this you mentioned ? result = text.scan(/#{starts}(.*?)#{ends}/m) – Afsanefda May 26 '18 at 07:11
  • Suppose `str = "a1ab2cd3d"` and we wish to find all digits that are preceded and followed by the same letter. We could use the regex `r = /(?<=(\p{Alpha}))\d(?=\1)/`. Then `str.scan(r) #=> [["a"], ["d"]]`, which is not what is wanted but understandable because of the way `scan` treats capture groups. We can, however, obtain the desired result as follows: `str.gsub(r).to_a #=> ["1", "3"]`. My point is that `scan` is not always the solution. – Cary Swoveland Mar 14 '19 at 17:05
  • @Konrad thanks so much for your elegant example. Could you elaborate a bit more on lookahead usage here? Particularly why it would advance by one character in the string for each match... i mean I know it would be an infinite response set of the first match if it didn't, but just wanna understand the empty prefix lookahead a bit better here. Thanks! – Christopher Kuttruff Dec 07 '19 at 18:27
  • @ChristopherKuttruff Most regex engines move forward by one character on empty matches. Otherwise you would have an infinite loop, which isn't particularly useful. – Konrad Borowski Dec 08 '19 at 11:05
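
The lookahead trick from the comments above can be sketched as follows, using the "match me!" string from the thread (variable names are illustrative). Each match is zero-width, so scan moves forward one character at a time, and the capture group inside the lookahead records the three characters that follow:

```ruby
text = "match me!"

# Ordinary scan: matches cannot overlap, so the string is consumed in chunks.
plain = text.scan(/.../)
#=> ["mat", "ch ", "me!"]

# Zero-width lookahead with a capture group: one result per start position.
overlapping = text.scan(/(?=(...))/).flatten
#=> ["mat", "atc", "tch", "ch ", "h m", " me", "me!"]
```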
89

To find all the matching strings, use String's scan method.

str = "A 54mpl3 string w1th 7 numb3rs scatter36 ar0und"
str.scan(/\d+/)
#=> ["54", "3", "1", "7", "3", "36", "0"]

If you want MatchData, which is the type of object returned by Regexp#match, use:

str.to_enum(:scan, /\d+/).map { Regexp.last_match }
#=> [#<MatchData "54">, #<MatchData "3">, #<MatchData "1">, #<MatchData "7">, #<MatchData "3">, #<MatchData "36">, #<MatchData "0">]

The benefit of using MatchData is that you can use methods like offset:

match_datas = str.to_enum(:scan, /\d+/).map { Regexp.last_match }
match_datas[0].offset(0)
#=> [2, 4]
match_datas[1].offset(0)
#=> [7, 8]
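
Building on the same idiom, here is a small sketch of collecting the position of every match, which is roughly what a "scan that returns indices" would do. The short sample string is an assumption chosen so the offsets are easy to check by hand:

```ruby
str = "a1b22c333"

# [start, end] pairs for every run of digits; end is exclusive.
spans = str.to_enum(:scan, /\d+/).map { Regexp.last_match.offset(0) }
#=> [[1, 2], [3, 5], [6, 9]]

# Just the starting index of each match.
starts = spans.map(&:first)
#=> [1, 3, 6]
```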

See these questions if you'd like to know more:

Reading about special variables $&, $', $1, $2 in Ruby will be helpful too.

the Tin Man
sudo bangbang
15

If you have a regexp with groups:

str = "A 54mpl3 string w1th 7 numbers scatter3r ar0und"
re = /(\d+)[m-t]/

you can use String's scan method to find matching groups:

str.scan re
#> [["54"], ["1"], ["3"]]

To find the matching pattern:

str.to_enum(:scan,re).map {$&}
#> ["54m", "1t", "3r"]

Or the solution to have the complete matchdata:

str.to_enum(:scan,re).map{Regexp.last_match}
#> [#<MatchData "54m" 1:"54">, #<MatchData "1t" 1:"1">, #<MatchData "3r" 1:"3">]

str.to_enum(:scan,re).map {$~}
#> [#<MatchData "54m" 1:"54">, #<MatchData "1t" 1:"1">, #<MatchData "3r" 1:"3">]
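
If both the whole match and its group are needed at once, one possible sketch is to pull them out of each MatchData in the same pass. The `pairs` name is only illustrative:

```ruby
str = "A 54mpl3 string w1th 7 numbers scatter3r ar0und"
re = /(\d+)[m-t]/

# For every match, keep the full matched text and the first capture group.
pairs = str.to_enum(:scan, re).map do
  m = Regexp.last_match
  [m[0], m[1]]
end
#=> [["54m", "54"], ["1t", "1"], ["3r", "3"]]
```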
MVP
  • `str.scan(/\d+[m-t]/) # => ["54m", "1t", "3r"]` is more idiomatic than `str.to_enum(:scan,re).map {$&}` – the Tin Man Apr 09 '20 at 17:43
  • Maybe you misunderstood. The regular expression in the example I was replying to was `/(\d+)[m-t]/`, not `/\d+[m-t]/`. Writing `re = /(\d+)[m-t]/; str.scan(re)` is the same as `str.scan(/(\d+)[m-t]/)`, but it gives `[["54"], ["1"], ["3"]]` and not `["54m", "1t", "3r"]`. The question was: if I have a regular expression with a group and want to capture all the patterns without changing the regular expression (leaving the group), how can I do it? In this sense, a possible solution, albeit a little cryptic and difficult to read, was: `str.to_enum(:scan,re).map {$&}` – MVP Apr 15 '20 at 15:43
8

You can use string.scan(your_regex).flatten. If your regex contains groups, the captures are returned in a single flat array.

string = "A 54mpl3 string w1th 7 numbers scatter3r ar0und"
your_regex = /(\d+)[m-t]/
string.scan(your_regex).flatten
=> ["54", "1", "3"]

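The regex can contain named groups as well.

string = 'group_photo.jpg'
regex = /\A(?<name>.*)\.(?<ext>.*)\z/
string.scan(regex).flatten
=> ["group_photo", "jpg"]

You can also use gsub without a block or replacement, which returns an enumerator; it's just one more way if you want MatchData.

str = "A 54mpl3 string w1th 7 numbers scatter3r ar0und"
str.gsub(/\d/).map{ Regexp.last_match }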
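
A short sketch of why the gsub enumerator can be handy with grouped patterns: it yields the whole matched text regardless of capture groups, while scan yields the captures. The variable names are illustrative:

```ruby
string = "A 54mpl3 string w1th 7 numbers scatter3r ar0und"
re = /(\d+)[m-t]/

# gsub with only a pattern returns an Enumerator over the full matches.
whole = string.gsub(re).to_a
#=> ["54m", "1t", "3r"]

# scan on the same pattern returns the captured groups instead.
groups = string.scan(re).flatten
#=> ["54", "1", "3"]
```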
the Tin Man
Datt
  • Remove the grouping from `your_regex = /(\d+)[m-t]/` and you won't need to use `flatten`. Your final example uses `last_match` which in this case is probably safe, but is a global and could possibly be overwritten if any regex was matched prior to calling `last_match`. Instead it's probably safer to use `string.match(regex).captures # => ["group_photo", "jpg"]` or `string.scan(/\d+/) # => ["54", "3", "1", "7", "3", "0"]` as shown in other answers, depending on the pattern and needs. – the Tin Man Apr 09 '20 at 17:23
1

If you have capture groups () inside the regex for other purposes, the proposed solutions with String#scan and String#match are problematic:

  1. String#scan only gets what is inside the capture groups;
  2. String#match only gets the first match, rejecting all the others;
  3. String#matches (proposed function) gets all the matches.

In this case, we need a solution that matches the regex without considering the capture groups.

String#matches

With refinements you can monkey patch the String class to implement String#matches, and the method will be available only in the file or module that activates the refinement with `using`. It is a tidy way to monkey patch classes in Ruby.

Setup

  • /lib/refinements/string_matches.rb
# This module adds a String refinement to enable multiple String#match()s
# 1. `String#scan` only gets what is inside the capture groups (inside the parens)
# 2. `String#match` only gets the first match
# 3. `String#matches` (proposed function) gets all the matches
module StringMatches
  refine String do
    def matches(regex)
      scan(/(?<matching>#{regex})/).flatten
    end
  end
end

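This relies on named capture groups: when a regex contains a named group, Ruby does not capture its unnamed groups, so scan returns only the named `matching` group.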
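
The effect of wrapping a pattern in a named group can be checked in isolation. This is only a sketch; `wrapped` and `text` are illustrative names, and the pattern is the one from the usage example in this answer:

```ruby
re = /function\((\d), (\d), (\d)\)/
# Interpolating a Regexp embeds its source inside the new pattern.
wrapped = /(?<matching>#{re})/
text = 'function(1, 2, 3) + function(4, 5, 6)'

# Original pattern: the three unnamed groups are captured.
plain_captures = re.match(text).captures
#=> ["1", "2", "3"]

# Wrapped pattern: only the named group is captured, so scan sees whole matches.
wrapped_captures = wrapped.match(text).captures
#=> ["function(1, 2, 3)"]
```

One consequence of this design is that a pattern which already contains its own named groups would add more captures, so the flattened result would no longer be just the whole matches.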

Usage

  • rails c
> require 'refinements/string_matches'

> using StringMatches

> 'function(1, 2, 3) + function(4, 5, 6)'.matches(/function\((\d), (\d), (\d)\)/)
=> ["function(1, 2, 3)", "function(4, 5, 6)"]

> 'function(1, 2, 3) + function(4, 5, 6)'.scan(/function\((\d), (\d), (\d)\)/)
=> [["1", "2", "3"], ["4", "5", "6"]]

> 'function(1, 2, 3) + function(4, 5, 6)'.match(/function\((\d), (\d), (\d)\)/)[0]
=> "function(1, 2, 3)"
Victor
1

Return an array of MatchData objects

#scan is limited here: it only returns a plain array of strings (or arrays of captures).

It is far more powerful and flexible to get an array of MatchData objects.

I'll provide two approaches (using the same logic), one using a PORO and one using a monkey patch:

PORO:

class MatchAll
  def initialize(string, pattern)
    raise ArgumentError, 'must pass a String' unless string.is_a?(String)

    raise ArgumentError, 'must pass a Regexp pattern' unless pattern.is_a?(Regexp)

    @string = string
    @pattern = pattern
    @matches = []
  end

  def match_all
    recursive_match
  end

  private

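  def recursive_match(prev_match = nil)
    # Resume searching where the previous match ended.
    index = prev_match.nil? ? 0 : prev_match.offset(0)[1]

    matching_item = @string.match(@pattern, index)
    # Plain nil check; `present?` exists only with ActiveSupport (Rails).
    return @matches if matching_item.nil?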

    @matches << matching_item
    recursive_match(matching_item)
  end
end

USAGE:

test_string = 'a green frog jumped on a green lilypad'

MatchAll.new(test_string, /green/).match_all
=> [#<MatchData "green">, #<MatchData "green">]
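
The recursive approach above never advances if the pattern can match an empty string, and it adds a stack frame per match. As a hedged alternative, here is an iterative sketch of the same idea. The standalone `match_all` helper is hypothetical and not part of the answer's class:

```ruby
# Hypothetical helper: collect a MatchData for every non-overlapping match.
def match_all(string, pattern)
  matches = []
  pos = 0
  while pos <= string.length && (m = string.match(pattern, pos))
    matches << m
    # Step past empty matches so the loop always moves forward.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  matches
end

test_string = 'a green frog jumped on a green lilypad'
offsets = match_all(test_string, /green/).map { |m| m.offset(0) }
#=> [[2, 7], [25, 30]]
```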

Monkey patch

I don't typically condone monkey-patching, but in this case:

  • we're doing it the right way by "quarantining" our patch into its own module
  • I prefer this approach because 'string'.match_all(/pattern/) is more intuitive (and looks a lot nicer) than MatchAll.new('string', /pattern/).match_all
module RubyCoreExtensions
  module String
    module MatchAll
      def match_all(pattern)
        raise ArgumentError, 'must pass a Regexp pattern' unless pattern.is_a?(Regexp)

        recursive_match(pattern)
      end

      private

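      def recursive_match(pattern, matches = [], prev_match = nil)
        # Resume searching where the previous match ended.
        index = prev_match.nil? ? 0 : prev_match.offset(0)[1]

        matching_item = self.match(pattern, index)
        # Plain nil check; `present?` exists only with ActiveSupport (Rails).
        return matches if matching_item.nil?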

        matches << matching_item
        recursive_match(pattern, matches, matching_item)
      end
    end
  end
end

I recommend creating a new file for the patch, for example (assuming you're using Rails) /lib/ruby_core_extensions/string/match_all.rb

To use our patch we need to make it available:

# within application.rb
require './lib/ruby_core_extensions/string/match_all.rb'

Then include it in the String class. You can put this wherever you want, for example right under the require statement above. Once included, it is available everywhere, not just in the file where you included it.

String.include RubyCoreExtensions::String::MatchAll

USAGE: And now when you use #match_all you get results like:

test_string = 'hello foo, what foo are you going to foo today?'

test_string.match_all /foo/
=> [#<MatchData "foo">, #<MatchData "foo">, #<MatchData "foo">]

test_string.match_all /hello/
=> [#<MatchData "hello">]

test_string.match_all /none/
=> []

I find this particularly useful when I want to match multiple occurrences and then get useful information about each occurrence, such as the index where it starts and ends (e.g. match.offset(0) => [start_index, end_index], where end_index is one past the last matched character).

some_guy
  • Why does this answer make it so complicated? Why not keep it simple and just answer: `String#scan`? – Siwei Feb 14 '23 at 12:24