How do I get the match data for all occurrences of a Ruby regular expression in a string?

Question

I need the MatchData for each occurrence of a regular expression in a string. This differs from the scan method suggested in "Match All Occurrences of a Regex", which only gives me an array of strings; I need the full MatchData to get begin and end information, etc.

input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/

numbers.match input # #<MatchData "12"> (only the first match)
input.scan numbers  # ["12", "34", "567"] (all matches, but only the strings)

I suspect there is some method that I've overlooked. Suggestions?

I want the begin and end positions for each match. But that is irrelevant to my question. MatchData exists for a reason, doesn't it? If I can get it for the first match, it follows that it would be useful for all matches. — Joshua Flanagan, Jul 24 '11 at 02:32
Ok, I want more than one thing, in a convenient package, for each match. — Joshua Flanagan, Jul 24 '11 at 02:54
You have the convenient package, as you call it, in the solution I gave below (from which you can get begin, end, or whatever match data you need). Or is there something else you are looking for? — i-blis, Jul 24 '11 at 22:29

score 75 · Accepted Answer · edited Jun 17 '20 at 23:59

You want

"abc12def34ghijklmno567pqrs".to_enum(:scan, /\d+/).map { Regexp.last_match }

which gives you

[#<MatchData "12">, #<MatchData "34">, #<MatchData "567">]

The "trick" is, as you see, to build an enumerator in order to get each last_match.
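To spell out why this works: scan refreshes Regexp.last_match (also available as $~) before it yields each match, so mapping over an enumerator built on scan captures a fresh MatchData every time. Here is a minimal sketch of the idea wrapped in a helper; the name all_matches is made up for illustration and is not part of Ruby's core API.

```ruby
# Hypothetical helper (the name is an assumption, not a core method):
# returns an array with one MatchData per occurrence of pattern in str.
def all_matches(str, pattern)
  # scan yields once per match and updates Regexp.last_match before each yield.
  str.to_enum(:scan, pattern).map { Regexp.last_match }
end

input = "abc12def34ghijklmno567pqrs"
matches = all_matches(input, /\d+/)

# Each element is a full MatchData, so begin/end offsets are available.
spans = matches.map { |m| [m[0], m.begin(0), m.end(0)] }
p spans
```

Note that end(0) is the offset just past the match, so the pair works directly as an exclusive range such as input[3...5].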

answered Jul 24 '11 at 15:29 by i-blis · edited Jun 17 '20 at 23:59 by the Tin Man

Thank you. This just made my life 10 times easier. – Linuxios Dec 28 '12 at 18:11
This should be on apidock.com or similar. You saved me from at least 10 new grey hairs :) – nex Apr 15 '14 at 10:20
It's unbelievable that there isn't a built-in method for this, that we have to resort to a hack like this. – Miscreant Feb 19 '16 at 06:00

score 9 · Answer 2 · edited Jun 18 '20 at 00:06

I’ll put it here to make the code available via a search:

input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/
input.gsub(numbers) { |m| p $~ }

The result is as requested:

⇒ #<MatchData "12">
⇒ #<MatchData "34">
⇒ #<MatchData "567">

See "input.gsub(numbers) { |m| p $~ } Matching data in Ruby for all occurrences in a string" for more information.

answered Feb 02 '13 at 13:34 by Aleksei Matiushkin · edited Jun 18 '20 at 00:06 by the Tin Man

Thanks for doing that, works perfectly, especially as I wanted to actually use `gsub` anyway. – rjh May 05 '14 at 14:53
Rather than do this, use `scan` if all you intend to do is get the MatchData. It communicates the intention more clearly. – Justin Aug 05 '15 at 20:59
@justin, the question *explicitly* says that `scan` does not return MatchData's, but just an array of matched strings. – DeFazer Mar 03 '17 at 20:06
@DeFazer it's been a while, but iirc, `$~` is the `MatchData` for the last match, which would make my comment relevant still – Justin Mar 04 '17 at 03:22
@Justin, technically, you are right. `$~` is, indeed, the `MatchData` for the last match. However, there is a little trick - since `gsub` sets `$~` anew on each iteration, `{ |m| p $~ }` returns a different `MatchData` each time. Besides, I'm not sure I understand how `scan` can be useful in getting `MatchData`'s. Can you explain this part, please? – DeFazer Mar 04 '17 at 13:51
@DeFazer as a drop in replacement for gsub here. http://ideone.com/tRfi12 – Justin Mar 04 '17 at 16:32
@Justin oh! I see. Thanks, now I get what you mean. – DeFazer Mar 04 '17 at 17:47
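The contents of the ideone link are not reproduced here, so the following is only a guess at the kind of drop-in replacement Justin describes. It assumes the point is that scan, when given a block, also sets $~ before each yield, just as gsub does, but without building a substituted copy of the string.

```ruby
input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/

found = []
# With a block, scan yields each match and sets $~ first, so the
# MatchData can be collected without performing any substitution.
input.scan(numbers) { found << $~ }

found.each { |m| puts "#{m[0]} at #{m.begin(0)}...#{m.end(0)}" }
```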

score 9 · Answer 3 · edited Jun 18 '20 at 00:04

My current solution is to add an each_match method to Regexp:

class Regexp
  def each_match(str)
    start = 0
    while matchdata = self.match(str, start)
      yield matchdata
      start = matchdata.end(0)
    end
  end
end

Now I can do:

numbers.each_match input do |match|
  puts "Found #{match[0]} at #{match.begin(0)} until #{match.end(0)}"
end

Tell me there is a better way.
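One caveat with the loop above is that a pattern which can match the empty string (for example /\d*/) never advances start, so it would spin forever. Below is a sketch of the same idea as a standalone enumerator with a guard for that case; the name match_enum is hypothetical, and it is written as a plain method rather than a patch to Regexp purely to keep the example self-contained.

```ruby
# Hypothetical helper (name is an assumption): lazily yields one
# MatchData per occurrence of pattern in str.
def match_enum(pattern, str)
  Enumerator.new do |y|
    start = 0
    while start <= str.length && (m = pattern.match(str, start))
      y << m
      # Step one character past an empty match so the loop always progresses.
      start = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
    end
  end
end

input = "abc12def34ghijklmno567pqrs"
match_enum(/\d+/, input).each do |match|
  puts "Found #{match[0]} at #{match.begin(0)} until #{match.end(0)}"
end
```

Because it returns an Enumerator, the usual map, select, and first calls work on it directly.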

answered Jul 24 '11 at 02:19 by Joshua Flanagan · edited Jun 18 '20 at 00:04 by the Tin Man

this should actually be appended to your original question, unless you intend it to be the answer. – the Tin Man Jul 24 '11 at 02:45
Also, `while matchdata = self.match(str, start)` is considered a very hard to maintain construct because it is difficult to know if this is an error or intentional. – the Tin Man Jul 24 '11 at 02:47
Why should it be appended to the question? It's an answer. I'm just hoping there is a better answer, which is why I didn't just accept my own. If a better answer isn't found, then eventually I will mark it as the answer. – Joshua Flanagan Jul 24 '11 at 02:52
Please reread what I wrote. Append it *UNLESS* you intend it to be the answer. Stack Overflow prefers that information added by the original poster be appended to your original question, however answers provided by the OP can be added as an answer. http://stackoverflow.com/faq#howtoask – the Tin Man Jul 24 '11 at 16:45
It's clean, it's easy to read and it works just fine. You could write it as an [enumerator](http://stackoverflow.com/a/43167606/6419007) if you wish. I didn't notice your answer before writing mine. They're basically the same. – Eric Duminil Apr 02 '17 at 11:19

score 4 · Answer 4 · answered Nov 23 '17 at 01:58

I'm surprised nobody mentioned the amazing StringScanner class included in Ruby's standard library:

require 'strscan'

s = StringScanner.new('abc12def34ghijklmno567pqrs')

while s.skip_until(/\d+/)
  num, offset = s.matched.to_i, [s.pos - s.matched_size, s.pos - 1]

  # ..
end

No, it doesn't give you the MatchData objects, but it does give you an index-based interface into the string.
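For completeness, here is a small runnable sketch of the StringScanner loop that collects each matched string together with its offsets. The variable names are illustrative only, and it records an exclusive end offset rather than the inclusive one used above. Keep in mind that pos and matched_size count bytes, so the offsets only line up with character indices for single-byte text like this input.

```ruby
require 'strscan'

scanner = StringScanner.new('abc12def34ghijklmno567pqrs')
found = []

# skip_until advances the scan pointer to the end of the next match,
# or returns nil when there are no more matches.
while scanner.skip_until(/\d+/)
  finish = scanner.pos                    # byte offset just past the match
  start  = finish - scanner.matched_size  # byte offset where the match begins
  found << [scanner.matched, start, finish]
end

p found
```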

score 0 · Answer 5 · answered Nov 23 '17 at 00:55

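input = "abc12def34ghijklmno567pqrs"
n = Regexp.new("\\d+")
# Resume each search at the end of the previous match (not end + 1, which would
# skip a match that starts immediately after it); compact drops the final nil.
[n.match(input)].tap { |a| a << n.match(input, a.last.end(0)) until a.last.nil? }.compact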

=> [#<MatchData "12">, #<MatchData "34">, #<MatchData "567">]

answered Nov 23 '17 at 00:55 by Lyndon S
