44

I need the MatchData for each occurrence of a regular expression in a string. This is different from the scan method suggested in Match All Occurrences of a Regex, since that only gives me an array of strings (I need the full MatchData, to get begin and end information, etc.).

input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/

numbers.match input # #<MatchData "12"> (only the first match)
input.scan numbers  # ["12", "34", "567"] (all matches, but only the strings)

I suspect there is some method that I've overlooked. Suggestions?

Joshua Flanagan
  • I want the begin and end positions for each match. But that is irrelevant to my question. MatchData exists for a reason, doesn't it? If I can get it for the first match, it follows that it would be useful for all matches. – Joshua Flanagan Jul 24 '11 at 02:32
  • Ok, I want more than one thing, in a convenient package, for each match. – Joshua Flanagan Jul 24 '11 at 02:54
  • You have the convenient package, as you name it, in the solution I gave below (from which you can get begin, end or whatever match data you need as you wish) . Or is it anything else that you are looking for? – i-blis Jul 24 '11 at 22:29

5 Answers

75

You want

"abc12def34ghijklmno567pqrs".to_enum(:scan, /\d+/).map { Regexp.last_match }

which gives you

[#<MatchData "12">, #<MatchData "34">, #<MatchData "567">] 

The "trick" is to build an enumerator over scan: scan sets Regexp.last_match for each occurrence as it iterates, so the block can collect the MatchData for every match.
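As a minimal sketch of how the resulting MatchData objects can be used, the snippet below pulls the matched text and its begin and end offsets out of each one. The variable names `matches` and `spans` are only illustrative, and the end offset is exclusive, as with `MatchData#end`.

```ruby
input = "abc12def34ghijklmno567pqrs"

# scan sets Regexp.last_match on every iteration, so mapping over the
# enumerator collects one MatchData per occurrence.
matches = input.to_enum(:scan, /\d+/).map { Regexp.last_match }

# Build [text, start, end] triples; the end offset is exclusive.
spans = matches.map { |m| [m[0], m.begin(0), m.end(0)] }

spans.each { |text, from, to| puts "#{text}: #{from}...#{to}" }
```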

the Tin Man
i-blis
9

I’ll put it here to make the code available via a search:

input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/
input.gsub(numbers) { |m| p $~ }

The result is as requested:

⇒ #<MatchData "12">
⇒ #<MatchData "34">
⇒ #<MatchData "567">

See "input.gsub(numbers) { |m| p $~ } Matching data in Ruby for all occurrences in a string" for more information.
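If the goal is to keep the MatchData objects rather than print them, one possible variation (a sketch, not part of the original answer) is to push `$~` onto an array inside the block. The `matches` and `result` names are illustrative. Because the block's return value becomes the replacement text, returning the matched string leaves the output equal to the input.

```ruby
input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/

matches = []
# gsub updates $~ for each occurrence before calling the block.
result = input.gsub(numbers) do |m|
  matches << $~ # MatchData for the current occurrence
  m             # return the match unchanged so the string is not altered
end

matches.each { |md| puts "#{md[0]} at #{md.begin(0)}" }
```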

the Tin Man
Aleksei Matiushkin
  • Thanks for doing that, works perfectly, especially as I wanted to actually use `gsub` anyway. – rjh May 05 '14 at 14:53
  • Rather than do this, use `scan` if all you intend to do is get the MatchData. It communicates intention clearer. – Justin Aug 05 '15 at 20:59
  • @justin, the question *explicitly* says that `scan` does not return MatchData's, but just an array of matched strings. – DeFazer Mar 03 '17 at 20:06
  • @DeFazer it's been a while, but iirc, `$~` is the `MatchData` for the last match, which would make my comment relevant still – Justin Mar 04 '17 at 03:22
  • @Justin, technically, you are right. `$~` is, indeed, the `MatchData` for the last match. However, there is a little trick - since `gsub` sets `$~` multiple times per iteration, on each iteration `{ |m| p $~ }` returns different `MatchData`'s. Besides, I'm not sure I understand how `scan` can be useful in getting `MatchData`'s. Can you explain this part, please? – DeFazer Mar 04 '17 at 13:51
  • @DeFazer as a drop in replacement for gsub here. http://ideone.com/tRfi12 – Justin Mar 04 '17 at 16:32
  • @Justin oh! I see. Thanks, now I get what you mean. – DeFazer Mar 04 '17 at 17:47
9

My current solution is to add an each_match method to Regexp:

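class Regexp
  # Yields the MatchData for each non-overlapping match of self in str.
  def each_match(str)
    return to_enum(:each_match, str) unless block_given?

    start = 0
    while (matchdata = match(str, start))
      yield matchdata
      # Step past zero-width matches so the loop cannot stall.
      start = matchdata.end(0) + (matchdata[0].empty? ? 1 : 0)
    end
  end
end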

Now I can do:

numbers.each_match input do |match|
  puts "Found #{match[0]} at #{match.begin(0)} until #{match.end(0)}"
end

Tell me there is a better way.
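The same loop can also be written without reopening Regexp. The sketch below wraps it in an Enumerator so the matches can be mapped, filtered, or taken lazily. `match_enum` is a hypothetical helper name, not a Ruby built-in, and the extra step for empty matches is an assumption added so that patterns such as `/\d*/` do not loop forever.

```ruby
# Hypothetical helper (not part of Ruby): returns an Enumerator that
# produces one MatchData per non-overlapping match of regex in str.
def match_enum(regex, str)
  Enumerator.new do |yielder|
    pos = 0
    while pos <= str.length && (m = regex.match(str, pos))
      yielder << m
      # Advance past zero-width matches so the loop always makes progress.
      pos = m.end(0) + (m[0].empty? ? 1 : 0)
    end
  end
end

input = "abc12def34ghijklmno567pqrs"

# MatchData#offset(0) returns [begin, end] for the whole match.
offsets = match_enum(/\d+/, input).map { |m| m.offset(0) }
first_match = match_enum(/\d+/, input).first
```

Because the enumerator is evaluated on demand, `first` stops after the first match instead of scanning the whole string.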

the Tin Man
Joshua Flanagan
  • this should actually be appended to your original question, unless you intend it to be the answer. – the Tin Man Jul 24 '11 at 02:45
  • Also, `while matchdata = self.match(str, start)` is considered a very hard to maintain construct because it is difficult to know if this is an error or intentional. – the Tin Man Jul 24 '11 at 02:47
  • Why should it be appended to the question? It's an answer. I'm just hoping there is a better answer, which is why I didn't just accept my own. If a better answer isn't found, then eventually I will mark it as the answer. – Joshua Flanagan Jul 24 '11 at 02:52
  • Please reread what I wrote. Append it *UNLESS* you intend it to be the answer. Stack Overflow prefers that information added by the original poster be appended to your original question, however answers provided by the OP can be added as an answer. http://stackoverflow.com/faq#howtoask – the Tin Man Jul 24 '11 at 16:45
  • It's clean, it's easy to read and it works just fine. You could write it as an [enumerator](http://stackoverflow.com/a/43167606/6419007) if you wish. I didn't notice your answer before writing mine. They're basically the same. – Eric Duminil Apr 02 '17 at 11:19
4

I'm surprised nobody mentioned the amazing StringScanner class included in Ruby's standard library:

require 'strscan'

s = StringScanner.new('abc12def34ghijklmno567pqrs')

while s.skip_until(/\d+/)
  num, offset = s.matched.to_i, [s.pos - s.matched_size, s.pos - 1]

  # ..
end

No, it doesn't give you the MatchData objects, but it does give you an index-based interface into the string.
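To make the index-based interface concrete, here is a small sketch that collects the matched text with its start and (exclusive) end position for every occurrence. The `found` array is illustrative. Note that `pos` is a byte offset, so for non-ASCII input the values may differ from character indexes.

```ruby
require 'strscan'

s = StringScanner.new('abc12def34ghijklmno567pqrs')
found = []

# skip_until returns nil once the pattern no longer matches.
while s.skip_until(/\d+/)
  # s.pos sits just past the match; subtract its size to get the start.
  found << [s.matched, s.pos - s.matched_size, s.pos]
end

found.each { |text, from, to| puts "#{text}: #{from}...#{to}" }
```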

mwp
0
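input = "abc12def34ghijklmno567pqrs"
n = Regexp.new("\\d+")
# Resume each search exactly where the previous match ended (adding 1 would skip a character
# and could miss an adjacent match); the trailing nil is dropped by [0..-2].
[n.match(input)].tap { |a| a << n.match(input, a.last.end(0)) until a.last.nil? }[0..-2]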

=> [#<MatchData "12">, #<MatchData "34">, #<MatchData "567">]
Lyndon S