
I guess this will be a silly mistake but for me, the following returns an array containing only "M". See this:

/(.)+?/.match("Many many characters!").captures
=> ["M"]

Why doesn't it return an array of every character? I must have missed something blatantly obvious, because I can't see what's wrong with this.

Edit: Just realised, I don't need the +? but it still doesn't work without it.

Edit: Apologies! I will clarify. My goal is to let users enter a regular expression, some styling, and an input text file; wherever there is a match, the text will be wrapped in an HTML element and the styling applied. I am not just splitting the string into characters; I only used the regex above because it was the simplest, although that was stupid on my part. How do I get capture groups from scan(), or is that not possible? I see that $1 contains "!" (the last match?) and not any of the others.

Edit: Gosh, it really isn't my day. As injekt has informed me, the captures are stored in separate arrays. How do I get the offset of these captures in the original string? I would like to be able to get the offset of a capture and then surround it with another string. Or is that what gsub is for? (I thought that only replaced the whole match, not a capture group.)

Hopefully final edit: Right, let me just start this again :P

So, I have a string. The user will use a configuration file to enter a regular expression, then a style associated with each capture group. I need to be able to scan the entire string and get the start and finish or offset and size of each group match.

So if a user had configured ([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4}) (email address) then I should be able to get:

[ ["elliotpotts", 0,  11],
  ["sample.",     12, 7],
  ["com",         19, 3] ]

from the string: "elliotpotts@sample.com"

If that is not clear, there is simply something wrong with me :P. Thanks a lot so far guys, and thank you for being so patient!

Ell
  • I just saw your edit, capture groups from scan are stored in separate arrays, just try your regexp and a test string in irb you'll see. The answers still stand the same with your included edit – Lee Jarvis Oct 03 '11 at 18:39
  • Just saw your next edit, you'll have to update with more information. I'm a little confused now :P Feel free to throw up a more complete example no matter how contrived it is so we know exactly what you need to extract – Lee Jarvis Oct 03 '11 at 18:44
  • Alright, updated my answer with your latest edit. I'm a little tied for time right now so it's just the complete solution with no explanation, let me know if it doesn't make sense and I'll update it – Lee Jarvis Oct 03 '11 at 19:02

4 Answers


Because your capture is only matching one single character. (.)+ is not the same as (.+)

>> /(.)+?/.match("Many many characters!").captures
=> ["M"]
>> /(.+)?/.match("Many many characters!").captures
=> ["Many many characters!"]
>> /(.+?)/.match("Many many characters!").captures
=> ["M"]
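As a rough illustration of the difference (a sketch with a made-up three-character input, not part of the original answer): when the quantifier sits outside the parentheses there is still only one group, and it is overwritten on each repetition, so only the last repetition is kept.

```ruby
text = "abc"

# Quantifier outside the group: one group, overwritten on every repetition,
# so only the last repetition survives.
outside = /(.)+/.match(text).captures

# Quantifier inside the group: the group spans the whole run.
inside = /(.+)/.match(text).captures

# Lazy quantifier outside the group: matching stops after the first repetition.
lazy = /(.)+?/.match(text).captures

p outside  # ["c"]
p inside   # ["abc"]
p lazy     # ["a"]
```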

If you want to match every character repeatedly, use String#scan, or String#split if you don't care about capture groups.

Using scan:

"Many many characters!".scan(/./)
#=> ["M", "a", "n", "y", " ", "m", "a", "n", "y", " ", "c", "h", "a", "r", "a", "c", "t", "e", "r", "s", "!"]

Note that the other answers use (.). That's fine if you care about the capture group, but it's a little pointless if you don't, because scan will then return EVERY CHARACTER in its own separate Array, like this:

[["M"], ["a"], ["n"], ["y"], [" "], ["m"], ["a"], ["n"], ["y"], [" "], ["c"], ["h"], ["a"], ["r"], ["a"], ["c"], ["t"], ["e"], ["r"], ["s"], ["!"]]

Otherwise, just use split with an empty separator: "Many many characters!".split('')

EDIT In reply to your edit:

reg = /([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4})/
str = "elliotpotts@sample.com"
str.scan(reg).flatten.map { |capture| [capture, str.index(capture), capture.size] }
#=> [["elliotpotts", 0, 11], ["sample.", 12, 7], ["com", 19, 3]]

Oh, and you don't need scan here. You're not really scanning, so you don't need to traverse the string, at least not with the example you provided:

str.match(reg).captures.map { |capture| [capture, str.index(capture), capture.size] }

Will also work

Lee Jarvis
  • Thank you! I have also found an alternative answer and will post it now. Thank you! – Ell Oct 03 '11 at 19:11
  • The two code snippets given do not work correctly for the offsets in the general case, they only work if the matched substrings are all different. If, for example, there are 3 matches for "h" then the same index (the first instance of 'h') is returned all 3 times. the str.index(capture) returns the index of the FIRST instance of the captured substring. – jpw Jul 07 '13 at 05:02

Yes, something important was missed ;-)

(...) only introduces ONE capture group: the number of times the group matches is irrelevant as the index is determined only by the regular expression itself and not the input.

The key is a "global regular expression", which applies the regular expression multiple times in order. In Ruby this is done by switching from Regexp#match to String#scan (many other languages have a "/g" regular expression modifier instead):

"Many many characters!".scan(/(.)+?/)
# but more simply (or see answers using String#split)
"Many many characters!".scan(/(.)/)
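For reference, a small sketch of what scan hands back with and without groups (the input string and patterns here are made up for illustration): without groups you get the matched text itself, and with groups you get one inner array of captures per match.

```ruby
text = "a1 b2 c3"

# No groups: scan returns the full text of every match.
whole = text.scan(/[a-z]\d/)

# With groups: scan returns one inner array per match, holding that
# match's captures in group order.
grouped = text.scan(/([a-z])(\d)/)

p whole    # ["a1", "b2", "c3"]
p grouped  # [["a", "1"], ["b", "2"], ["c", "3"]]
```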

Happy coding


It's only returning one character because that's all you've asked it to match. You probably want to use scan instead:

str = "Many many characters!"
matches = str.scan(/(.)/)
CanSpice

The following code is taken from "Get index of string scan results in ruby" and modified to my liking.

[].tap {|results|
    "abab".scan(/a/) {|capture|
        results.push(([capture, Regexp::last_match.offset(0)]).flatten)
    }
}

=> [["a", 0, 1], ["a", 2, 3]]
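The same idea can probably be extended to capture groups, which is what the question asks for. The sketch below is only one possible approach: capture_spans is a hypothetical helper name, and the email pattern is a lightly tidied equivalent of the one in the question (the character class is written as [\w.-]). It reads positions from the MatchData via MatchData#begin rather than String#index, so repeated substrings get their own offsets.

```ruby
# Hypothetical helper: returns [text, offset, size] for every capture
# group of every match of +regex+ in +str+.
def capture_spans(str, regex)
  spans = []
  str.scan(regex) do
    m = Regexp.last_match          # MatchData for the current match
    (1...m.size).each do |i|       # group 0 is the whole match, so skip it
      next if m[i].nil?            # group did not participate in this match
      spans << [m[i], m.begin(i), m[i].size]
    end
  end
  spans
end

email = /([\w.-]+)@((?:\w+\.)+)([a-zA-Z]{2,4})/
p capture_spans("elliotpotts@sample.com", email)
# [["elliotpotts", 0, 11], ["sample.", 12, 7], ["com", 19, 3]]

p capture_spans("abab", /(a)/)
# [["a", 0, 1], ["a", 2, 1]]
```

Wrapping each span in an HTML element would then be a matter of inserting the tags at those offsets, working from the end of the string backwards so earlier offsets stay valid.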
Ell