1

Given the following string, I'd like to match the elements of the list and parts of the rest after the colon:

foo,bar,baz:something

I.e. I am expecting the first three match groups to be "foo", "bar", "baz". No commas and no colon. The minimum number of elements is 1, and there can be arbitrarily many. Assume no whitespace and lower case.

I've tried this, which should work, but doesn't populate all the match groups for some reason:

^([a-z]+)(?:,([a-z]+))*:(something)

That matches foo in \1 and baz (or whatever the last element is) in \2. I don't understand why I don't get a match group for bar.

Any ideas?

EDIT: Ruby 1.9.3, if that matters.

EDIT2: Rubular link: http://rubular.com/r/pDhByoarbA

EDIT3: Add colon to the end, because I am not just trying to match the list. Sorry, oversimplified the problem.

Christoph

4 Answers

4

This expression works for me: /(\w+)/i
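A minimal Ruby sketch of how this could be applied, assuming (as discussed in the comments) that the string is first split at the colon. The variable names are illustrative only:

```ruby
line = "foo,bar,baz:something"

# Separate the list from whatever follows the first colon.
list, rest = line.split(":", 2)

# With a capture group in the pattern, scan returns one array per match,
# so flatten to get a plain array of strings.
elements = list.scan(/(\w+)/i).flatten

p elements  # => ["foo", "bar", "baz"]
p rest      # => "something"
```

Dropping the parentheses (`list.scan(/\w+/i)`) would give the same array without needing `flatten`.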

Paul Simpson
  • Good call. It doesn't work if there's other stuff behind the list that has to be matched, most simply a $. Let me update my question. Thanks! – Christoph Apr 28 '12 at 21:23
  • So you're trying to match "baz:" for the third group? – Paul Simpson Apr 28 '12 at 21:26
  • No, just baz. No colon. The list is part of a larger string. I thought omitting some of it would help, but did create confusion. Sorry about that. – Christoph Apr 28 '12 at 21:27
  • Correct. Sorry, there's more after the : that I want to match, so as soon as the colon is added to the regex, it won't match everything in the list. – Christoph Apr 28 '12 at 21:32
  • See newest rubular link. Sorry again for the confusion. – Christoph Apr 28 '12 at 21:34
  • I am accepting this one, because it's elegant and, if I just split my string at the colon, works great. Thanks! – Christoph Apr 28 '12 at 21:57
1

If you want to do it with regex, how about this?

(?<=^|,)("[^"]*"|[^,]*)(?=,|$)

This matches comma-separated fields, including the possibility of commas appearing inside quoted strings like 123,"Yes, No". Regexr for this.

More verbosely:

(?<=^|,)       # Must be preceded by start-of-line or comma
(
    "[^"]*"|   # A quote, followed by a bunch of non-quotes, followed by a quote,
    [^,]*      # OR anything until the next comma
)
(?=,|$)        # Must end with comma or end-of-line

Usage would be with something like Python's re.findall(), which returns all non-overlapping matches in the string, working from left to right. Don't use it with your equivalent of re.search() or re.match(), which return only the first match found.

(NOTE: This actually doesn't work in Python because the lookbehind (?<=^|,) isn't fixed width. Grr. Open to suggestions on this one.)


Edit: Use a non-capturing group to consume start-of-line or comma, instead of a lookbehind, and it works in Python.

>>> import re
>>> test_str = '123,456,"String","String, with, commas","Zero-width fields next",,"",nyet,123'
>>> m = re.findall(r'(?:^|,)("[^"]*"|[^,]*)(?=,|$)', test_str)
>>> m
['123', '456', '"String"', '"String, with, commas"',
 '"Zero-width fields next"', '', '""', 'nyet', '123']

Edit 2: The Ruby equivalent of Python's re.findall(needle, haystack) is haystack.scan(needle).
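For illustration, here is a rough Ruby sketch of the same call, reusing the non-lookbehind pattern and the test string from the Python session above. The variable names are my own, and `flatten` is needed because `scan` returns one array per match when the pattern contains a capture group:

```ruby
test_str = '123,456,"String","String, with, commas","Zero-width fields next",,"",nyet,123'

# (?:^|,) consumes the start-of-line or the separating comma;
# the capture group holds the field itself (quoted or unquoted).
fields = test_str.scan(/(?:^|,)("[^"]*"|[^,]*)(?=,|$)/).flatten

fields.each { |f| puts f.inspect }
```

Quoted fields keep their surrounding quotes here, just as in the Python output.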

Li-aung Yip
  • Thanks, but this isn't Ruby and it's overkill for what I need. It also doesn't solve the problem, which has changed slightly from when I posted it. Sorry for the confusion! – Christoph Apr 28 '12 at 21:51
  • @Christoph: [You can't pull out an arbitrary number of match groups (`\1, \2, \3...`) with one match.](http://stackoverflow.com/questions/464736/python-regular-expressions-how-to-capture-multiple-groups-from-a-wildcard-expr) You're going to need `string.scan(pattern)`. Also the fact that the example is in Python is no obstacle to using it in Ruby - regular expressions are [mostly the same between them.](http://www.regular-expressions.info/refflavors.html) – Li-aung Yip Apr 28 '12 at 21:54
  • Yeah maybe. Or capture a repeating group like explained here: http://www.regular-expressions.info/captureall.html – Christoph Apr 28 '12 at 21:55
  • @Christoph: That's basically suggesting that you wrap your entire regexp up in a capturing parenthesis group. That will still only give you one group, `\1`: `foo,bar,baz` (which may be good enough for you.) If there are arbitrary numbers of fields, you can't get `foo`, `bar`, `baz` separately as `\1, \2, \3`. If you need that, you need Ruby's `string.scan()`. – Li-aung Yip Apr 28 '12 at 21:58
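
To illustrate the point from the comments above, a small Ruby sketch using the regex and example string from the question. A repeated capture group keeps only its last iteration, so `bar` is overwritten by `baz`. The second half is one possible workaround (my own assumption, not something proposed verbatim in the thread): match the list part first, then `scan` it.

```ruby
line = "foo,bar,baz:something"

# The repeated group ([a-z]+) inside (?:...)* only remembers its last match.
m = line.match(/^([a-z]+)(?:,([a-z]+))*:(something)/)
captures = m.captures
p captures  # => ["foo", "baz", "something"]

# Possible workaround: grab everything before the colon, then scan it.
list_part = line[/^[a-z]+(?:,[a-z]+)*(?=:)/]
elements = list_part.scan(/[a-z]+/)
p elements  # => ["foo", "bar", "baz"]
```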
0

Maybe split would be a better solution for this case?

'foo,bar,baz'.split(',')
=> ["foo", "bar", "baz"]
Flexoid
  • I know I can split. This is something that should be possible in regex and just an exercise to improve my regex-fu. Thanks though! – Christoph Apr 28 '12 at 21:21
  • Actually, if commas may appear inside quoted strings, `str.split()` will do the wrong thing. OP didn't specify what his actual input is - but something to consider. ;) – Li-aung Yip Apr 28 '12 at 21:46
0

If I am interpreting your post correctly, you want everything separated by commas before the colon (:).

The appropriate regex for this would be:

[^\s:]*(,[^\s:]*)*(:.*)?

This should find everything you are looking for.
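
As a quick check, here is a Ruby sketch run against the example string from the question (variable names are illustrative). Note that `[^\s:]` also matches commas, so the first part of the pattern consumes the whole list on its own. The pattern matches the string, but it does not appear to separate the individual elements into groups:

```ruby
line = "foo,bar,baz:something"

m = line.match(/[^\s:]*(,[^\s:]*)*(:.*)?/)

whole  = m[0]  # the entire matched text
group1 = m[1]  # the repeated comma group (never entered here)
group2 = m[2]  # the colon and everything after it

p whole   # => "foo,bar,baz:something"
p group1  # => nil
p group2  # => ":something"
```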

C.Holloway