3

I am trying to parse a pattern with regular expressions in Ruby. The pattern is something like,

<number>? <comma>? <number>? <term>*

where:

  • number is one or more digits
  • comma is ","
  • term is of the form [.*] or [^.*]

And I am trying to capture the numbers, and all the terms. To clarify, here are some examples of valid patterns:

5,50[foo,bar]
5,[foo][^apples]
10,100[baseball][^basketball][^golf]
,55[coke][pepsi][^drpepper][somethingElse]

In the first, I'd like to capture 5, 50, and [foo,bar]. In the second, I'd like to capture 5, [foo], and [^apples], and so on.

The pattern I came up with is:

/(\d+)?,?(\d+)?(\[\^?[^\]]+\])+/

but this only captures the numbers and the last term. If I remove the + at the end, it captures only the first term.

the Tin Man
Anurag

2 Answers

1

The easiest solution I can think of, with minimal effort, is to add one more capture group that wraps the existing group together with its +, i.e.

/(\d+)?,?(\d+)?((\[\^?[^\]]+\])+)/

Also, you could probably simplify the \d expressions by using (\d*) instead of (\d+)?.
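
The two forms are not quite interchangeable in what they capture: when a number is absent, (\d+)? leaves its group as nil, while (\d*) still takes part in the match and yields an empty string. A minimal sketch of the difference; the helper `first_number` and the constant names are made up for illustration and are not part of the original answer:

```ruby
# Hypothetical helper: returns whatever the first capture group holds,
# or nil if the regex does not match at all.
def first_number(regex, input)
  m = regex.match(input)
  m && m[1]
end

# The two variants discussed above (constant names are assumptions).
OPTIONAL_GROUP = /(\d+)?,?(\d+)?((\[\^?[^\]]+\])+)/
STAR_GROUP     = /(\d*),?(\d*)((\[\^?[^\]]+\])+)/

p first_number(OPTIONAL_GROUP, ",55[coke]")  # prints nil
p first_number(STAR_GROUP, ",55[coke]")      # prints ""
p first_number(STAR_GROUP, "5,[foo]")        # prints "5"
```

Which one is preferable probably depends on whether the calling code finds it easier to test for nil or for an empty string.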

EDIT

Here's the code used to test the above suggestions:

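# Sample inputs taken from the question
matches = [ "5,50[foo,bar]",
            "5,[foo][^apples]",
            "10,100[baseball][^basketball][^golf]",
            ",55[coke][pepsi][^drpepper][somethingElse]"
          ]

# Group 3 wraps the repeated term group, so it captures all terms together
re = Regexp.new('(\d*),?(\d*)((\[\^?[^\]]+\])+)')

matches.each do |match|
  m = re.match(match)

  puts "\nMatching: #{match}"
  puts "--------------------"

  # m[1] and m[2] are the numbers (empty if absent); m[3] is every term combined
  puts "Match 1: #{m[1]}"
  puts "Match 2: #{m[2]}"
  puts "Match 3: #{m[3]}"
end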

and the output:

Matching: 5,50[foo,bar]
--------------------
Match 1: 5
Match 2: 50
Match 3: [foo,bar]

Matching: 5,[foo][^apples]
--------------------
Match 1: 5
Match 2: 
Match 3: [foo][^apples]

Matching: 10,100[baseball][^basketball][^golf]
--------------------
Match 1: 10
Match 2: 100
Match 3: [baseball][^basketball][^golf]

Matching: ,55[coke][pepsi][^drpepper][somethingElse]
--------------------
Match 1: 
Match 2: 55
Match 3: [coke][pepsi][^drpepper][somethingElse]

EDIT 2

If you want the terms tokenized, as per J-_-L's suggestion to use the scan method, add:

m[3].scan(/\[\^?[^\]]+\]/)
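
Put together, the match-then-scan approach could look roughly like the sketch below. The method name `parse_pattern` and the constants are hypothetical; only the regexes come from the answer above.

```ruby
# Regexes from the answer; the constant names are assumptions.
TERM    = /\[\^?[^\]]+\]/
PATTERN = /(\d*),?(\d*)((\[\^?[^\]]+\])+)/

# Hypothetical helper: returns [first_number, second_number, terms],
# or nil when the input contains no match.
def parse_pattern(input)
  m = PATTERN.match(input)
  return nil unless m

  # m[3] holds all terms as one string; scan splits it into individual terms.
  [m[1], m[2], m[3].scan(TERM)]
end

p parse_pattern("5,[foo][^apples]")
# prints ["5", "", ["[foo]", "[^apples]"]]
```

Because the scan regex has no capture groups, scan returns a flat array of the matched terms.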
photoionized
  • I've tried this in Ruby and JavaScript - but that is returning all the terms combined, and only the last term separately. Since it is returning all the terms combined - `[foo][^apples]` in the second example, and the last term separately - `[^apples]`, I'm guessing it is able to find the matches, but they're not present in the output anywhere. No idea what I am missing. – Anurag May 20 '11 at 00:30
  • Maybe I'm misunderstanding something... are you trying to effectively tokenize each of the "terms"? If so, regexes aren't the way to go for the tokenization part; just split based on `][` after capturing all of the "terms" together--no language to my knowledge allows a variable number of capture groups in its regex engine. I wrote up a quick and dirty check in Ruby which I'll post as an edit. Tell me if I'm misreading your question. – photoionized May 20 '11 at 00:46
  • Thanks for the suggestion on splitting the original input, and then scanning the grouped string. It works beautifully. – Anurag May 20 '11 at 01:12
1

It's the same problem as here: a regex has only a fixed number of capture groups, so a repeated group keeps only its last match.
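
This is easy to see with the question's original pattern. A minimal sketch; the helper name `original_captures` is hypothetical:

```ruby
# Hypothetical helper: returns the capture groups produced by the
# question's original regex, or nil if there is no match.
def original_captures(input)
  m = /(\d+)?,?(\d+)?(\[\^?[^\]]+\])+/.match(input)
  m && m.captures
end

p original_captures("5,[foo][^apples]")
# prints ["5", nil, "[^apples]"]
```

Group 3 matched twice, but only the final iteration is kept, so `[foo]` never shows up in the captures.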

In your case, I would split the string (e.g. with photoionized's method) and then do a scan (for example with (\[\^?[^\]]+\])) to get the individual groups.
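
One detail to watch for: when the regex passed to scan contains a capture group, as in the parenthesized form above, Ruby's String#scan returns an array of arrays (one inner array per match) rather than a flat list. A small sketch with hypothetical helper names:

```ruby
# Hypothetical helpers, for illustration only.

# Regex with a capture group: each match becomes a one-element array.
def scan_with_group(terms)
  terms.scan(/(\[\^?[^\]]+\])/)
end

# Same regex without the group: each match is a plain string.
def scan_without_group(terms)
  terms.scan(/\[\^?[^\]]+\]/)
end

p scan_with_group("[foo][^apples]")          # prints [["[foo]"], ["[^apples]"]]
p scan_with_group("[foo][^apples]").flatten  # prints ["[foo]", "[^apples]"]
p scan_without_group("[foo][^apples]")       # prints ["[foo]", "[^apples]"]
```

Either dropping the parentheses or calling flatten on the result gives the flat list of terms.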

J-_-L
  • @J - Thanks. It works great. I am going to go with `treetop` and create a small parser to do this, as it feels a little cleaner. Appreciate your help. – Anurag May 20 '11 at 01:13