0

I'm working on an HTML5/JavaScript game engine, and I have started to encounter a scenario I haven't ever been in the past, and can't figure out how I can pull this off.

Simply put, I want to split a string into an array, by a character - so long as that character is not within parenthesis.

Basically, in XML files for things like items/tiles, I store "triggers", which are statements giving rules for operations the code will perform. The different parameters of a single trigger are split up with a colon (:), and more than one trigger can be in place for an item, whereby each trigger is split by a comma. Here's an example:

<response trigger="npc:self:dialog:1:3">No, thank you.</response>

(This is basically saying: if this response is selected, make the NPC who asked the initial question cycle to the a specific message of a specific conversion)

Moving along: I have come to need the ability to encapsulate callback triggers within parenthesis of parameters with certain triggers. Here's an example:

<response trigger="shop:open:1:(npc:self:dialog:1:4)">Yes, please.</response>

(This is basically saying: open up a specific store, and when the store is closed, jump to a specific conversation/message for the speaking NPC)

The idea is that when a store is closed, I can invoke the 4th parameter of that trigger (which is a trigger itself). As I am sure you have guessed, the problem here is that if I split the initial trigger-string based on ":", then it breaks up the callback trigger as other (messy) parameters of the main trigger. I don't want that. Nor, do I want to do anything like splitting secondary triggers by another character (for generation reasons later on, and because I imagine there will be times where I will want to nest lots of triggers at deeper levels and I don't want to use different characters. I know of work-arounds, but I'd like to learn the proper way to split by a character that is not contained within other specific characters.

Since I am encapsulating the callback parameter with parenthesis, I figure there must be a clean regular expression I can use to split the main trigger string by all colons NOT INSIDE parenthesis.

Sadly, I haven't been able to come up with the right expression to get this done.

Any ideas?

I greatly appreciate any assistance any of you may have. :)

SikoSoft
  • 584
  • 1
  • 7
  • 13
  • Key question to determine if this can be parsed using regex: can parentheses be nested? – cobbal Apr 23 '11 at 18:12
  • @cobbal: I'd say yes, quoting "I imagine there will be times where I will want to nest lots of triggers at deeper levels". – Christian Semrau Apr 23 '11 at 18:31
  • @ChristianSemrau Ah, I seem to have missed that bit. – cobbal Apr 23 '11 at 18:38
  • An off-topic advice: Use [JSON](http://json.org). Your life will be easier. – Zirak Apr 23 '11 at 18:38
  • @Zirak Using JSON wouldn't make any difference here - JSON would merely replace XML as the structural container, and the data I am working on is a string (and shall remain so). If I was structuring the trigger as XML (which for various reasons I won't) then it might be relevant. And for the record, I disagree on the battle of JSON vs XML; the extra bloat of XML is a small price to pay for a data container that is structurally human-readable at first glance (especially since I manually write these data files for now). – SikoSoft Apr 23 '11 at 18:55
  • Thanks everyone for their suggestions & input - I'll admit I was somewhat unrealistic in my hope that there might be a one-line expression, so I'll stop wasting time on trying to figure out what doesn't seem to be possible and just write a blasted parsing wrapper. Thanks for the examples / recommendations. - Appreciated! – SikoSoft Apr 23 '11 at 18:57
  • Oh, I suspect in javascript you could write it as a one liner, probably using a ?: and two lambdas. But it would be write-only code.... I have to admit I find JSON representation at least as easy to read as XML, but whatever floats your boat. – Charlie Martin Apr 24 '11 at 03:27

3 Answers3

1

I suspect you can't, at least if there's any chance of nested parentheses, since recognizing correct parenthesis-nesting is not regular.

In any case, instead of constructing some baroque regular expression, consider a very simple parser: scan to the next occurrence of either ":" or "(", and do something with the next token. Repeat. It would be easy to do with with recursive descent, and would look something like

parse(string)
   if string is empty: return
   scan to delimiter, put delimiter index into d, token string into t
   put t into a table for processing later
   case on d:
      string[d] == ":": parseColonToken(string[d+1:])
      string[d] == "(": parseParentString(strin[d+1:])
   end
end

(Obviously this is pseudocode. Take string[n:] as "the substring of string from index n to the end.)

probably, thinking about it, you'd simply start with parseColonToken but I'm not sure if that matches your expected grammar.

Charlie Martin
  • 110,348
  • 25
  • 193
  • 263
  • I get the feeling now that you can't either when I spent quite a while writing expressions that failed, and then the first two of you seem quite sure that this isn't even a regular expression scenario. I appreciate the example code - I'll play around with writing a parser of some kind. I think I had my hopes up for a one-liner a bit unrealistically. >_ – SikoSoft Apr 23 '11 at 18:48
  • assuming the parens can be nested, then it's a theorem that you can't. Honestly, that recursive descent thing is so simple I've been thinking of other uses for it all afternoon. – Charlie Martin Apr 24 '11 at 03:25
0

There is a good reason why you were unable to find a regular expression for your problem:

The language you describe is not regular, i.e. it cannot be parsed with a regular expression.

Basically, you have to parse the parenthesis structure in order to determine the colons which are outside of all parentheses. This is not possible with a regular expression.

The language of nested parenthesis is context-free [1], so it straight-forward to write a recursive parser.

[1] http://en.wikipedia.org/wiki/Context-free_language

ADDITION: You don't need a recursive parser, a simple counter for the parenthesis nesting level is enough:

// Pseudo code
int depth = 0;
List<int> breakIndices;
for int index = 0 .. input.length-1:
  switch(input[index])
    ':': if (depth==0) breakIndices.add(index); break;
    '(': depth++; break;
    ')': depth--; break;
    default: break;
// Now, all indices of the colons you need are in the breakIndices list.
Christian Semrau
  • 8,913
  • 2
  • 32
  • 39
  • Thanks for the insight & the example approach. I'll admit that maybe I was wishful thinking in that a one-liner could pull it off, so I'll play around with your example (and the other provided here) and get something together. - again, I appreciate your explanation of why this isn't even a regular expression scenario. – SikoSoft Apr 23 '11 at 18:46
  • Indeed. But I know of no evidence for an extended regex engine that can parse the language of nested parentheses. – Christian Semrau Apr 23 '11 at 19:56
0

I think the easiest approach would be to break the string into the "function" part and the "argument" part and then deal with the two parts separately. If you want to keep the parentheses on the argument part, then:

var parts1 = "shop:open:1:(npc:self:dialog:1:4)".split(/:(?=\()/);
// parts1 now looks like ["shop:open:1", "(npc:self:dialog:1:4)"]
var parts2 = "shop:open:1".split(/:(?=\()/);
// parts2 now looks like ["shop:open:1"]

And then:

var cmd = null;
var arg = null;
if(parts.length > 0) {
    cmd = parts[0].split(':');
    arg = (parts[1] || '').replace(/[()]/g, '').split(':');
}

You might be able to cram more of that into a single regex (and possibly all of it depending on what non-regular features your target regex engine supports) but there's not much point and clarity is a better goal for your code than "short". Anyone looking at the above should be able to figure out what it is doing if they have a decent JavaScript regex reference in hand.

If you end up dealing with more complicated expressions with quoting and escaping and such, then you could try modifying a CSV parser to do what you need.

Community
  • 1
  • 1
mu is too short
  • 426,620
  • 70
  • 833
  • 800