
I am trying to extract some useful data (placeholders with specific parameters) from text, some of which is raw text and some of which is XML.

The useful parts are delimited by one of these: $, %, [], {}

The examples below use $ and show the different kinds of content that I'm interested in.

 $EX1$                       -> EX1
 $EX2(a$b$c)$                -> EX2, (, a$b$c
 $EX3(abc\x/)$               -> EX3, (, abc\x/
 $EX4(\@\,/&/)$              -> EX4, (, \@\,/&/
 $EX5/X(Z)Y/$                -> EX5, /, X(Z)Y
 $EX6/X(ABC)/1$              -> EX6, /, X(ABC), 1
 $EX7/X\\Z\/Y/$              -> EX7, /, X\\Z\/Y
 $EX8/(A)/(B)/$              -> EX8, /, (A), (B)
 $EX9/(\\$A$)\//(\\$B$\/)/$  -> EX9, /, (\\$A$)\/, (\\$B$\/)

The first part is the placeholder name, optionally followed by parameters of the form (...), /.../, /.../xx, or /.../.../, where xx is a number and ... can be anything.

I've built the following regex, which almost does the job. Is there a way to improve it, or is there another approach altogether? It must be compatible with the .NET regex engine.

\$
(?=[^$]{3,100}\$)
(?<PH>[A-Za-z0-9:_-]{1,20})
(?:
  (?<C1>\/)
  (?<RX>(?:[^\\\/\r\n]|\\\/?)*)
  \/
  (?:
    (?<R>(?:[^\\\/\r\n$]|\\[\/$]?)*)
    \/
    |
    (?<G>\d*)
  )
  |
  (?:
    (?<C2>\()
    (?<F>(?:[^\t\r\n\f()]|\\[()]?)*)
    \)
  )?
)
\$


Alexandre A.
  • Looks like `/` and `()` are also delimiters in your example, is that so? It also seems to be `multiline` from your code? And the regex must accept escaping via \, right? – CSᵠ Jan 23 '15 at 12:16
  • @CSᵠ Yes, () and / are the "second level" delimiters. The regex must accept escaping via "\". What seems to be multiline? The content between the delimiters cannot contain "\n" characters (if it does, the $placeholder(...)$ must be ignored), but the text itself can. – Alexandre A. Jan 23 '15 at 14:16

1 Answer


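Here is an "improved" version of the regex that uses balancing groups for () and {}. The capture groups are named "ph", "FirstSep", "value1", "value2", and "value3" for ease of testing (rename them as you like); "number" is only the counter used for balancing. As laid out across several lines, the pattern needs RegexOptions.IgnorePatternWhitespace: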

\$
(?=[^$]{3,100}\$)
(?<ph>[\w\:\-]+)
(?:(?<FirstSep>[\/\(\{])(?<value1>
    (?>
        [^{}()]+ 
        |    [\(\{] (?<number>)
        |    [\)\}] (?<-number>)
    )*
    (?(number)(?!))
)
[\)\}]
)?
(?:(?<FirstSep>/)
     (?<value2>
          \d+  |
          [^/\r\n\\]*(?>\\.[^/\r\n\\]*)*
      )?
)?
(?:/
     (?<value3>[^/\r\n\\]*(?>\\.[^/\r\n\\]*)*
      )?
)?
/?
\$

Here, you can see that it now captures sub-groups enclosed in {} or ():

$EX2(a($b)$c)$          --> EX2, (, a($b)$c 
$EX3{a({bc})\x/}$       --> EX3, {, a({bc})\x/
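
For readers unfamiliar with balancing groups, the counting that `(?<number>)` and `(?<-number>)` perform can be sketched outside regex. The Python below is only an illustration under stated assumptions: `balanced_body` is a hypothetical helper, not part of any library, and like the pattern above it treats `(` and `{` as interchangeable openers rather than checking that each closer matches the type of its opener.

```python
OPENERS = "({"
CLOSERS = ")}"

def balanced_body(text, start):
    """Return the text between the opener at text[start] and its matching
    closer, or None if the brackets never balance (hypothetical helper)."""
    if start >= len(text) or text[start] not in OPENERS:
        return None
    depth = 0
    for i in range(start + 1, len(text)):
        ch = text[i]
        if ch in OPENERS:
            depth += 1            # plays the role of (?<number>): push
        elif ch in CLOSERS:
            if depth == 0:        # nothing left to pop: this closer ends the body
                return text[start + 1:i]
            depth -= 1            # plays the role of (?<-number>): pop
    return None                   # ran out of text with brackets still open

sample = "$EX2(a($b)$c)$"
print(balanced_body(sample, sample.index("(")))  # a($b)$c
```

The final `return None` corresponds loosely to the match failing when no closing bracket can be found for the outer opener.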

Nice info on matching delimited strings with escaped delimiters inside: Finding quoted strings with escaped quotes in C# using a regular expression.
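
The `[^/\r\n\\]*(?>\\.[^/\r\n\\]*)*` fragment used for value2 and value3 follows that idea: consume runs of ordinary characters, and treat a backslash plus the character after it as a single unit. Here is a minimal sketch in Python, used only for illustration. The names `SEGMENT` and `first_segment` are made up, and a plain non-capturing group stands in for the atomic group `(?>...)`, which affects backtracking but not what is matched on these inputs.

```python
import re

# Body of a /.../ segment: runs of ordinary characters, interleaved with
# backslash escapes (a backslash followed by any one non-newline character).
SEGMENT = re.compile(r"/([^/\r\n\\]*(?:\\.[^/\r\n\\]*)*)/")

def first_segment(text):
    """Return the content of the first /.../ segment, or None (hypothetical helper)."""
    m = SEGMENT.search(text)
    return m.group(1) if m else None

print(first_segment(r"$EX5/X(Z)Y/$"))    # X(Z)Y
print(first_segment(r"$EX7/X\\Z\/Y/$"))  # X\\Z\/Y
```

Because an escaped `/` is consumed together with its backslash, it never terminates the segment, which is why `X\\Z\/Y` comes back whole.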

Wiktor Stribiżew