
I know that regular expressions don't support a variable number of capture groups, but since there seems to be a way to do this in C#, I'm asking whether there is any way to make it work in Ruby. I don't have deep knowledge of Ruby, so I can't work this out myself.

If it is not possible, is there a way to change my logic so I can get what I want?

What I want to do is parse the bezier information in SVG files.

Here is my regex:

/(C)\s*(?:(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s]{1}){5,}/

Here is an example of the SVG:

<path d="M 15.43,29.45 C 23.73,28.89 38,25.96 44.2,25.42 46.48,25.22 47.41,27 47.16,29.29 46.59,34.67 45.5,46 44.14,53.63"/>
<path d="M 16.91,41.07 C 19.61,40.8 36.25,38.5 45.64,37.7"/>
<path d="M70.28,15.94c1.21,1.21,1.24,2.32,1.24,3.97c0,7.59-0.01,55.22-0.01,60.22C71.5,91,73,92.23,83.94,92.23c10.31,0,11.56-1.73,11.56-8.68"/>
<path d="M72.67,56.84c0.04,0.3,0.08,0.77-0.07,1.19C-0.9,2.52-6.07,8.03-13.15,11.41"/>

A bezier can have 6*n points. My regex matches the C and 5 successive points (I don't need the 6th), repeating if there are more than 6. When I match it like this, the capture group only gives me the last point it matched (the 5th) instead of all of them.
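To make the problem concrete, here is a minimal sketch that runs the regex from above against the second sample path. The variable names `re`, `d` and `m` are only illustrative.

```ruby
# The regex from the question, unchanged.
re = /(C)\s*(?:(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s]{1}){5,}/

d = "M 16.91,41.07 C 19.61,40.8 36.25,38.5 45.64,37.7"
m = re.match(d)

# Group 2 sits inside the repeated (?:...){5,}, so every repetition
# overwrites it and only the value from the final repetition survives.
p m.captures
```

The match itself spans all five points, but `m.captures` holds just two entries: the `C` and the last point that the repeated group matched.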

So, is there a feature in Ruby that lets me avoid overwriting the group on every repetition?

If not, is there another way to match every point of a variable-length bezier? I could just repeat the point-matching part of the regex 100 times to cover most real-world cases, but that would be silly and difficult to work with.

My Ruby version is 1.9.3; updating would be no problem as long as it doesn't break compatibility.
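One alternative shape, sketched here under assumptions rather than taken from any answer: split the work into two passes, first isolating the argument string of each uppercase `C` command, then scanning that string for numbers and grouping them in sixes. The names `NUMBER`, `CURVE_ARGS` and `absolute_curve_points` are hypothetical, and the number pattern is the one from the regex above (so forms such as `.5` without a leading digit are not handled).

```ruby
# Number pattern taken from the question's regex.
NUMBER = /-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?/

# Everything after an uppercase C up to the next SVG path command letter.
# The command letters are listed explicitly so that the "e"/"E" of an
# exponent such as 1e5 does not end the argument string.
CURVE_ARGS = /C([^MmZzLlHhVvCcSsQqTtAa]*)/

# Returns one array of up to 6 floats per curve segment of each absolute
# C command in the path data string d. Relative "c" commands are ignored.
def absolute_curve_points(d)
  d.scan(CURVE_ARGS).flat_map do |(args)|
    args.scan(NUMBER).map(&:to_f).each_slice(6).to_a
  end
end

d = "M 15.43,29.45 C 23.73,28.89 38,25.96 44.2,25.42 46.48,25.22 47.41,27 47.16,29.29 46.59,34.67 45.5,46 44.14,53.63"
absolute_curve_points(d).each { |segment| p segment }
```

Because the numbers are found by scanning rather than by splitting on separators, a minus sign stays attached to its number even when it doubles as the separator (as in `2.52-6.07`). Only `String#scan`, `flat_map` and `each_slice` are used, all of which should be available on 1.9.3.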

sollniss
  • Although there is a regex way to extract them all (see [this demo](http://rubular.com/r/E12xILDahL)), I think there must be a simpler non-regex one. – Wiktor Stribiżew Mar 20 '17 at 21:44
  • You probably want to use an actual SVG parser rather than some intense regular expression. Any parser that performs this function will have code you can, at the very least, repurpose for your application. – tadman Mar 20 '17 at 21:45
  • Just a small warning I think is worth mentioning: "A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data" (as noted on [regex101](https://regex101.com/r/VKl2d6/1)) – jjspace Mar 20 '17 at 21:45
  • To have the coordinates as pairs, see this [**demo on regex101.com**](https://regex101.com/r/NlR90A/1/) – Jan Mar 20 '17 at 21:53
  • @WiktorStribiżew this does not use groups and also does not work on 1.9.3 according to the website. – sollniss Mar 20 '17 at 21:55
  • @Jan this does not match all SVGs. There can be multiple notations for the points (x, y, x, y or x y x y or x y, x y). – sollniss Mar 20 '17 at 21:56
  • @sollniss: Please post some more examples then. – Jan Mar 20 '17 at 21:58
  • @sln: This is exactly what @Wiktor and I were proposing. However, the version `1.9.3` does not seem to have it. – Jan Mar 20 '17 at 22:08
  • Here you go, 5 matched groups http://rubular.com/r/T6gEXwg124 –  Mar 20 '17 at 22:12
  • @Jan - Another _impossible_ situation. –  Mar 20 '17 at 22:13
  • @sln: Not with your's apparently - chapeau! – Jan Mar 20 '17 at 22:16
  • @sln I need all matches. Do any of you guys know a good parser so I can try this without regex? – sollniss Mar 20 '17 at 22:18
  • Here you go, a hybrid ( ver 1.93 ) in 5 group chunks http://rubular.com/r/jOtPuyOTSp –  Mar 20 '17 at 22:20
  • @sln Thank you that is close to what I want. Just one mistake that the 6th element is counted as the first element of the 2nd 5er group. 1 2 3 4 5 6 7 8 9 10 11 12 is matched as [1 2 3 4 5][6 7 8 9 10]. – sollniss Mar 20 '17 at 22:24
  • Don't want 6th ? That's doable. Skipping the 6th http://rubular.com/r/pa3MdJ8Zj5 –  Mar 20 '17 at 22:25
  • @sln I don't need it, but it wouldn't matter if is it contained in the group. So groups of 6 would be fine as well. – sollniss Mar 20 '17 at 22:27
  • That site - the edit box won't store over a certain amount of characters when made a permalink, just paste it into the box for an impromptu match. –  Mar 20 '17 at 22:41

2 Answers

Your example doesn't make it clear why you need the C in the regex. Why is that, exactly? Is there some other place where you can have 6+ points in a row?

Would something like this work?

(?:[\.\d]+\,[\.\d]+\s*?){5,5}

https://regex101.com/r/0VtdjW/1

roberto tomás
  • `{5,5}` is redundant - `{5}` will suffice. Additionally, dots do not need to be escaped within a character class. `\s*?` is redundant as well. So your expression comes down to: `(?:[.\d]+\,[.\d]+\s*){5}` – Jan Mar 20 '17 at 22:02
  • I only want to match the Cs because I need to transform them. Small c's or other points are not transformed. Your regex matches the whole group, but I need to transform each point separately; that's why I need them separately. I also added more paths as examples. – sollniss Mar 20 '17 at 22:08
  • Thanks Jan. You are right about everything, more or less, though I _always_ escape periods because that way they are visually distinct from match wildcards. I don't think I will stop just because it is not essential, especially since it does not change the regex. – roberto tomás Mar 21 '17 at 03:18

This one uses the \G construct and works for version 1.9.3 on Rubular.
Each match grabs the next 5 points and skips the 6th; matching repeatedly walks through all the points that follow a C.

(?:(?!^)\G[-,\s]|C)\s*(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[-,\s](-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)(?:[-,\s]-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)?

Explained

 (?:
      (?! ^ )                       # Not BOS
      \G                            # Start where last match left off to get next 5  pts.
      [-,\s]                        # required separator
   |                             # or,
      C                             # C - the start of a block of pts.
 )
                               # The first/next 5 pts. captured
 \s* 
 (                             # (1 start)
      -? \d+ 
      (?: \. \d+ )?
      (?: [eE] [+-]? \d+ )?
 )                             # (1 end)
 [-,\s] 
 (                             # (2 start)
      -? \d+ 
      (?: \. \d+ )?
      (?: [eE] [+-]? \d+ )?
 )                             # (2 end)
 [-,\s] 
 (                             # (3 start)
      -? \d+ 
      (?: \. \d+ )?
      (?: [eE] [+-]? \d+ )?
 )                             # (3 end)
 [-,\s] 
 (                             # (4 start)
      -? \d+ 
      (?: \. \d+ )?
      (?: [eE] [+-]? \d+ )?
 )                             # (4 end)
 [-,\s] 
 (                             # (5 start)
      -? \d+ 
      (?: \. \d+ )?
      (?: [eE] [+-]? \d+ )?
 )                             # (5 end)

 (?:                           # Skip the 6th pt.
      [-,\s] 
      -? \d+ 
      (?: \. \d+ )?
      (?: [eE] [+-]? \d+ )?
 )?
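
As a sketch of how this pattern might be driven from Ruby, `String#scan` can apply it repeatedly, with `\G` anchoring each attempt at the end of the previous match. The names `num`, `sep` and `chunk` are only illustrative; the pattern is the one above, assembled from two pieces to keep it readable.

```ruby
# Building blocks copied from the regex above.
num = /-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?/
sep = /[-,\s]/

# Either continue where the previous match ended (\G) or start at a C,
# then capture 5 points and optionally skip a 6th.
chunk = /(?:(?!^)\G#{sep}|C)\s*(#{num})#{sep}(#{num})#{sep}(#{num})#{sep}(#{num})#{sep}(#{num})(?:#{sep}#{num})?/

d = "M 15.43,29.45 C 23.73,28.89 38,25.96 44.2,25.42 46.48,25.22 47.41,27 47.16,29.29 46.59,34.67 45.5,46 44.14,53.63"

# scan returns one array of 5 captured strings per match.
d.scan(chunk).each { |points| p points }
```

One thing to be aware of: because `-` is part of the separator class, a minus sign that doubles as a separator (for example `2.52-6.07`) is consumed as the separator, so the captured point comes back without its sign.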