I have this regex that tries to find individual STEP lines and divide each of them into three groups: reference number, class and properties:

#14=IFCEXTRUDEDAREASOLID(#28326,#17,#9,3657.6);

becomes

[['14'], ['IFCEXTRUDEDAREASOLID'], ['#28326,#17,#9,3657.6']]

Sometimes these lines have arbitrary line breaks, especially among the properties, so I put some \s in the regex. This, however, makes for an interesting bug: the pattern now captures TWO rows in every match.

How can I adjust the regex to catch only one row, even if the row has line breaks? And just out of curiosity, why does it stop after the second line instead of continuing until the last line?

mottosson
  • It seems to me you wanted to use something like [`#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(([\s\S]*?)\);`](https://regex101.com/r/RHIu0r/3). Or [`^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(([\s\S]*?)\);$`](https://regex101.com/r/RHIu0r/4) (with multiline option). – Wiktor Stribiżew Jan 18 '17 at 09:21
  • @WiktorStribiżew Thank you so much! This seems to work. Add answer and I'll accept it as correct. Bonus points for speed :) – mottosson Jan 18 '17 at 09:26
  • Yes, sure, I will add explanations. – Wiktor Stribiżew Jan 18 '17 at 09:32
  • I think using \S is a bit overkill – Mustofa Rizwan Jan 18 '17 at 09:33
  • @Maverick_Mrt: That is not "overkill", `[\s\S]*?` / `(?s:.*?)` is the correct way (one of) to match an unknown string up to the first occurrence of a multicharacter delimiter. There is a way to make it more efficient by unrolling it, but usually, people get scared when they see lookaheads inside quantified groups. – Wiktor Stribiżew Jan 18 '17 at 09:39
  • @mottosson: Can the values be wrapped in double quotes, too? Can there be escaped quotes inside? – Wiktor Stribiżew Jan 18 '17 at 11:40
  • String properties are always wrapped in single quotes. They can contain single quotes as part of the string, but each one is escaped by another single quote so that the string does not end prematurely, like so: 'this is a ''string''<- two single quotes'. – mottosson Jan 18 '17 at 12:08

2 Answers

The reason you now match two lines every time is that \s matches any whitespace, including line breaks, so if there are line breaks after a matched line, \s* will grab them all.
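A minimal Python sketch of that effect. The pattern here is a hypothetical stand-in, not the asker's original regex (which is not shown), and the two entity names are made up:

```python
import re

two_lines = "#1=IFCA(1);\n#2=IFCB(2);"

# Hypothetical pattern with a trailing \s*: after the ";" has been matched,
# \s* goes on to consume the line break, so the match reaches into the next row.
m = re.match(r"#\d+=\w+\(.*?\);\s*", two_lines)
consumed = m.group(0)

print(repr(consumed))
```

The matched text ends with the newline, not with the semicolon, which is how a whitespace-tolerant pattern can drift across row boundaries.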

Use

/^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);/gm

See this regex demo

Details:

  • ^ - start of a line
  • # - a hash symbol
  • (\d+) - Group 1: one or more digits
  • \s*=\s* - a = sign surrounded by optional whitespace
  • ([a-zA-Z0-9]+) - Group 2 capturing 1+ alphanumerics
  • \s*\( - 0+ whitespaces and a (
  • ((?:'[^']*'|[^;'])+) - Group 3 capturing one or more repetitions of either a '...' substring ('[^']*', with no ' allowed inside) or (|) any single char other than ; and ' ([^;'])
  • \); - a ); sequence
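For readers working in Python, the pattern above could be used roughly as follows. This is a sketch under assumptions: `re.MULTILINE` stands in for the `m` flag, `finditer` stands in for `g`, and the `parse_step` helper and the sample text are made up for illustration (the first line is from the question, the other two are adapted from lines quoted in the comments).

```python
import re

# Same pattern as above; re.MULTILINE makes ^ match at the start of every line.
STEP_LINE = re.compile(
    r"^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);",
    re.MULTILINE,
)

# Illustrative input: a plain row, a row wrapped over two lines,
# and a row with semicolons inside a quoted string.
sample = (
    "#14=IFCEXTRUDEDAREASOLID(#28326,#17,#9,3657.6);\n"
    "#15=IFCOWNERHISTORY(#89024,#44585,$,\n"
    ".NOCHANGE.,$,$,$,11907208);\n"
    "#2=IFCSPACE(';;;',#1,$);\n"
)


def parse_step(text):
    """Return one (reference, class, properties) tuple per matched entity."""
    return [m.groups() for m in STEP_LINE.finditer(text)]


entities = parse_step(sample)
for ref, cls, props in entities:
    print(ref, cls, repr(props))
```

On this sample the sketch yields three tuples, one per entity. The wrapped row keeps its embedded line break inside the properties string, so strip or normalize whitespace afterwards if that matters.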

The negated character class solution suggested by Maverick_Mrt is good for specific cases, but once the text that ([\s\S]*?) would capture contains the negated char, the match will fail.

Wiktor Stribiżew
  • If we consider patterns, I think it is not that hard to consider that ';' won't exist inside the bracket. – Mustofa Rizwan Jan 18 '17 at 09:40
  • BTW, an unrolled version is [`/^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(([^)]*(?:\)(?!;$)[^)]*)*)\);$/gm`](https://regex101.com/r/ebOdJL/1) – Wiktor Stribiżew Jan 18 '17 at 09:40
  • @Maverick_Mrt: I prefer generalized approaches. We do not know if the semi-colon is always absent. It looks like some Excel(?) formula, and in some locales a semi-colon *is* used there. – Wiktor Stribiżew Jan 18 '17 at 09:42
  • #1=IFCOWNERHISTORY(#89024,#44585,$,.NOCHANGE.,$,$,$,11907208);\n90); this will fail if I assume anything can be there as per your approach.... https://regex101.com/r/RHIu0r/8 – Mustofa Rizwan Jan 18 '17 at 09:43
  • @Maverick_Mrt That is not a valid STEP-line. Semicolon always ends a line except for when they are enclosed in a string ';;;;'. And the 90);-part should be ignored. – mottosson Jan 18 '17 at 09:48
  • As per the accepted regex, quote is not considered – Mustofa Rizwan Jan 18 '17 at 09:49
  • @mottosson: Ok, if the `#1=IFCOWNERHISTORY(#89024;#44585;$;.NOCHANGE.;$;$;$;11907208);` is not a valid STEP-line, Maverick's approach might turn out more suitable for your scenario. – Wiktor Stribiżew Jan 18 '17 at 09:52
  • Since the regex has to be able to capture semicolons inside apostrophes, it will break if there is a line with a string property like this: #2=IFCSPACE(';;;',#1,$); – mottosson Jan 18 '17 at 10:08
  • @mottosson: I edited the answer to account for cases when no escape sequences are allowed in your input. However, if there are any escape sequences allowed, you might need a more complex [`/^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^'\\]*(?:\\.[^'\\]*)*'|\\.|[^;'])+)\);/gm`](https://regex101.com/r/RHIu0r/11). – Wiktor Stribiżew Jan 18 '17 at 11:52

You can try this:

#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(([^;]*)\);

Your updated link
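A short Python sketch of the trade-off behind the negated-class idea. It assumes a capturing group around the properties and both letter cases in the class name, and the two sample rows are adapted from lines quoted in the comments:

```python
import re

# Negated-class idea: everything up to the first ";" counts as properties.
SIMPLE = re.compile(r"#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(([^;]*)\);")

wrapped = "#15=IFCOWNERHISTORY(#89024,#44585,$,\n.NOCHANGE.,$,$,$,11907208);"
quoted = "#2=IFCSPACE(';;;',#1,$);"

# [^;] also matches line breaks, so a wrapped row is handled without any \s.
wrapped_match = SIMPLE.search(wrapped)

# A ";" inside a quoted string stops [^;]* early, and no ")" follows there,
# so the row is not matched at all.
quoted_match = SIMPLE.search(quoted)

print(wrapped_match.groups())
print(quoted_match)
```

So this simpler pattern is enough as long as no string property contains a semicolon. Once that can happen, the quote-aware alternation from the other answer is the safer choice.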

Mustofa Rizwan