I have a CSV file with data from multiple variables, and I would like to separate them. My file looks like this:

VARIABLE: GRP1.SGRP1.100:VAR1

Timestamp (LOCAL_TIME),Value
2018-07-18 13:52:09.100,25582
2018-07-18 13:52:49.900,24048
VARIABLE: GRP1.SGRP2.102:VAR1

Timestamp (LOCAL_TIME),Value
2018-07-18 13:52:09.100,25582
2018-07-18 13:52:49.900,24048

And I would like to split it on every occurrence of the substring "VARIABLE", producing two strings:

VARIABLE: GRP1.SGRP1.100:VAR1

Timestamp (LOCAL_TIME),Value
2018-07-18 13:52:09.100,25582
2018-07-18 13:52:49.900,24048

and

VARIABLE: GRP1.SGRP2.102:VAR1

Timestamp (LOCAL_TIME),Value
2018-07-18 13:52:09.100,25582
2018-07-18 13:52:49.900,24048

Something like VARIABLE[^V]+ almost works, but the match needs to stop at the next occurrence of VARIABLE, and I cannot figure out how to do that. Thanks

1 Answer

There are two approaches: splitting and matching.

Splitting is the easier approach, since the pattern is simply (?!^)(?=VARIABLE). There is one caveat: MATLAB's regexp ignores zero-length matches by default (the noemptymatch option is the default), so you need to pass the emptymatch option to the regexp function for the split to work:

splitStr = regexp(str,'\s*(?!^)(?=VARIABLE)','split', 'emptymatch')

Output:

splitStr = 
{
  [1,1] = VARIABLE: GRP1.SGRP1.100:VAR1

Timestamp (LOCAL_TIME),Value
2018-07-18 13:52:09.100,25582
2018-07-18 13:52:49.900,24048

  [1,2] = VARIABLE: GRP1.SGRP2.102:VAR1


Timestamp (LOCAL_TIME),Value
2018-07-18 13:52:09.100,25582
2018-07-18 13:52:49.900,24048
}

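The (?!^)(?=VARIABLE) pattern matches any location in the string that is not at the start of the string but is immediately followed by a VARIABLE substring. The \s* prefix used in the code above additionally consumes any whitespace before VARIABLE, so the line break preceding each header is not left at the end of the previous chunk.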
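To see the effect of the emptymatch option, here is a minimal sketch that runs the zero-width split with and without it and then pulls the variable name out of each chunk. The sample text is built inline (without a trailing line break) so the snippet is self-contained. The fileread call mentioned in the comment, the data.csv file name, and the header pattern '^VARIABLE:\s*(\S+)' are illustrative assumptions, not part of the original question.

```matlab
% Sample text built inline so the sketch is self-contained. In practice it
% could be read with fileread('data.csv'); that file name is hypothetical.
str = sprintf(['VARIABLE: GRP1.SGRP1.100:VAR1\n\n' ...
               'Timestamp (LOCAL_TIME),Value\n' ...
               '2018-07-18 13:52:09.100,25582\n' ...
               '2018-07-18 13:52:49.900,24048\n' ...
               'VARIABLE: GRP1.SGRP2.102:VAR1\n\n' ...
               'Timestamp (LOCAL_TIME),Value\n' ...
               '2018-07-18 13:52:09.100,25582\n' ...
               '2018-07-18 13:52:49.900,24048']);

% Zero-width split point before every VARIABLE except the one at the start.
pat = '(?!^)(?=VARIABLE)';

withEmpty    = regexp(str, pat, 'split', 'emptymatch');
withoutEmpty = regexp(str, pat, 'split');   % default is 'noemptymatch'

% Compare how many chunks each call produces.
fprintf('chunks with emptymatch:    %d\n', numel(withEmpty));
fprintf('chunks without emptymatch: %d\n', numel(withoutEmpty));

% Pull the variable name out of each chunk's header line
% (the header pattern is an assumption based on the sample data).
names = cell(size(withEmpty));
for k = 1:numel(withEmpty)
    tok = regexp(withEmpty{k}, '^VARIABLE:\s*(\S+)', 'tokens', 'once');
    names{k} = tok{1};
end
disp(names);
```

With emptymatch the sample should yield one chunk per VARIABLE header. Without it, the zero-length split points are skipped, so the text is expected to come back unsplit.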

Alternatively, you may match VARIABLE and then any number of characters that are either not V, or a V that is not followed by ARIABLE:

matchStr = regexp(str,'VARIABLE[^V]*(?:V(?!ARIABLE)[^V]*)*','match')

See the regex demo.

Details

  • VARIABLE - a VARIABLE substring
  • [^V]* - a negated character class matching 0 or more chars other than V
  • (?:V(?!ARIABLE)[^V]*)* - zero or more consecutive occurrences of
    • V - a V char that is
    • (?!ARIABLE) - ... not followed by ARIABLE
    • [^V]* - 0 or more chars other than V.

Note that it matches the same text as VARIABLE(?:(?!VARIABLE).)* (a tempered greedy token) or VARIABLE.*?(?=VARIABLE|$) (a lazy dot pattern with a plain positive lookahead), but it is more efficient because it follows the unroll-the-loop principle. (In MATLAB regex, . matches any char including newlines, so no additional flags are needed when using these two patterns.)
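As a quick sanity check of that equivalence, the sketch below runs all three patterns against a shortened version of the sample data and compares the results. The shortened sample (one data row per variable, no trailing line break) and the names unrolled, tempered, and lazy are illustrative assumptions. The check only shows that the patterns agree on this input and says nothing about their relative speed.

```matlab
% Shortened sample: one data row per variable, no trailing line break.
str = sprintf(['VARIABLE: GRP1.SGRP1.100:VAR1\n\n' ...
               'Timestamp (LOCAL_TIME),Value\n' ...
               '2018-07-18 13:52:09.100,25582\n' ...
               'VARIABLE: GRP1.SGRP2.102:VAR1\n\n' ...
               'Timestamp (LOCAL_TIME),Value\n' ...
               '2018-07-18 13:52:49.900,24048']);

% Unrolled pattern: non-V runs, plus any V not followed by ARIABLE.
unrolled = regexp(str, 'VARIABLE[^V]*(?:V(?!ARIABLE)[^V]*)*', 'match');

% Tempered greedy token: consume one char at a time unless VARIABLE starts here.
tempered = regexp(str, 'VARIABLE(?:(?!VARIABLE).)*', 'match');

% Lazy dot: expand as little as possible up to the next VARIABLE or the end.
lazy = regexp(str, 'VARIABLE.*?(?=VARIABLE|$)', 'match');

fprintf('matches: %d, %d, %d\n', numel(unrolled), numel(tempered), numel(lazy));
fprintf('all identical: %d\n', isequal(unrolled, tempered, lazy));
```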

Wiktor Stribiżew