Regular Expressions Count in Lookbehind

Question

I have a string that can look like this: 'a, b, (c, d, (e, f), g), (h, i)', and I want to split it only at the commas on the outermost level:

a b (c, d, (e, f), g) (h, i)

I can't figure out how to do this. My idea is to find the commas that have the same number of opening and closing brackets before them. How can I implement this with regular expressions?

Best Regards

Not sure it will work in Matlab - https://regex101.com/r/4JUPOX/1 — Wiktor Stribiżew, Dec 20 '16 at 19:56
Seems that you need to match nested parentheses. Maybe this question will help? http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns — cjaube, Dec 20 '16 at 20:02
@WiktorStribiżew unfortunately no. Matlab does not seem to like the recursion — Dustin Lehmann, Dec 20 '16 at 21:03

rahnema1 · Answer 1 · 2016-12-21T09:12:34.483

A solution without regex:

a = 'a, b, (c, d, (e, f), g), (h, i)';
a(cumsum((a=='(')-(a==')'))==0 & a==',')=';'
out = strsplit(a, ';')

result:

{
  [1,1] = a
  [1,2] =  b
  [1,3] =  (c, d, (e, f), g)
  [1,4] =  (h, i)
}

We can find the nesting level of each character using

cumsum((a=='(')-(a==')'));

The resulting array of nesting levels:

0000001111111222221111000111110

So, for example, the first 6 characters 'a, b, ' are at level 0, and so on.
We only want the characters that are at level 0:

cumsum((a=='(')-(a==')'))==0

They must also be commas:

cumsum((a=='(')-(a==')'))==0 & a==','

Set all commas at level 0 to ';':

a(cumsum((a=='(')-(a==')'))==0 & a==',')=';'

Finally, split the string:

strsplit(a, ';')
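
The same steps can be applied row by row to a cell array like the one described in the question. The following is only a sketch: the variable names `rows`, `parts` and `out` are made up for illustration, and it assumes that every row is wrapped in one outer pair of parentheses, that ';' never occurs in the data, and that all rows have the same number of top-level fields.

```matlab
% Sketch: apply the nesting-level trick to every row of a cell array.
% Assumptions: one outer pair of parentheses per row, no ';' in the data,
% and the same number of top-level fields in every row.
rows = {'(0, 0, 1540.4, (true, (121.96, 5)), 5.7068, 1587.0)';
        '(0, 0, 1537.5, (true, (121.93, 6)), 5.7068, 1587.0)'};

parts = cell(numel(rows), 1);
for k = 1:numel(rows)
    s = rows{k}(2:end-1);                     % drop the outer parentheses
    level = cumsum((s == '(') - (s == ')'));  % nesting level of each character
    s(level == 0 & s == ',') = ';';           % mark the top-level commas
    parts{k} = strtrim(strsplit(s, ';'));     % split and trim each field
end
out = vertcat(parts{:});                      % one row per string, one column per field
disp(out);
```

If the rows could have different numbers of fields, the final `vertcat` would fail and `parts` would have to be kept as a cell array of cell arrays instead.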

gnovice · Answer 2 · 2016-12-20T22:54:27.347

Here are a couple of options:

Option 1: If your data has a consistent pattern of commas and parentheses across rows, you can parse it quite easily with a regex. The downside is that if the pattern changes, you have to change the regex. On the other hand, it is quite fast, even for very large cell arrays:

str = {'(0, 0, 1540.4, (true, (121.96, 5)), 5.7068, 1587.0)';
       '(0, 0, 1537.5, (true, (121.93, 6)), 5.7068, 1587.0)';
       '(0, 0, 1537.5, (true, (121.93, 3)), 5.7068, 1587.0)';
       '(0, 0, 1537.5, (true, (121.93, 4)), 5.7068, 1587.0)';
       '(0, 0, 1537.5, (true, (121.93, 5)), 6.0965, 1587.0)';
       '(0, 0, 1535.2, (true, (121.9, 6)), 6.0965, 1587.0)';
       '(0, 0, 1535.2, (true, (121.9, 3)), 6.0965, 1587.0)';
       '(0, 0, 1535.2, (true, (121.9, 4)), 6.0965, 1587.0)';
       '(0, 0, 1535.2, (true, (121.9, 5)), 6.3782, 1587.0)';
       '(0, 0, 1532.3, (true, (121.87, 6)), 6.3782, 1587.0)'};

tokens = regexp(str, ['^\(([-\d\.]+), ' ... % Column 1
                         '([-\d\.]+), ' ... % Column 2
                         '([-\d\.]+), ' ... % Column 3
                         '(\(\w+, \([-\d\.]+, [-\d\.]+\)\)), ' ... % Column 4
                         '([-\d\.]+), ' ... % Column 5
                         '([-\d\.]+)\)$'], ... % Column 6, then the closing parenthesis
                'tokens', 'once');
str = vertcat(tokens{:});
disp(str);

And the result for this example:

'0'    '0'    '1540.4'    '(true, (121.96, 5))'    '5.7068'    '1587.0'
'0'    '0'    '1537.5'    '(true, (121.93, 6))'    '5.7068'    '1587.0'
'0'    '0'    '1537.5'    '(true, (121.93, 3))'    '5.7068'    '1587.0'
'0'    '0'    '1537.5'    '(true, (121.93, 4))'    '5.7068'    '1587.0'
'0'    '0'    '1537.5'    '(true, (121.93, 5))'    '6.0965'    '1587.0'
'0'    '0'    '1535.2'    '(true, (121.9, 6))'     '6.0965'    '1587.0'
'0'    '0'    '1535.2'    '(true, (121.9, 3))'     '6.0965'    '1587.0'
'0'    '0'    '1535.2'    '(true, (121.9, 4))'     '6.0965'    '1587.0'
'0'    '0'    '1535.2'    '(true, (121.9, 5))'     '6.3782'    '1587.0'
'0'    '0'    '1532.3'    '(true, (121.87, 6))'    '6.3782'    '1587.0'

Note that I used the pattern [-\d\.]+ to match an arbitrary number which may have a negative sign or decimal point.
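
To illustrate what that character class accepts, here is a small check. The names `loose`, `strict` and `candidates` are invented for this sketch, and the stricter pattern is only a hypothetical alternative, not part of the answer. Because `[-\d\.]+` matches any run of digits, dots and minus signs, it also accepts strings such as '1.2.3' that are not valid numbers, which is harmless for well-formed simulation output but worth knowing.

```matlab
% Sketch: compare the answer's number pattern with a stricter one.
loose  = '^[-\d\.]+$';        % pattern from the answer, anchored to the whole string
strict = '^-?\d+(\.\d+)?$';   % hypothetical alternative: optional sign, optional fraction
candidates = {'1540.4', '-5', '0', 'true', '1.2.3'};

% regexp returns an empty char vector for every string that does not match
isLoose  = ~cellfun(@isempty, regexp(candidates, loose,  'match', 'once'));
isStrict = ~cellfun(@isempty, regexp(candidates, strict, 'match', 'once'));
disp([isLoose; isStrict]);
```

Both patterns accept the first three strings and reject 'true'; only the loose one accepts '1.2.3'.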

Option 2: You can use regexprep to repeatedly blank out pairs of parentheses that don't contain other parentheses, replacing them with whitespace so that the string keeps its length. Once no parentheses are left, the only commas remaining in the processed string are the top-level ones, so you can find their positions and break up the original string at those positions. You won't have to change the regex for each new pattern of commas and parentheses. This is a little slower than Option 1, but still takes only a second or two for arrays of up to 15,000 cells:

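% Using raw str from above:
str = cellfun(@(s) {s(2:end-1)}, str);  % Strip the outer parentheses
tempStr = str;
% Blank out the innermost parenthesized groups, keeping string lengths unchanged
modStr = regexprep(tempStr, '(\([^\(\)]*\))', '${blanks(numel($0))}');
while ~isequal(modStr, tempStr)  % Repeat until nothing changes
  tempStr = modStr;
  modStr = regexprep(tempStr, '(\([^\(\)]*\))', '${blanks(numel($0))}');
end

commaIndex = regexp(tempStr, ',');  % Only top-level commas are left
% Cut each original string just before every top-level comma
str = cellfun(@(v, s) {mat2cell(s, 1, diff([1 v numel(s)+1]))}, commaIndex, str);
str = strtrim(strip(vertcat(str{:}), ','));  % Drop the leading commas and spaces
disp(str);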

This gives the same result as Option 1.
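
To see what the repeated blanking does, the following sketch runs it on the single string from the question and prints each pass. The names `s`, `masked`, `prev`, `idx` and `pieces` are made up for this illustration, and the final split is written with plain indexing rather than the mat2cell call used above.

```matlab
% Sketch: trace the blanking passes on one string, then split at the
% commas that survive (these are the top-level ones).
s = 'a, b, (c, d, (e, f), g), (h, i)';
masked = s;
prev = '';
while ~strcmp(masked, prev)
    prev = masked;
    % Replace each innermost (...) group with the same number of blanks
    masked = regexprep(prev, '\([^\(\)]*\)', '${blanks(numel($0))}');
    disp(['[' masked ']']);
end

idx = find(masked == ',');          % positions of the top-level commas
starts = [1, idx + 1];              % first character of each piece
stops  = [idx - 1, numel(s)];       % last character of each piece
pieces = arrayfun(@(a, b) strtrim(s(a:b)), starts, stops, 'UniformOutput', false);
disp(pieces);
```

The first pass blanks '(e, f)' and '(h, i)', the second blanks the now parenthesis-free outer group, and the third changes nothing, which ends the loop. Because the masked string has the same length as the original, the comma positions can be used directly on `s`.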

Thanks for the fast reply! Unfortunately this will take too long. I need to use this on cell arrays that have around 15,000 entries, all with the same comma-bracket structure, but the numbers in between have different lengths. So I cannot do it on just one entry and save the comma positions. I made myself a function to generate regexps for each element, but it takes too long. I hoped to find a simple solution to just cut the string at the right positions — Dustin Lehmann, Dec 20 '16 at 20:38
@DustinLehmann: A smaller version of the data you are actually working with would be helpful. Do you have a large cell array that contains strings that resemble what you have in your question? Perhaps showing us the first 5-10 cell contents would help. — gnovice, Dec 20 '16 at 20:45
cell_array = { '(0, 0, 1540.4, (true, (121.96, 5)), 5.7068, 1587.0)'; '(0, 0, 1537.5, (true, (121.93, 6)), 5.7068, 1587.0)'; '(0, 0, 1537.5, (true, (121.93, 3)), 5.7068, 1587.0)'; '(0, 0, 1537.5, (true, (121.93, 4,)) 5.7068, 1587.0)'; '(0, 0, 1537.5, (true, (121.93, 5,)) 6.0965, 1587.0)'; '(0, 0, 1535.2, (true, (121.9, 6)), 6.0965, 1587.0)'; '(0, 0, 1535.2, (true, (121.9, 3)), 6.0965, 1587.0)'; '(0, 0, 1535.2, (true, (121.9, 4)), 6.0965, 1587.0)'; '(0, 0, 1535.2, (true, (121.9, 5)), 6.3782, 1587.0)'; '(0, 0, 1532.3, (true, (121.87, 6)), 6.3782, 1587.0)'} — Dustin Lehmann, Dec 20 '16 at 20:53
I first need to cut away the first and last character (the opening and closing brackets). After that, I need a new cell array, that has the same amount of rows, but in each column is the cut away portion at the top level comma. So the first row of the new cell array would have the 6 column entries: 0 0 1540.4 (true, (121.96, 5)) 5.7068 1587.0 — Dustin Lehmann, Dec 20 '16 at 20:55
@DustinLehmann: Will each row have the same number of columns, or does that vary? Also, does the pattern of where the commas and parentheses are differ from row to row (it doesn't in the example)? — gnovice, Dec 20 '16 at 21:34
No the number of columns does not vary and the comma and parentheses pattern does not differ, too. This is a time vector of logged signals from a Simulation, which has this weird output format with parentheses for bus signals — Dustin Lehmann, Dec 20 '16 at 21:41
Thanks! Unfortunately my answer was a little bit ambiguous. The comma and parentheses pattern will not vary throughout an array, but it is different for each simulation. But I guess I will write a function that generates the regex from your posted approach for one simulation vector and applies it to the whole cell array — Dustin Lehmann, Dec 20 '16 at 22:31
@DustinLehmann: OK (phew!) I added one more option that is a little slower, but accounts for changes in the pattern between simulations automatically. — gnovice, Dec 20 '16 at 22:55

score 0 · Answer 3 · answered Dec 20 '16 at 22:35

I know the question asks how to implement it with regular expressions, but if you were to parse it character by character, you could simply keep track of the nesting level as you go. Here is a JavaScript snippet to demonstrate how it might be done (https://jsfiddle.net/jf65k0jc/):

var str = 'a, b, (c, d, (e, f), g), (h, i)';
var arr = [];
var buffer = '';
var level = 0;
for (var i = 0; i < str.length; i++) {
  var letter = str[i];

  if (level === 0) {
    if (letter === ',') {
      arr.push(buffer.trim());
      buffer = '';
    }
    else {
      buffer += letter;
      if (letter === '(') {
        level++;
      }
    }
  }
  else {
    buffer += letter;
    if (letter === '(') {
      level++;
    }
    else if (letter === ')') {
      level--;
    }
  }
}
arr.push(buffer.trim());

var output = '';
for (var i = 0; i < arr.length; i++) {
  output += arr[i] + '<br>';
}
$('.output').html(output);

// Outputs:
// a
// b
// (c, d, (e, f), g)
// (h, i)
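
Since the question is about MATLAB, here is a rough port of the same character-by-character idea. It is only a sketch, not the answerer's code: the variable names are invented, and the two branches of the JavaScript loop are merged into one.

```matlab
% Sketch: split at top-level commas by tracking the parenthesis depth.
str = 'a, b, (c, d, (e, f), g), (h, i)';
parts = {};      % collected top-level fields
buffer = '';     % characters of the field currently being built
level = 0;       % current parenthesis depth
for k = 1:numel(str)
    ch = str(k);
    if ch == ',' && level == 0
        parts{end+1} = strtrim(buffer);  % a top-level comma closes the field
        buffer = '';
    else
        buffer(end+1) = ch;
        if ch == '('
            level = level + 1;
        elseif ch == ')'
            level = level - 1;
        end
    end
end
parts{end+1} = strtrim(buffer);          % the last field has no trailing comma
disp(parts);
```

An explicit loop like this is easy to follow, but it visits every character one at a time for every row.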

Hi, thanks for the reply. I already tried it this way, and it works, but it's too slow for a vector with ~ 15.000 entries — Dustin Lehmann, Dec 20 '16 at 22:37
