I want to write a regex that splits a protocol string into (up to) 4 sections separated by ;. My problem is that a ; between quotation marks should be ignored: I don't want a group to terminate at a ; that sits inside quotes. How do I do that?

This is what I've got so far:

((?:.)+?;)

Example input:

_-₪* #,##0.00_-;-₪* #,##0.00_-;" ; asd"_-₪* "-"??_-;_-@_-

It should return:

group1 - _-₪* #,##0.00_-;
group2 - -₪* #,##0.00_-;
group3 - " ; asd"_-₪* "-"??_-;
group4 - _-@_-

Thanks

alonkh2
  • You might do that like this `"[^"*]*"|([^;]+)` https://regex101.com/r/aUf7Vm/1 but why should `group3 - " ; asd"_-₪* "-"??_-;` be a match as there is `;` in the string? – The fourth bird May 22 '22 at 14:32
  • because I want to ignore the `;` in the string, that's the point. – alonkh2 May 22 '22 at 14:35
  • Maybe just `(?:"[^"]*"|[^;])+`? See [this regex demo](https://regex101.com/r/5UHlFX/1). – Wiktor Stribiżew May 22 '22 at 14:46
  • This sounds like the sort of thing where regex will torture you (because it's not designed for this sort of thing) and where you'd be better off writing a simple parser to split things out according to your own logic. – Bobulous May 22 '22 at 14:53
  • My Java IDE isn't set up here, so I can't knock together an example, but my suggestion is this: a simple loop iterating through each character, keeping track of a boolean called `insideQuotes` which is set to true on finding a quote and to false on finding the next (closing) quote. When you find a semicolon and `insideQuotes` is false, add the index to an ArrayList. Once the character loop has finished, use substring to pull out the pieces around each semicolon index. Customisable, and probably more efficient than engaging the regex engine. – Bobulous May 22 '22 at 15:02
  • Thanks all, Wiktor's solution seems to work – alonkh2 May 23 '22 at 09:02
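
For reference, the character loop Bobulous describes above could look roughly like this in Java. This is a minimal sketch, not code from the thread: the class name `QuoteAwareSplitter` and the method `split` are hypothetical, and it assumes quotes are plain `"` characters with no escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the name is illustrative, not from the thread.
class QuoteAwareSplitter {

    // Splits input on ';' characters that are not inside double quotes.
    // Assumption: quotes cannot be escaped, so every '"' toggles the state.
    static List<String> split(String input) {
        List<Integer> cuts = new ArrayList<>();
        boolean insideQuotes = false;

        // First pass: record the index of every semicolon outside quotes.
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
            } else if (c == ';' && !insideQuotes) {
                cuts.add(i);
            }
        }

        // Second pass: take the substrings between the recorded indexes.
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int cut : cuts) {
            parts.add(input.substring(start, cut));
            start = cut + 1;
        }
        parts.add(input.substring(start)); // text after the last semicolon
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("a;\"b;c\";d")) {
            System.out.println(part);
        }
    }
}
```

Unlike a regex built on `+`, this version keeps empty sections, so two adjacent semicolons produce an empty string between them. The returned pieces do not include the separating `;`.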

1 Answer

You can use

(?:"[^"]*"|[^;])+

See the [regex demo](https://regex101.com/r/5UHlFX/1).

Details

  • (?: - start of a non-capturing group:
    • " - a " char
    • [^"]* - any zero or more (*) chars other than a " char ([^...] is a negated character class)
    • " - a " char
  • | - or
    • [^;] - any char other than a ; char
  • )+ - end of the non-capturing group, repeat one or more times (+).
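
In Java, one way to apply the pattern is to collect every match with `Matcher.find()`. The following is a minimal sketch under assumptions: the class name `QuotedSections` and the method `sections` are illustrative only, and the input is the example from the question with the shekel sign written as a Unicode escape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name is illustrative, not part of any library.
class QuotedSections {

    // Either a complete quoted run or any single char other than ';',
    // repeated one or more times.
    private static final Pattern SECTION =
            Pattern.compile("(?:\"[^\"]*\"|[^;])+");

    // Returns each match of the pattern, in order of appearance.
    static List<String> sections(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = SECTION.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // \u20aa is the shekel sign (U+20AA) from the question's input.
        String input = "_-\u20aa* #,##0.00_-;-\u20aa* #,##0.00_-;"
                + "\" ; asd\"_-\u20aa* \"-\"??_-;_-@_-";
        for (String section : sections(input)) {
            System.out.println(section);
        }
    }
}
```

On the question's input this yields four sections. The matches do not include the trailing `;` shown in the expected output of the question, and because the group is repeated with `+`, empty sections (two adjacent semicolons) produce no match.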
Wiktor Stribiżew