
Yes I know this has been asked a lot but there is no solution I've found for what exactly I'm trying to do. So please allow me to explain what my problem is.

I need to find a way to tokenize a string on ',', '.', white space, and quoted sections, without applying the other regex rules to anything between the quotes.

Allow this '[]' to represent a single space for these examples.

Suppose I have a string like this:

ADD[]r2,[]r3

Now with a regex like this:

((?<=\s)|(?=\s+))|((?<=,))|(?=\.)

I can split the string like so:

1: ADD
2: []
3: r2,
4: []
5: r3

This is what I want.

Now suppose I have a string like this:

"ADD[]r2,[]r3"[]"foo[]bar"

Now with a regex like this:

(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)

I can split the string like this:

1: "ADD[]r2,[]r3"
2: []
3: "foo[]bar"

But if I had a string like this:

ADD[]r2,[]r3[]"ADD[]r2,[]r3"

And used a regex like this:

(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)((?<=\s)|(?=\s+))|((?<=,))|(?=\.)

I would end up with something like this:

1:ADD
2:[]
3:r2,
4:[]
5:r3
6:[]
7:"ADD[]r2,
8:[]r3"

But what I want is this:

1:ADD
2:[]
3:r2,
4:[]
5:r3
6:[]
7:"ADD[]r2,[]r3"

Is it possible to do this with a regex? Or do I need to do something more complex? What I'm trying to do is basically make a regex to split up code syntax. I just need a way to split up a line like I have described.

Any help or suggestions would be greatly appreciated.

EDIT: Example driver code for what I'm trying to do

    String line = "ADD r2, r3 \"ADD r2, r3\"";
    String[] arrLine = line.substring(0, line.length()).split("(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)((?<=\\s)|(?=\\s+))|((?<=,))|(?=\\.)");

    for (int i = 0; i < arrLine.length; i++) {
        System.out.println(arrLine[i]);
    }
boardkeystown
  • Like this? `"[^"]*"|\h|[^\s"]+` https://regex101.com/r/5y77KP/1 – The fourth bird Feb 11 '21 at 23:23
  • Yes, or [`"[^"]*"|[^\s"]+|\s+`](https://regex101.com/r/Qanl1i/1) – Wiktor Stribiżew Feb 11 '21 at 23:25
  • @Thefourthbird no. See the edit I made to show you how I'm splitting the string. – boardkeystown Feb 11 '21 at 23:42
  • @WiktorStribiżew See edit I made. – boardkeystown Feb 11 '21 at 23:42
  • @boardkeystown Why don't you match it instead? See https://ideone.com/d95sbB – The fourth bird Feb 11 '21 at 23:43
  • You should use an *extraction* approach here, see [Create array of regex matches](https://stackoverflow.com/questions/6020384/create-array-of-regex-matches) – Wiktor Stribiżew Feb 11 '21 at 23:43
  • Regexes aren't very good for problems like this. The next thing you'll probably want is for `\"` to be a non-terminating string char. Then things get _really_ complicated. Imo it's better to give up early and write a finite-automaton-based scanner that will handle all future contingencies easily without solving regex puzzles. – Gene Feb 11 '21 at 23:53
  • @Gene you are right... I would need to account for that case at some point. What do you mean by a finite-automaton-based scanner? I'm going to Google it right after I finish this comment. I know what a finite automaton is but never used it for anything in java. – boardkeystown Feb 12 '21 at 00:09
  • Personally, for something like this I'd write a simple 2-state (IDLE, IN_QUOTE) 4-input (DOT, COMMA, QUOTE, OTHER) DFA and process the input one character at a time. I have a feeling you don't completely know the grammar you're trying to parse, and I'd much rather maintain a DFA than horribly complex regular expressions. (A DFA would be a lot faster too, but that's premature optimization :-) – Jim Garrison Feb 12 '21 at 01:40
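
The two-state scanner suggested in the comments above could look roughly like the following sketch. It is not code from anyone in the thread: the class name `LineScanner`, the method `tokenize`, and the rule that a backslash escapes the next character inside quotes are all assumptions made for illustration. It emits each whitespace character as its own token, cuts after a comma and before a dot, and copies quoted text through untouched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-written scanner; all names here are illustrative. */
final class LineScanner {
    private enum State { IDLE, IN_QUOTE }

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.IDLE;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (state == State.IN_QUOTE) {
                current.append(c);
                if (c == '\\' && i + 1 < line.length()) {
                    // Assumption: a backslash escapes the next character inside quotes.
                    current.append(line.charAt(++i));
                } else if (c == '"') {
                    // Closing quote: the quoted string is one token.
                    flush(tokens, current);
                    state = State.IDLE;
                }
            } else if (c == '"') {
                // Opening quote: finish any pending token, then start the quoted one.
                flush(tokens, current);
                current.append(c);
                state = State.IN_QUOTE;
            } else if (Character.isWhitespace(c)) {
                // Each whitespace character becomes its own token.
                flush(tokens, current);
                tokens.add(String.valueOf(c));
            } else if (c == ',') {
                // A comma ends the token it is attached to.
                current.append(c);
                flush(tokens, current);
            } else if (c == '.') {
                // A dot starts a new token.
                flush(tokens, current);
                current.append(c);
            } else {
                current.append(c);
            }
        }
        // Anything left over (including an unterminated quote) is emitted as-is.
        flush(tokens, current);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        for (String token : tokenize("ADD r2, r3 \"ADD r2, r3\"")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Adding a new rule (comments, other escape sequences, a different quote character) then means adding a branch or a state rather than reworking a lookaround.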

2 Answers


Instead of using split, you can match the tokens directly: either everything from an opening double quote to the closing one, or a run of whitespace characters, or a run of characters that are neither whitespace nor double quotes.

In Java you can use \h to match a horizontal whitespace char, or \s to match any whitespace char, including a newline.

"[^"]*"|\h+|[^\h"]+

Regex demo | Java demo

In Java

String regex = "\"[^\"]*\"|\\h+|[^\\h\"]+";
String string = "ADD r2, r3 \"ADD r2, r3\"";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);

while (matcher.find()) {
    System.out.println(matcher.group(0));
}

Output

ADD
 
r2,
 
r3
 
"ADD r2, r3"
The fourth bird
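
The matching pattern in the answer above keeps something like `r2,r3` as a single token. One possible variant, not part of the original answer, adds alternatives so that a token ends after a comma and a new one starts at a dot, which is what the split in the question was aiming for. The class name `TokenMatcher` and the method `tokens` are made up for this sketch, and it assumes quotes are balanced and contain no escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the names are illustrative only. */
final class TokenMatcher {
    private static final Pattern TOKEN = Pattern.compile(
            "\"[^\"]*\""            // a complete quoted string
            + "|\\h+"               // a run of horizontal whitespace
            + "|\\.[^\\h\",.]*,?"   // a token starting at a dot, optionally ending with a comma
            + "|[^\\h\",.]+,?"      // any other token, optionally ending with a comma
            + "|,");                // a comma on its own

    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(line);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : tokens("ADD r2,r3 \"ADD r2, r3\"")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Note that a stray, unmatched double quote is silently skipped by `find()` here, so input with unbalanced quotes would need extra handling.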
  • The reason I need the split is because I have a larger algo that requires all of those things to be split by white space and account for a case if there is not white space. Also yours would not handle the case if "ADD r2,r3" b/c r2,r3 would be 1 split and not 2. – boardkeystown Feb 12 '21 at 00:11
  • @boardkeystown This `"ADD r2,r3" b/c r2,r3` contains 2 spaces not between double quotes. Why should there be 1 split and where should the split be? – The fourth bird Feb 12 '21 at 00:15
  • You could extend it, for example, to not match newlines between the double quotes and to match an escaped double quote within the double quotes. https://regex101.com/r/4Nu2Qu/1 – The fourth bird Feb 12 '21 at 00:35
  • I want to thank you for the suggestion. However doing it this way did not work out for my needs. – boardkeystown Feb 19 '21 at 03:25
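
If a split is still required, one likely reason the combined pattern in the question cuts inside the quotes is alternation precedence: `|` binds more loosely than concatenation, so the "even number of quotes ahead" lookahead only guards the whitespace alternative, while `(?<=,)` and `(?=\.)` apply everywhere. A minimal sketch of a fix is to wrap all the boundary alternatives in one group after the lookahead. The class name `QuoteAwareSplit` is hypothetical, and the sketch assumes balanced quotes with no escaped quotes inside them.

```java
import java.util.Arrays;

/** Hypothetical helper; the names are illustrative only. */
final class QuoteAwareSplit {
    // True only where an even number of double quotes remains, i.e. outside a quoted string.
    private static final String OUTSIDE_QUOTES = "(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // All split points grouped together so the lookahead above guards every one of them:
    // after whitespace, before whitespace, after a comma, before a dot.
    private static final String BOUNDARY = "(?:(?<=\\s)|(?=\\s)|(?<=,)|(?=\\.))";

    static String[] split(String line) {
        return line.split(OUTSIDE_QUOTES + BOUNDARY);
    }

    public static void main(String[] args) {
        String line = "ADD r2, r3 \"ADD r2, r3\"";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

The lookahead rescans the rest of the line at every candidate position, so this is quadratic in the line length; that is usually fine for single lines of source but worth keeping in mind for long inputs.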

When I see a problem like this in general I immediately think to break it down into two or more simpler problems. The other thing that occurs to me is that your problem may get more complicated. It might be worth thinking about ANTLR here.

Jonathan Locke
  • You are right I needed to break it down into smaller steps. I had to write my own tokenizer to do exactly what I wanted. – boardkeystown Feb 19 '21 at 03:20