TLDR: I'm trying to capture everything outside of quotation marks, but this regex fails in Java: \"|"(?:\"|[^"])*"|([^\"]+), even though it works on websites such as http://myregexp.com/. Can anyone point out what I'm doing wrong?

Hi, I'm currently trying to analyse a .java source file and extract, as strings, everything outside quotation marks (ignoring escaped quotes).

For example, in this string:

This should be captured "not this" and "not \"this\" either".

Using Pattern and Matcher, I should be able to find "This should be captured", "and", ".".

What I currently have is \"[^\"]+\"|([^\"]+), which works well when every quote in the document is paired, but breaks as soon as there is an escaped one.

On an online regex tester, I tried \"|"(?:\"|[^"])*"|([^\"]+), which seems to do exactly what I'm looking for, but when I try it in Java it doesn't work.

VLAZ
Beerbossa
  • Try `String[] res = s.split("\\s*\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"\\s*");` – Wiktor Stribiżew Jan 26 '17 at 07:22
  • See https://ideone.com/NgMozq. – Wiktor Stribiżew Jan 26 '17 at 07:39
  • This works well, thanks a lot! I'll try to understand how the regex works and apply it to comments in source code as well (such as /* */, /** **/ and // \n). – Beerbossa Jan 26 '17 at 17:01
  • Show us your actual Java code -- and ideally a failing test method too. – slim Jan 26 '17 at 17:12
  • Note that the regex for `/*...*/` like comments in Java is [posted by me here](http://stackoverflow.com/a/36328890/3832970). – Wiktor Stribiżew Jan 26 '17 at 17:16
  • I made this one before reading your comment, to cover the multi-line comment case. Sorry for asking so many questions, but would you explain what's wrong with this version? It seems to work fine and is shorter, but I assume there are some cases that I forgot to consider. `"/\\**(?:[\\S\\s]*?)\\*/"` – Beerbossa Jan 26 '17 at 17:49
  • *Is shorter* is no argument in the regex world when you have to match strings of arbitrary length. Use your pattern and once your code freezes, you will switch to my version. Lazy (and greedy) quantifiers [take their toll](http://stackoverflow.com/questions/35759287/why-are-greedy-quantifiers-less-expensive-than-lazy-quantifiers). [Unrolling the loop](http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop) is a lifesaver. – Wiktor Stribiżew Jan 26 '17 at 18:22
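
To illustrate the unrolled style discussed in these comments, here is a minimal sketch of a block-comment stripper. The pattern is the widely used unrolled form for `/* ... */` comments and is an assumption here, not necessarily the exact regex from the linked answer; the class name `BlockComments` and the method `strip` are hypothetical.

```java
import java.util.regex.Pattern;

class BlockComments {
    // Unrolled form: "/*", then non-stars, then one or more stars,
    // then repeated (a char that is neither '/' nor '*', non-stars, stars), then "/".
    // No lazy quantifier is needed, so the match proceeds without extra backtracking.
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/");

    // Removes every /* ... */ and /** ... */ comment from the given text.
    static String strip(String src) {
        return BLOCK_COMMENT.matcher(src).replaceAll("");
    }

    public static void main(String[] args) {
        String src = "int a; /* c1 */ int b; /** c2\n * more **/ int c;";
        System.out.println(strip(src));
    }
}
```

Note that this sketch does not know about string literals, so a `/*` inside a quoted string would still be treated as the start of a comment.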

2 Answers

For your current task, you can split the string with a pattern that matches double-quoted string literals:

String[] res = s.split("\\s*\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"\\s*");

See the Java demo:

String s = "This should be captured \"not this\" and \"not \\\"this\\\" either\".";
String[] res = s.split("\\s*\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"\\s*");
System.out.println(Arrays.toString(res));
// => [This should be captured, and, .]

Pattern details:

  • \\s* - 0+ whitespaces
  • \" - a double quote
  • [^\"\\\\]* - 0+ chars other than " and \
  • (?:\\\\.[^\"\\\\]*)* - 0+ sequences of:
    • \\\\. - a \ and any char other than line break chars
    • [^\"\\\\]* - 0+ chars other than " and \
  • \"\\s* - a " and 0+ whitespaces
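
If a Pattern/Matcher loop is preferred over split, as the question describes, the same string-literal pattern can be placed in an alternation with a capturing group for the text outside quotes. The following is a minimal sketch under that assumption; the class name `OutsideQuotes` and the method `extract` are hypothetical, and the trimming of whitespace is a choice made here to mirror the `\\s*` parts of the split pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutsideQuotes {
    // Alternative 1 consumes a whole string literal, escaped characters included.
    // Alternative 2 captures a run of non-quote characters into group 1.
    private static final Pattern P = Pattern.compile(
            "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|([^\"]+)");

    static List<String> extract(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(s);
        while (m.find()) {
            // group(1) is null when the match was a quoted literal.
            String outside = m.group(1);
            if (outside != null && !outside.trim().isEmpty()) {
                out.add(outside.trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "This should be captured \"not this\" and \"not \\\"this\\\" either\".";
        System.out.println(extract(s));
    }
}
```

Because the literal alternative is tried first at each position, quoted text never reaches group 1; only matches where `group(1)` is non-null are kept.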
Wiktor Stribiżew
  • Thanks, this was pretty much exactly what I was looking for! It helped me to understand the syntax better and I can now try to build my own for the remaining steps that I need. – Beerbossa Jan 26 '17 at 17:50
0
String s = "This should be captured \"not this\" and \"not \\\"this\\\" either\".";
String[] res = s.split("\"([^\"]*)\"");
System.out.println(Arrays.toString(res));

This is a comparatively shorter regex, though it does not handle escaped quotes.
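
To see how the two split patterns in this thread behave on the sample string, the sketch below runs both side by side. The class name `SplitComparison` and its members are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

class SplitComparison {
    // Pattern that skips escaped characters inside the quoted literal.
    static final String ESCAPE_AWARE = "\\s*\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"\\s*";
    // Shorter pattern that stops at the first quote it sees, escaped or not.
    static final String SIMPLE = "\"([^\"]*)\"";

    static String[] splitWith(String regex, String s) {
        return s.split(regex);
    }

    public static void main(String[] args) {
        String s = "This should be captured \"not this\" and \"not \\\"this\\\" either\".";
        System.out.println(Arrays.toString(splitWith(ESCAPE_AWARE, s)));
        System.out.println(Arrays.toString(splitWith(SIMPLE, s)));
    }
}
```

With the shorter pattern, the escaped quote ends a match early, so the fragment `this\` from inside the literal ends up in the result, and surrounding whitespace is not trimmed.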

Ela Singh