I need a regex to split a string on commas, but the split must only match commas that are not inside quotes. My current attempts give:

should be 3: 3 (right)
should be 3: 14 (wrong, it counted the commas inside quotes)
should be 24: 12 (wrong)
should be 24: 24 (right)

These results come from the following test case:

String line ="com.day.image;uses:=\"javax.imageio.stream,javax.imageio.spi,javax.imageio.plugins.jpeg,org.slf4j,javax.imageio.metadata,javax.imageio,com.day.imageio.plugins,com.day.image.font\",com.day.imageio.plugins;uses:=\"javax.imageio,javax.imageio.metadata,javax.imageio.stream,javax.imageio.spi,org.w3c.dom\",com.day.image.font;uses:=\"com.day.image\"";

        String[] results1 = line.split("\",");
        String[] results2 = line.split(",");

        System.out.println("should be 3: "+ results1.length);
        System.out.println("should be 3: "+ results2.length);

        line = "com.day.cq.commons,com.day.cq.commons.inherit,com.day.cq.wcm.api,com.day.cq.wcm.api.components,com.day.cq.wcm.api.designer,com.day.cq.wcm.commons,com.day.cq.wcm.tags,com.day.cq.widget,javax.servlet,javax.servlet.http,javax.servlet.jsp;version=\"2.1\",javax.servlet.jsp.el;version=\"2.1\",javax.servlet.jsp.jstl.core,javax.servlet.jsp.jstl.fmt,javax.servlet.jsp.tagext;version=\"2.1\",org.apache.commons.lang;version=\"2.4\",org.apache.sling.api;version=\"2.1\",org.apache.sling.api.request;version=\"2.1\",org.apache.sling.api.resource;version=\"2.1\",org.apache.sling.api.scripting;version=\"2.1\",org.apache.sling.api.servlets;version=\"2.1\",org.apache.sling.scripting.jsp.taglib;version=\"2.0\",org.apache.sling.scripting.jsp.util;version=\"2.0\",org.slf4j;version=\"1.5\"";

        results1 = line.split("\",");
        results2 = line.split(",");

        System.out.println("should be 24: "+ results1.length);
        System.out.println("should be 24: "+ results2.length);

The output is:

should be 3: 3
should be 3: 14
should be 24: 12
should be 24: 24

UPDATED

I understood what I needed, but I didn't know how to do it, and my explanation of what I was trying to accomplish wasn't the best. A badly defined problem rarely leads to solutions. One of my strengths is simplifying complex scenarios, but obviously not tonight.

After searching, I refined my question. Google search term: "How do I match a character outside of quotes?"

It is well known that Google's first results are usually what you are looking for, if you ask Google the right question ;).

First result: "Regex to pick commas outside of quotes".

The regular expression would be this: (,)(?=(?:[^"']|["|'][^"']*")*$).

Tested, and it worked.

Finally, I assume there is a difference between programming skills and understanding skills; many programmers do not have both. I asked in several places, and most people said it was not possible. Apparently it is.

Thanks for your time, and sorry for rushing to get help.

This site is GREAT! :)

UPDATE2

This regex, (,)(?=(?:[^"']|["|'][^"']*")*$), is giving me a StackOverflowError!

at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)

Apparently it works fine for some inputs but not others! Or is the Java Regex engine buggy?

UPDATE3

This regex does not overflow and works (Java-escaped): "(,)(?=(?:[^\"]|\"[^\"]*\")*$)"
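
For reference, here is a minimal sketch of how that pattern could be used with split. The class and method names are made up for illustration, and the input is a shortened stand-in for the real header. The lookahead only accepts a comma when everything after it up to the end of the line consists of unquoted characters and complete quoted sections, so it assumes the quotes are balanced. Because the group still alternates on every character, very long lines may still exhaust the stack.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class SplitOutsideQuotes {
    // The pattern from UPDATE3: a comma followed only by unquoted
    // characters and complete "..." sections up to the end of the input.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile("(,)(?=(?:[^\"]|\"[^\"]*\")*$)");

    static String[] split(String line) {
        // Java's split does not add captured groups to the result.
        return COMMA_OUTSIDE_QUOTES.split(line);
    }

    public static void main(String[] args) {
        // Shortened sample input, assumed to have balanced quotes.
        String line = "a;uses:=\"x,y,z\",b;uses:=\"p,q\",c;uses:=\"r\"";
        for (String part : split(line)) {
            System.out.println(part);
        }
    }
}
```

With this sample input the two commas that follow a closing quote are the only split points, so three parts come back.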

yxz97
  • 1
    What is your expected result? – FThompson Dec 07 '12 at 03:16
  • Wanting `line.split("\\\"");` or some such? – Eugene Ryabtsev Dec 07 '12 at 03:18
  • 1
    Those are the correct results. Maybe if you'd explain what you're trying to do. – Edward Falk Dec 07 '12 at 03:19
  • I updated my question, you are right apologized! – yxz97 Dec 07 '12 at 03:30
  • Hm, that is a clever solution. You have one `|` in the wrong place, though--character classes (`[]`) shouldn't have bars unless they're actually part of the character class. It looks like you copied and pasted from the first comment on the answer you linked to, rather from the answer itself--but subsequent comments point out that that expression has errors. – Kyle Strand Dec 07 '12 at 05:29
  • `(,)(?=(?:[^"]|"[^"]*")*$)(?=(?:[^']|'[^']*')*$)` should work if you want to check single-quotes as well. Though that's a bit of a hack. – Kyle Strand Dec 07 '12 at 05:30
  • java.util.regex.PatternSyntaxException: Unclosed group near index 46 (,)(?=(?:[^"]|"[^"]*")*$)(?=(?:[^']|'[^']*')*$ – yxz97 Dec 07 '12 at 05:42
  • Another updated, this regular expression avoids the overflow! "(,)(?=(?:[^\"]|\"[^\"]*\")*$)" – yxz97 Dec 07 '12 at 06:12
  • Should be two parentheses at the end (before the `*$`) in my suggestion. Sorry. – Kyle Strand Dec 07 '12 at 07:10

1 Answer

2

Regex is not good for keeping track of whether something is "inside" or "outside" quotes, brackets, parentheses, etc; the best way to do this, therefore, might be to go through the string character by character, with a flag keeping track of whether or not the current character is inside a set of quotation marks (this flag would start false and switch on and off as quotation marks are encountered).
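
A minimal sketch of that character-by-character scan might look like the following. The class and method names are hypothetical, and it assumes double quotes are the only quoting character and that they are never escaped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class QuoteAwareSplitter {
    static List<String> splitOutsideQuotes(String line) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false; // starts false, flips at every quote

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
                current.append(c);
            } else if (c == ',' && !insideQuotes) {
                // A comma outside quotes ends the current part.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString()); // whatever follows the last split
        return parts;
    }

    public static void main(String[] args) {
        String line = "a;version=\"1,2\",b,c;version=\"3\"";
        System.out.println(splitOutsideQuotes(line).size());
    }
}
```

Unlike String.split, this sketch keeps trailing empty parts, so a line that ends in a comma yields a final empty string.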

However, if you're sure you want to do this with regex, I would recommend first splitting the string by quotation marks (intermediate = line.split("\"");), then splitting each element in the intermediate list by commas, and then concatenating the results back together. The concatenation step will be a little tricky, since you'll want to combine the last element of each array with the first element of the next, separating them with a quotation mark.
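
One way this could be sketched is shown below. The names are hypothetical, and it makes two assumptions that go beyond the description above: quotes are balanced, and only the segments that fall outside quotes are split on commas, while quoted segments are glued back on whole with their quotation marks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class SplitThenRejoin {
    static List<String> splitOutsideQuotes(String line) {
        // Limit -1 keeps trailing empty segments, so index parity stays reliable:
        // even indexes are outside quotes, odd indexes are inside.
        String[] segments = line.split("\"", -1);
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (int i = 0; i < segments.length; i++) {
            if (i % 2 == 1) {
                // Quoted segment: restore the quotes and keep it in one piece.
                current.append('"').append(segments[i]).append('"');
                continue;
            }
            String[] pieces = segments[i].split(",", -1);
            // The first piece continues whatever came before this segment.
            current.append(pieces[0]);
            for (int j = 1; j < pieces.length; j++) {
                result.add(current.toString());
                current = new StringBuilder(pieces[j]);
            }
        }
        result.add(current.toString());
        return result;
    }

    public static void main(String[] args) {
        String line = "a;version=\"1,2\",b,c;version=\"3\"";
        System.out.println(splitOutsideQuotes(line));
    }
}
```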

Another possibility: first split the string by quotation marks, then replace each occurrence of a comma in the odd-numbered segments with some character sequence that appears nowhere else in the string (such as $split$), but leave the even-numbered segments (that is, those that represent quoted sections) alone. Recombine the segments into a single string (re-inserting the quotation marks between each pair of segments, of course), then split the string by instances of $split$.
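
The placeholder idea could be sketched roughly as follows. The names are hypothetical, the code assumes balanced quotes and that $split$ really does not occur anywhere in the input, and it needs Java 8 or later for String.join. Note that arrays are zero-based, so the "odd-numbered" unquoted segments sit at even indexes.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class PlaceholderSplit {
    // Assumed never to occur in the input.
    private static final String PLACEHOLDER = "$split$";

    static String[] splitOutsideQuotes(String line) {
        // Limit -1 keeps trailing empty segments so index parity stays reliable.
        String[] segments = line.split("\"", -1);
        for (int i = 0; i < segments.length; i += 2) {
            // Even indexes are outside quotes: mark their commas.
            // String.replace treats both arguments literally.
            segments[i] = segments[i].replace(",", PLACEHOLDER);
        }
        // Put the quotation marks back between the segments.
        String marked = String.join("\"", segments);
        // split takes a regex, so the placeholder must be quoted.
        return marked.split(Pattern.quote(PLACEHOLDER));
    }

    public static void main(String[] args) {
        String line = "a;version=\"1,2\",b,c;version=\"3\"";
        for (String part : splitOutsideQuotes(line)) {
            System.out.println(part);
        }
    }
}
```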

Kyle Strand
  • Are you sure this cannot be done with regex? What is the best way to loop over a string in Java? I thought regex would help with performance and programming time. Anyway, thanks! – yxz97 Dec 07 '12 at 03:59
  • As far as I know, regex generally doesn't show performance gains over explicitly crafted solutions for the particular problem you're trying to address; the speed gains are primarily in programming time, not execution time. Anyway, I don't think a single regex pattern will help you here no matter what, but I did add an alternate solution to my answer that you might find preferable. As for how to loop through the characters in a Java string, I'd recommend using a simple `for` loop with `charAt()`...though to be honest I'm not entirely sure how I'd use that construct. – Kyle Strand Dec 07 '12 at 04:09
  • Yeah, unfortunately there's no really good way to scan through a string in Java. Try writing a parser in it someday -- yuk. Also, regexes are notoriously slow in many cases like this one. – Edward Falk Dec 07 '12 at 04:18