
This sample data is returned by a web service:

200,6, "California, USA"

I want to split it using split(",") and checked the result with this simple code:

String loc = "200,6,\"California, USA\"";       
String[] s = loc.split(",");

for(String f : s)
   System.out.println(f);

Unfortunately, this is the result:

200
6
"California
 USA"

The expected result is:

200
6
"California, USA"

I tried different regular expressions with no luck. Is it possible to write a regular expression that ignores commas inside double quotes?

UPDATE 1: Added C# Code

UPDATE 2: Removed C# Code

  • Do you expect to see more than one quoted item on the same line? – Sergey Kalinichenko Feb 04 '13 at 03:47
  • Hmmm. Only sentence/words inside of `" "` –  Feb 04 '13 at 03:49
  • possible duplicate of [Parsing CSV input with a RegEx in java](http://stackoverflow.com/questions/1441556/parsing-csv-input-with-a-regex-in-java) – assylias Feb 04 '13 at 03:53
  • In C#, you should be using `string.Split`, not `Regex.Split`. In any case, your desired result can't be achieved with the split function (in either language) - reading the documentation for those functions, you won't see any indication that they respect quotation marks or other textual conventions. – prprcupofcoffee Feb 04 '13 at 04:03

4 Answers

,(?=(?:[^"]|"[^"]*")*$)

This is the regex you want. (To pass it to the split function, you'll need to escape the double quotes inside the Java string literal.)

Explanation

You need to find every ',' that is not inside quotes. That requires a lookahead (http://www.regular-expressions.info/lookaround.html) to check whether the comma currently being matched lies inside or outside quotes.

The lookahead ensures that the current ',' is followed by an EVEN number of '"' characters, which means it lies outside quotes (provided the quotes in the input are balanced).

So (?:[^"]|"[^"]*")*$ matches, up to the end of the string, any sequence of non-quote characters and complete pairs of quotes with anything but a quote between them.

(?=(?:[^"]|"[^"]*")*$) looks ahead for the above match without consuming it.

,(?=(?:[^"]|"[^"]*")*$) finally matches every ',' that satisfies the above lookahead.
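
As a minimal sketch of how the escaped pattern might be used from Java, the regex can be compiled once and reused. The class name QuoteAwareSplit and the helper splitOutsideQuotes are hypothetical names chosen for illustration, and the sketch assumes the quotes in the input are balanced and never nested or escaped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class QuoteAwareSplit {
    // A comma followed, up to the end of the input, only by non-quote
    // characters and complete "..." pairs. Each " is escaped as \" in Java.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?:[^\"]|\"[^\"]*\")*$)");

    // Hypothetical helper: splits one line on commas that lie outside quotes.
    static List<String> splitOutsideQuotes(String line) {
        return Arrays.asList(COMMA_OUTSIDE_QUOTES.split(line));
    }

    public static void main(String[] args) {
        for (String field : splitOutsideQuotes("200,6,\"California, USA\"")) {
            System.out.println(field);
        }
    }
}
```

For the sample input this yields three fields, with the last one still wrapped in its quotes: 200, 6 and "California, USA".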

  • Even number of quotes ahead does not necessarily mean ""outside of quotes"" (assuming quotes can be nested like brackets). as an example, see the previous sentence. – G. Bach Feb 04 '13 at 04:20
  • You would allow `"sdfdsf"sdfsdf"sdfsdf"sdfsdf"sdf"` as a token, but is it even valid CSV? – nhahtdh Feb 04 '13 at 10:20

An easier solution might be to use an existing library, such as OpenCSV, to parse your data. With this library it takes two lines:

CSVParser parser = new CSVParser();
String[] data = parser.parseLine(inputLine);

This becomes especially important if you get more complex CSV values back in the future (multiline values, values with escaped quotes inside an element, etc.). If you don't want to add the dependency, you could always use their code as a reference (though it is not based on regex).
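
If adding the dependency is not an option, a character-by-character scan is one possible non-regex approach. The following is only a rough sketch and is not OpenCSV's code: the class name SimpleCsvLine and the method parseLine are made up for illustration, and it assumes a single-line record where a doubled quote ("") inside a quoted field stands for one literal quote.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleCsvLine {
    // Hypothetical helper: splits one CSV line, honouring double quotes.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: keep one literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are ordinary characters here
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);        // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        for (String field : parseLine("200,6,\"California, USA\"")) {
            System.out.println(field);
        }
    }
}
```

Unlike the regex split, this sketch strips the surrounding quotes, so the third field comes back as California, USA.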

JohnnyO

If there's a good lexer/parser library for Java, you could define a lexer like the following pseudo-lexer code:

Delimiter: ,
Item: ([^,"]+) | ("[^"]*")
Data: Item Delimiter Data | Item 

A lexer starts at the top-level token definition (in this case Data) and attempts to form tokens out of the string until it cannot continue or until the string is used up. For your string, the following would happen:

  • I want to make Data out of 200,6,"California, USA".
  • I can make Data out of an Item, a Delimiter and Data.
  • I looked - 200 is an Item and then , is a Delimiter, so I can tokenize that and keep going.
  • I want to make Data out of 6,"California, USA".
  • I can make Data out of an Item, a Delimiter and Data.
  • I looked - 6 is an Item and then , is a Delimiter, so I can tokenize that and keep going.
  • I want to make Data out of "California, USA".
  • I can make Data out of an Item, a Delimiter and Data.
  • I looked - "California, USA" is an Item, but I see no Delimiter after it, so let's try something else.
  • I can make Data out of an Item.
  • I looked - "California, USA" is an Item, so I can tokenize that and keep going.
  • The string is empty. I'm done. Here are your tokens.

(I learned about how lexers work from the guide to PLY, a Python lexer/parser: http://www.dabeaz.com/ply/ply.html )
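
The walkthrough above could be sketched in plain Java without a lexer library. This is an illustrative assumption rather than the output of any real generator: the class name TinyCsvGrammar and the method parseData are hypothetical, the right-recursive rule "Data: Item Delimiter Data | Item" is unrolled into a loop, and a quoted Item is assumed to allow commas inside the quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCsvGrammar {
    // Item: an unquoted run without commas or quotes,
    // or a quoted run that may contain commas.
    private static final Pattern ITEM = Pattern.compile("[^,\"]+|\"[^\"]*\"");

    // Hypothetical helper: returns the Items of a Data string, in order.
    static List<String> parseData(String input) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        int pos = 0;
        while (true) {
            // Data always starts with an Item at the current position.
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Expected an Item at index " + pos);
            }
            items.add(m.group());
            pos = m.end();
            if (pos == input.length()) {
                return items; // Data: Item (nothing left to read)
            }
            if (input.charAt(pos) != ',') {
                throw new IllegalArgumentException("Expected ',' at index " + pos);
            }
            pos++; // Data: Item Delimiter Data (continue with the rest)
        }
    }

    public static void main(String[] args) {
        for (String item : parseData("200,6,\"California, USA\"")) {
            System.out.println(item);
        }
    }
}
```

Because an Item needs at least one character, this sketch rejects empty fields such as the middle one in 200,,6 instead of returning an empty string.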

Patashu

Try this expression:

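public class Test {

    public static void main(String[] args) {
        String loc = "200,6,\"Paris, France\"";
        // Split only on commas outside double quotes: each matched comma must be
        // followed by non-quote characters and complete quoted pairs up to the end.
        String[] str1 = loc.split(",(?=(?:[^\"]|\"[^\"]*\")*$)");

        for (String tmp : str1) {
            System.out.println(tmp);
        }
    }
}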
Abin Manathoor Devasia