70

I have a comma-separated file with many lines similar to the one below.

Sachin,,M,"Maths,Science,English",Need to improve in these subjects.

Quotes are used to escape the comma delimiter where a single field holds multiple values.

How do I split the above line on the comma delimiter using String.split(), if that is possible at all?

Maroun
FarSh018

4 Answers

211
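import java.util.Arrays;

public static void main(String[] args) {
    String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
    // Split only on commas followed by an even number of quotes, i.e. commas outside quoted fields.
    String[] splitted = s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
    System.out.println(Arrays.toString(splitted));
}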

Output:

[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]
Achintya Jha
  • 9
    It took me a while to figure out what this regex was doing. It would have helped me greatly to have the explanation that it matches commas that are followed by an even number of quotes (or no quotes). So this works because commas inside quotes (i.e. the ones we don't want to match/split on) have an odd number of quotes between them and the end of the line. It is also worth noting that, I believe, this would not work if the data could have escaped quotes in it. – glyphx Nov 06 '14 at 15:39
  • 3
    Use s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1) if you want to preserve empty strings at the end. http://stackoverflow.com/questions/13939675/java-string-split-i-want-it-to-include-the-empty-strings-at-the-end – kctang Nov 27 '14 at 03:38
  • 1
    Very helpful. I needed to add `?:` to the inner group when doing this in javascript, so the full expression becomes `s.split(/,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)/);` – Marty Neal Jan 21 '16 at 18:25
  • How would I get this to work if some of my fields actually contain quotes in them too? It works for everything, except that it now splits where some of the names in my Customer fields contain " ". An example is: Montgomery County Sheriff's Office KS "Montgomery PD, AL", where it splits at "Montgomery PD, AL" and puts it on its own line, but it should not :/ @glyphx – Ashton May 11 '16 at 17:13
  • 1
    @Ashton That's very strange... You are probably best off posting a new question with full details. Show an entire string you are trying to parse and the pattern you are using and the results. The pattern in this answer should only ever match and split on commas, as far as I understand it. – Ben May 17 '16 at 00:34
  • 1
    Explanation/Visualization of regex https://regexper.com/#(%3F%3D(%5B%5E%5C%22%5D*%5C%22%5B%5E%5C%22%5D*%5C%22)*%5B%5E%5C%22%5D*%24) – mtk Feb 02 '17 at 21:23
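As a rough illustration of two points raised in these comments (the even-quote lookahead and the negative limit), the sketch below applies the pattern from the answer with and without `-1`. The class name `QuoteAwareSplit`, its method names, and the sample line are made up for this example.

```java
import java.util.Arrays;

class QuoteAwareSplit {
    // Pattern from the answer: a comma followed by an even number of quotes
    // (possibly zero) up to the end of the input.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Default String.split: trailing empty strings are dropped.
    static String[] split(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES);
    }

    // Negative limit: trailing empty strings are kept.
    static String[] splitKeepingEmpty(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES, -1);
    }

    public static void main(String[] args) {
        // Hypothetical sample line ending in two empty fields.
        String line = "a,\"b,c\",,";
        System.out.println(Arrays.toString(split(line)));             // [a, "b,c"]
        System.out.println(Arrays.toString(splitKeepingEmpty(line))); // [a, "b,c", , ]
    }
}
```

The comma inside `"b,c"` is followed by a single quote, so the lookahead fails there and no split happens at that position.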
25

Since your requirements are fairly simple, a custom method can do the job. In the measurement below it runs over 20 times faster than the regex and produces the same result. The exact ratio varies with the data size and the number of rows parsed, and for more complicated problems regular expressions remain necessary.

import java.util.ArrayList;
import java.util.Arrays;

public class SplitTest {

    public static void main(String[] args) {

        String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        String[] splitted = null;

        // Measure the regular expression
        long startTime = System.nanoTime();
        for (int i = 0; i < 10; i++) {
            splitted = s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        }
        long endTime = System.nanoTime();

        System.out.println("Took: " + (endTime - startTime));
        System.out.println(Arrays.toString(splitted));
        System.out.println("");

        // Measure the custom method
        ArrayList<String> sw = null;
        startTime = System.nanoTime();
        for (int i = 0; i < 10; i++) {
            sw = customSplitSpecific(s);
        }
        endTime = System.nanoTime();

        System.out.println("Took: " + (endTime - startTime));
        System.out.println(sw);
    }

    public static ArrayList<String> customSplitSpecific(String s) {
        ArrayList<String> words = new ArrayList<String>();
        // True while the current position is outside a quoted field
        boolean outsideQuotes = true;
        int start = 0;
        // Visit every character, including the last one, so a trailing comma still splits
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ',' && outsideQuotes) {
                words.add(s.substring(start, i));
                start = i + 1;
            } else if (s.charAt(i) == '"') {
                outsideQuotes = !outsideQuotes;
            }
        }
        // Add whatever follows the last split point
        words.add(s.substring(start));
        return words;
    }

}

On my own computer this produces:

Took: 6651100
[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]

Took: 224179
[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]
Menelaos
  • 1
    -1 This does not answer the question, which specifically asks for a solution using `String.split()`. As an aside, one of the hallmarks of code written by someone who knows very little about java is the use of `Vector`. – Bohemian Apr 09 '13 at 22:08
  • 10
    Please explain why in this situation it would be more advantageous to use ArrayList instead of Vector (except for the performance hit due to thread safeness). Also, your politeness could use some work, which is one of the hallmarks of someone who is rude. – Menelaos Apr 09 '13 at 22:11
  • 2
    I wasn't being rude; merely factual. Here's a little tip... `Vector` is *not* threadsafe. It's a broken class, which is why no one, and I really do mean *no one*, uses it in the real world. Only total beginners use it, my guess is because lecture notes are ten years out of date, and especially because lecturers advocating the use of Vector have spent too much time in academia to keep in touch and the old adage "if you can't do it, teach it" still holds true. – Bohemian Apr 10 '13 at 00:24
  • 4
    Aha, I found the answer myself about vector being legacy. Thanks, don't plan on using that anymore and you did help me improve the speed of my solution a bit more in comparison to regex+split. Yes, the original question asked for split but it is sometimes helpful to also have alternatives for people who will find this via google etc. Just imagine the difference in time over 1 million or 10 million records to split for this specific case. – Menelaos Apr 10 '13 at 00:49
  • 2
    Well, speed isn't everything. I firmly believe "less code is good" (many reasons - too many to discuss here). But rather than write your own code (if not using `split()`), I would look first to an existing library, and for CSV parsing there are many. – Bohemian Apr 10 '13 at 02:27
  • From years ago when I mucked with regexes in .NET a lot I found I could dramatically improve the performance by keeping a static copy of the regex object (thus pre-parsed into memory as its own parsing tree). No idea how to do the equivalent in java while still using String.split, but that is probably the big performance cost here. – Lisa May 23 '19 at 01:53
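On the question raised in the last comment: in Java, one way to avoid re-parsing the expression on every call is to compile it once with `java.util.regex.Pattern.compile` and call `Pattern.split` instead of `String.split`. The following is a minimal sketch; the class and constant names are illustrative only, and whether this closes the gap measured above has not been benchmarked here.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once and reused for every line; same expression as in the regex answer.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

    static String[] split(String line) {
        // Pattern.split drops trailing empty strings, just like String.split without a limit.
        return COMMA_OUTSIDE_QUOTES.split(line);
    }

    public static void main(String[] args) {
        String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        System.out.println(Arrays.toString(split(s)));
    }
}
```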
10

If your strings are all well-formed it is possible with the following regular expression:

String[] res = str.split(",(?=([^\"]|\"[^\"]*\")*$)");

The expression ensures that a split occurs only at commas which are followed by an even (or zero) number of quotes (and thus not inside such quotes).

Nevertheless, it may be easier to use a simple non-regex parser.
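A simple non-regex parser of that kind might look like the sketch below. Unlike the regex, it strips the surrounding quotes, and it treats a doubled quote (`""`) inside a quoted field as a literal quote. That second rule is an assumption borrowed from common CSV conventions and is not stated in the question. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleCsvLineParser {
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote -> literal quote (assumed convention)
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are ordinary characters
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        for (String field : parse(s)) {
            System.out.println("<" + field + ">");
        }
    }
}
```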

Howard
  • For reading a CSV file this works fine if you have this type of format: 987663,seepzBranch,"Seepz mumbai,andheri","near infra, flat no 23,raghilla mall thane",seepz, – abhishek ringsia Sep 28 '15 at 08:52
-1

When working with a CSV string, keep the following points in mind.

  1. Every field in a row either starts with a quote (") or does not. a) If it starts with a quote, it must be the value of a particular column. b) If it starts without one, it must be a header. Example: 'Header1,Header2,Header3,"value1","value2","value3"'. Here Header1, Header2 and Header3 are column names and the rest are values.

The main thing to remember when splitting is to check whether each split was done properly. a) Take the split value and count the quotes in it (the count must be even). b) If the count is odd, append the next split value. c) Repeat steps a and b until the quotes are balanced.
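The check-and-append procedure described above might be sketched as follows. This is only one reading of those steps: it assumes the comma removed by the plain split is put back when two pieces are rejoined, and that a piece left with unbalanced quotes at the end of the line is kept as it is. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndRejoin {
    // Count the double-quote characters in a piece.
    static int countQuotes(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    static List<String> split(String line) {
        // Plain split on every comma; -1 keeps trailing empty pieces.
        String[] pieces = line.split(",", -1);
        List<String> fields = new ArrayList<>();
        String pending = null;
        for (String piece : pieces) {
            // Step b: append the next piece, restoring the comma that was split away.
            pending = (pending == null) ? piece : pending + "," + piece;
            // Step a: an even quote count means the field is complete.
            if (countQuotes(pending) % 2 == 0) {
                fields.add(pending);
                pending = null;
            }
        }
        if (pending != null) {
            fields.add(pending); // unbalanced quotes: keep the remainder unchanged (assumption)
        }
        return fields;
    }

    public static void main(String[] args) {
        String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        for (String field : split(s)) {
            System.out.println("<" + field + ">");
        }
    }
}
```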

  • 1
    This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/31081350) – Marc Wrobel Feb 20 '22 at 21:16