27

I have a CSV file that comes in this format:

a1, a2, a3, "a4,a5", a6

Only fields that contain a comma are quoted.

Using Java, how can I parse this easily? Company policy does not let me use an open source CSV parser. Thanks.

superfell
HP.
  • No idea about easily, CSV has a few fiddly edge cases: escaped quotes – using several styles no less; and newlines in field values – fun if you have to report errors with the CSV line they occurred on. If you can't use an existing parser and might have to deal with these, write a parser. (Which is also fun to do if you're not allowed a parser generator.) – millimoose Oct 17 '11 at 22:59
  • if the company asks for no open source libs (regardless of the license) and you need help w/ a simple parse... – bestsss Oct 17 '11 at 23:35
  • @Inerdia, the parsing is around 30 lines of hand-written code, no need for a generator. – bestsss Oct 17 '11 at 23:37
  • possible duplicate of [Parsing CSV in java](http://stackoverflow.com/questions/3908012/parsing-csv-in-java) – Raedwald Apr 03 '14 at 13:18

6 Answers

25

You could use Matcher.find with the following regular expression:

\s*("[^"]*"|[^,]*)\s*

Here's a more complete example:

String s = "a1, a2, a3, \"a4,a5\", a6";
Pattern pattern = Pattern.compile("\\s*(\"[^\"]*\"|[^,]*)\\s*");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

See it working online: ideone
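
As a hedged variant of the pattern above: a plain `find` loop also reports an empty match at every separator, so one option is to anchor each match with `\G` and consume the trailing comma explicitly. The class and method names (`CsvRegexSplit`, `split`) are made up for illustration, and the sketch assumes a single-line record with no escaped (`""`) quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class CsvRegexSplit {
    // \G anchors each match where the previous one ended, so nothing is skipped.
    // Group 1: contents of a quoted field; group 2: an unquoted field;
    // group 3: the separator, or the empty string at end of input.
    private static final Pattern FIELD =
            Pattern.compile("\\G\\s*(?:\"([^\"]*)\"|([^,]*?))\\s*(,|$)");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.add(m.group(1) != null ? m.group(1) : m.group(2));
            if (m.group(3).isEmpty()) {
                break; // reached end of input: stop before an extra empty match
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a1, a2, a3, \"a4,a5\", a6"));
        System.out.println(split("a,,b")); // the empty cell is kept as one empty field
    }
}
```

Because the separator is part of each match, an empty cell shows up exactly once, and surrounding whitespace and quotes are dropped from the returned values.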

Mark Byers
  • More generally, in a CSV file a value is enclosed in quotes as soon as it contains the separator, a newline and/or quotes… – mousio Oct 17 '11 at 23:10
  • @Mark, double quotes ("") is used to represent a single ". Besides, using regExp is beyond overkill – bestsss Oct 17 '11 at 23:33
  • This does not work well because it adds empty strings between elements, which creates a problem if there are empty cells in the CSV. – Random42 Feb 27 '14 at 12:26
  • This is a better answer (doesn't add empty strings): http://stackoverflow.com/a/15739087/1068385 – juhoautio Mar 10 '14 at 10:00
4

I came across this same problem (but in Python). One way I found to solve it without regexes: when you get the line, check for any quotes. If there are quotes, split the string on the quotes, then split the even-indexed elements of the resulting array on commas. The odd-indexed elements are the complete quoted values.

I'm no Java coder, so take this as pseudocode...

line = String[];
    if ('"' in row){
        vals = row.split('"');
        for (int i =0; i<vals.length();i+=2){
            line+=vals[i].split(',');
        }
        for (int j=1; j<vals.length();j+=2){
            line+=vals[j];
        }
    }
    else{
        line = row.split(',')
    }

Alternatively, use a regex.
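
For readers who need the quote-splitting idea in Java, here is a rough sketch that also keeps the fields in their original order. The names `QuoteSplit` and `parse` are invented for this example. It assumes balanced quotes, no escaped (`""`) quotes, and that each quoted field spans the whole cell.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class QuoteSplit {
    static List<String> parse(String row) {
        List<String> fields = new ArrayList<>();
        // Odd-indexed parts were inside quotes; even-indexed parts were outside.
        String[] parts = row.split("\"", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i % 2 == 1) {
                fields.add(parts[i]); // quoted value: keep its commas
                continue;
            }
            String seg = parts[i];
            if (i > 0) { // drop the comma that closes the preceding quoted field
                int first = seg.indexOf(',');
                if (first < 0) continue; // nothing but padding after the quote
                seg = seg.substring(first + 1);
            }
            if (i < parts.length - 1) { // drop the comma that opens the next quoted field
                int last = seg.lastIndexOf(',');
                if (last < 0) continue; // nothing but padding before the quote
                seg = seg.substring(0, last);
            }
            for (String f : seg.split(",", -1)) {
                fields.add(f.trim());
            }
        }
        return fields;
    }
}
```

Walking the parts in index order, instead of handling all even parts and then all odd parts, is what preserves the column order.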

K4KYA
  • I modified this a bit because I need to maintain the result order, but the idea of splitting on the double-quotes and using the index to determine whether it needs to be further split works nicely, and also nice not dealing with a RegEx. – James Toomey Feb 27 '23 at 18:31
3

Here is some code for you. I hope using code from here doesn't count as open source (which it is).

package bestsss.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SplitCSVLine {
    public static String[] splitCSV(BufferedReader reader) throws IOException{
        return splitCSV(reader, null, ',', '"');
    }

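    /**
     * Reads one CSV record; a record spans several lines when a quoted field contains newlines.
     * @param reader - a line-oriented reader (we are lazy and rely on readLine)
     * @param expectedColumns - optional int[1]: on entry, the expected column count (used to size the list); on return, the actual count. May be null.
     * @param separator - the C(omma) in CSV, or an alternative such as a semicolon
     * @param quote - the double quote char ('"') or an alternative
     * @return String[] containing the fields
     * @throws IOException if reading from the reader fails
     */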
    public static String[] splitCSV(BufferedReader reader, int[] expectedColumns, char separator, char quote) throws IOException{       
        final List<String> tokens = new ArrayList<String>(expectedColumns==null?8:expectedColumns[0]);
        final StringBuilder sb = new StringBuilder(24);

        for(boolean quoted=false;;sb.append('\n')) {//lazy, we do not preserve the original new line, but meh
            final String line = reader.readLine();
            if (line==null)
                break;
            for (int i = 0, len= line.length(); i < len; i++) { 
                final char c = line.charAt(i);
                if (c == quote) {
                    if( quoted   && i<len-1 && line.charAt(i+1) == quote ){//2xdouble quote in quoted 
                        sb.append(c);
                        i++;//skip it
                    }else{
                        if (quoted){
                            //next symbol must be either separator or eol according to RFC 4180
                            if (i==len-1 || line.charAt(i+1) == separator){
                                quoted = false;
                                continue;
                            }
                        } else{//not quoted
                            if (sb.length()==0){//at the very start
                                quoted=true;
                                continue;
                            }
                        }
                        //if fall here, bogus, just add the quote and move on; or throw exception if you like to
                        /*
                        5.  Each field may or may not be enclosed in double quotes (however
                           some programs, such as Microsoft Excel, do not use double quotes
                           at all).  If fields are not enclosed with double quotes, then
                           double quotes may not appear inside the fields.
                      */ 
                        sb.append(c);                   
                    }
                } else if (c == separator && !quoted) {
                    tokens.add(sb.toString());
                    sb.setLength(0); 
                } else {
                    sb.append(c);
                }
            }
            if (!quoted)
                break;      
        }
        tokens.add(sb.toString());//add last
        if (expectedColumns !=null)
            expectedColumns[0] = tokens.size();
        return tokens.toArray(new String[tokens.size()]);
    }
    public static void main(String[] args) throws Throwable{
        java.io.StringReader r = new java.io.StringReader("222,\"\"\"zzzz\", abc\"\" ,   111   ,\"1\n2\n3\n\"");
        System.out.println(java.util.Arrays.toString(splitCSV(new BufferedReader(r))));
    }
}
bestsss
1

The code below seems to work well and can handle quotes within quotes.

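import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches a leading quoted field (with "" as an escaped quote) up to and including its comma.
final static Pattern quote = Pattern.compile("^\\s*\"((?:[^\"]|(?:\"\"))*?)\"\\s*,");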

public static List<String> parseCsv(String line) throws Exception
{       
    List<String> list = new ArrayList<String>();
    line += ",";

    for (int x = 0; x < line.length(); x++)
    {
        String s = line.substring(x);
        if (s.trim().startsWith("\""))
        {
            Matcher m = quote.matcher(s);
            if (!m.find())
                throw new Exception("CSV is malformed");
            list.add(m.group(1).replace("\"\"", "\""));
            x += m.end() - 1;
        }
        else
        {
            int y = s.indexOf(",");
            if (y == -1)
                throw new Exception("CSV is malformed");
            list.add(s.substring(0, y));
            x += y;
        }
    }
    return list;
}
craigrs84
0

Here is my solution, in Python. It handles a single level of quotes.

def parserow(line):
    ''' this splits the input line on commas ',' but allowing commas within fields
    if they are within double quotes '"'
    example:
        fieldname1,fieldname2,fieldname3
        field value1,"field, value2, allowing, commas", field value3
    gives:
        ['field value1','"field, value2, allowing, commas"', ' field value3']
    '''
    out = []
    current_field = ''
    within_quote = False
    for c in line:
        if c == '"':
            within_quote = not within_quote
        if c == ',':
            if not within_quote:
                out.append(current_field)
                current_field = ''
                continue
        current_field += c
    if len(current_field) != 0:
        out.append(current_field)
    return out
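
Since the question asks for Java, here is a rough Java counterpart of the loop above. It is only a sketch: `ToggleSplit` and `parseRow` are made-up names, and unlike the Python version it drops the quote characters from the values and keeps a trailing empty field. Escaped (`""`) quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class ToggleSplit {
    static List<String> parseRow(String line) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean withinQuote = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                withinQuote = !withinQuote; // toggle state and drop the quote itself
            } else if (c == ',' && !withinQuote) {
                out.add(current.toString()); // a comma outside quotes ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        out.add(current.toString()); // last field, even if it is empty
        return out;
    }
}
```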
nakhodkin
0
public static void main(String[] args) {
    
    final StringBuilder sb = new StringBuilder(240000);
    String s = ""; // put the CSV line to process here
    boolean start = false;
    boolean ending = false;
    boolean nestedQuote = false;
    boolean nestedComma = false;
    
    char previous = 0 ;
    
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
       if(!start &&c=='"' && previous == ',' && !nestedQuote ) {
            System.out.println("started");
            sb.append(c); 
            start = true;
            previous = c;
            System.out.println(sb);
            continue;
        }
       
       if(start && c==',' && previous == '"')  
       {
        nestedQuote = false;
        System.out.println("ended");
        sb.append(c); 
        previous = c;
        System.out.println(sb);
        start = false;
        ending = true;
        continue;
       }
     
       if(start  && c== ',' && previous!='\"'&& !nestedQuote) 
       {
           previous = c;
           sb.append(';'); 
           continue;
       }
           
       
       if(start && ending && c== '"') 
       {
           nestedQuote = true;
           sb.append(c); 
           previous = c;
           continue;
       }
       if(start && c== '"' && nestedQuote) 
       {
           nestedQuote = false;
           previous = c;
           continue;
       }
       
       if(start && c==',' && nestedQuote) 
       {
           nestedComma = true;
           sb.append(';'); 
           previous = c;
           continue;
       }
       
       if(start &&c==',' && nestedQuote && nestedComma) 
       {
           nestedComma = false;
           previous = c;
           continue;
       }
       
        sb.append(c);  
        previous = c;
        
    }
    System.out.println(sb.toString().replaceAll("\"", ""));
}
  • Generic solution for CSV nested quotes and commas – sandeep Apr 07 '23 at 13:51