I see that there are several similar questions, but I have not found any of the answers satisfactory. I have a comma-delimited file where each line looks something like this:

4477,52544,,,P,S,    ,,SUSAN JONES,9534 Black Bear Dr,,"CITY, NV 89506",9534 BLACK BEAR DR,,CITY,NV,89506,2008,,,,  ,     ,    , ,,1

The problem arises when a token uses quotes to escape a comma, as in "CITY, NV 89506".

I need a result where the escaped tokens are handled and every token is included, even empty ones.

Dale K
springcorn
  • I'd be tempted to do the parsing myself. It doesn't seem too difficult. – Ted Hopp Oct 04 '12 at 23:24
  • Splitting on this regex, which I found on another question, gets me pretty close: regex = ",(?=([^"]*"[^"]*")*[^"]*$)". The problem is that I end up with quotes in the results, and I can't figure out how to remove them. – springcorn Oct 05 '12 at 00:24
  • This question has been asked many times, actually. The keywords vary quite a bit, though. See http://stackoverflow.com/questions/6432408/regular-expression-to-match-csv-delimiters and http://stackoverflow.com/questions/6428053/validation-using-regular-expression-in-c-net-applicaiton for example. – Andrew Cheong Oct 05 '12 at 03:03
  • How do you want to treat spaces next to delimiter commas? For instance, is the seventh value in your example the empty string or a string of 4 spaces? – Ted Hopp Oct 05 '12 at 04:20
  • Ted: spaces or empty strings should still appear as values. That is part of the challenge here. – springcorn Oct 05 '12 at 17:47
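
For what it's worth, the regex quoted in the comments can probably be made to work by splitting with a limit of -1 (so trailing empty tokens are kept) and then trimming one pair of surrounding quotes from each token. This is only a sketch: the class name CsvRegexSplit and the method name split are made up for illustration, and it assumes quotes are balanced and never appear escaped inside a field.

```java
import java.util.regex.Pattern;

class CsvRegexSplit {
    // Regex from the comment above: a comma that is followed by an even
    // number of quote characters up to the end of the line, i.e. a comma
    // that is not inside a quoted token.
    private static final Pattern DELIMITER =
            Pattern.compile(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

    static String[] split(String line) {
        // A negative limit keeps trailing empty tokens.
        String[] tokens = DELIMITER.split(line, -1);
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            // Remove one pair of surrounding quotes, if present.
            if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
                tokens[i] = t.substring(1, t.length() - 1);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : split("4477,,\"CITY, NV 89506\",    ,1,")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Whitespace-only tokens are left untouched, which matches the requirement that spaces and empty strings still show up as values. The lookahead rescans the rest of the line for every comma, so this may be slow on very long lines.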

2 Answers

Consider a proper CSV parser such as opencsv. It is well tested (unlike a new, home-grown solution) and handles edge cases such as the one you describe (and lots you haven't thought about).

In the download, there is an examples folder which contains "addresses.csv" with this line:

Jim Sample,"3 Sample Street, Sampleville, Australia. 2615",jim@sample.com

In the same directory, the file AddressExample.java parses this file, and is highly relevant to your question.

Michael Easter

Here is one way to answer your question using only standard java.lang classes. I believe it does what you need.

private static final char QUOTE = '"';
private static final char COMMA = ',';
// Placeholder for commas inside quoted text; use whatever character you
// know will NEVER appear in the input String
private static final char SUB = 0x001A;

public void readLine(String line) {
    System.out.println("original: " + line);

    // Single pass: drop quotation marks and replace commas inside quoted
    // text with the substitute character.
    // Note: every quotation mark is dropped, so escaped quotes ("")
    // inside a quoted token are not preserved.
    StringBuilder cleaned = new StringBuilder(line.length());
    boolean quote = false;
    for (int index = 0; index < line.length(); index++) {
        char ch = line.charAt(index);
        if (ch == QUOTE) {
            quote = !quote;
        } else if (ch == COMMA && quote) {
            cleaned.append(SUB);
        } else {
            cleaned.append(ch);
        }
    }

    // Parse input into tokens; the -1 limit keeps trailing empty tokens
    String[] tokens = cleaned.toString().split(",", -1);
    // Restore commas in place of SUB characters
    for (int i = 0; i < tokens.length; i++) {
        tokens[i] = tokens[i].replace(SUB, COMMA);
    }

    // Display final results
    System.out.println("Final Parsed Tokens: ");
    for (String token : tokens) {
        System.out.println("[" + token + "]");
    }
}
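
A possible variation on the same idea, along the lines of the "do the parsing myself" comment under the question, is to build the tokens directly in one pass so that no substitute character is needed. The sketch below is not part of the original answer; the names CsvLineParser and parse are invented for illustration. It also assumes the input follows the common CSV convention (as in RFC 4180) that a doubled quote inside a quoted token stands for a literal quote.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineParser {
    static List<String> parse(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    // Assumed convention: "" inside quotes is a literal quote.
                    current.append('"');
                    i++;
                } else {
                    // Opening or closing quote: toggle state, emit nothing.
                    inQuotes = !inQuotes;
                }
            } else if (ch == ',' && !inQuotes) {
                // Delimiter outside quotes ends the current token,
                // even if that token is empty.
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        // The last token has no trailing comma, so add it explicitly.
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : parse("a,\"b, c\",,\"say \"\"hi\"\"\",  ,")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because each token is added as soon as its delimiter is seen, empty and whitespace-only tokens (including a trailing empty one) are all kept.
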
Dale K