I need to write a regular expression for a string read from a file:

apple,boy,cat,"dog,cat","time\" after\"noon"

I need to split it into

apple
boy
cat
dog,cat
time"after"noon

I tried using

Pattern pattern = Pattern.compile("[\\\"]");
String items[] = pattern.split(match);

for the second part, but I could not get the right answer. Can you help me with this?

Captain Ford
user1272855
  • Why do you need to use regex for this? You could replace "\" with an empty space after splitting on the comma? –  Mar 02 '13 at 21:41
  • Try changing your regex to "\\\""; this will help a little but won't get you to your final goal. – Scott Mar 02 '13 at 21:42
  • A regular expression cannot accomplish what you are trying to do. Consider what will happen if you try to parse this line: `apple,boy,"C:\\","dog,cat"` Instead of using a regular expression, I recommend you simply read the characters one by one and handle backslash-escaping in your own code. – VGR Mar 02 '13 at 21:56
  • A split does not manipulate the substrings in any way, which is what you're asking for. (`\"` becoming `"` in the result.) – Qtax Mar 03 '13 at 03:48
  • @VGR, a regex can parse (tokenize) this simple regular grammar just fine. For example: `\G([^",]*|"(?:[^"\\]+|\\.)*")(?:,|$)`, which would properly match `"C:\\"` above. – Qtax Mar 03 '13 at 04:43

3 Answers

Since your question is more of a parsing problem than a regex problem, here's another solution that will work:

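import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CsvReader {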

    Reader r;
    int row, col;
    boolean endOfRow;

    public CsvReader(Reader r){
        this.r = r instanceof BufferedReader ? r : new BufferedReader(r);
        this.row = -1;
        this.col = 0;
        this.endOfRow = true;
    }

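    /**
     * Returns the next field in the input stream, or null when no input is left.
     *
     * @return the next field with surrounding whitespace trimmed, or null at end of input
     * @throws IOException if reading from the underlying reader fails
     */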
    public String next() throws IOException {
        int i = r.read();
        if(i == -1)
            return null;

        if(this.endOfRow){
            this.row++;
            this.col = 0;
            this.endOfRow = false;
        } else {
            this.col++;
        }

        StringBuilder b = new StringBuilder();
outerLoop:  
        while(true){
            char c = (char) i;
            if(i == -1)
                break;
            if(c == ','){
                break;
            } else if(c == '\n'){
                endOfRow = true;
                break;
            } else if(c == '\\'){
                i = r.read();
                if(i == -1){
                    break;
                } else {
                    b.append((char)i);
                }
            } else if(c == '"'){
                while(true){
                    i = r.read();

                    if(i == -1){
                        break outerLoop;
                    }
                    c = (char)i;
                    if(c == '\\'){
                        i = r.read();
                        if(i == -1){
                            break outerLoop;
                        } else {
                            b.append((char)i);
                        }
                    } else if(c == '"'){
                        r.mark(2);
                        i = r.read();
                        if(i == '"'){
                            b.append('"');
                        } else {
                            r.reset();
                            break;
                        }
                    } else {
                        b.append(c);
                    }
                }
            } else {
                b.append(c);
            }
            i = r.read();
        }

        return b.toString().trim();
    }


    public int getColNum(){
        return col;
    }

    public int getRowNum(){
        return row;
    }

    public static void main(String[] args){

        try {
            String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"\nquick\"fix\" hello, \"\"\"who's there?\"";
            System.out.println(input);
            Reader r = new StringReader(input);
            CsvReader csv = new CsvReader(r);
            String s;
            while((s = csv.next()) != null){
                System.out.println("R" + csv.getRowNum() + "C" + csv.getColNum() + ": " + s);
            }
        } catch(IOException e){
            e.printStackTrace();
        }
    }
}

Running this code, I get the output:

R0C0: apple
R0C1: boy
R0C2: cat
R0C3: dog,cat
R0C4: time" after"noon
R1C0: quickfix hello
R1C1: "who's there?

This should fit your needs pretty well.

A few disclaimers, though:

  • It won't catch errors in the syntax of the CSV format, such as an unescaped quotation mark in the middle of a value.
  • It won't perform any character conversion (such as converting "\n" to a newline character). Backslashes simply cause the following character to be treated literally, including other backslashes. (That should be easy enough to alter if you need additional functionality.)
  • Some CSV files escape quotes by doubling them rather than using a backslash. This code now looks for both.
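
If you do want character conversion, one way to add it is a small lookup applied to the character that follows each backslash. The helper below is hypothetical and not part of the class above; it only shows the idea, and the set of escapes it recognises (`\n`, `\t`, `\r`) is my own choice:

```java
class EscapeSketch {
    // Maps the character after a backslash to the character it stands for.
    // Anything not listed here is kept literally, as the reader above does.
    static char translate(char escaped) {
        switch (escaped) {
            case 'n': return '\n';
            case 't': return '\t';
            case 'r': return '\r';
            default:  return escaped;
        }
    }

    // Applies translate() to every backslash escape in an already-extracted value.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                out.append(translate(raw.charAt(++i)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Inside the reader, the same `translate` call could presumably replace the two `b.append((char)i)` lines that follow a backslash.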

Edit: Looked up the csv format, discovered there's no real standard, but updated my code to catch quotes escaped by doubling rather than backslashes.

Edit 2: Fixed. Should work as advertised now. Also modified it to test the tracking of row and column numbers.

Captain Ford
  • Actually there is a standard: [RFC 4180](http://tools.ietf.org/html/rfc4180). But it specifies the old Microsoft-style quoting, meaning quotes in a value are doubled rather than backslash-escaped. – VGR Mar 03 '13 at 13:52
  • I think this won't work for an input string like "r\\at,ze\\\"bra,\"dog,cat\",\"animal,ant,fox,house". I tried it: if there is a double quote at the start of the string, the quote is never closed and the value cannot be resolved. – user1272855 Mar 04 '13 at 00:42
  • Double quote escaping only works inside a pair of quotes. "" is an empty string. """" would resolve to ". I tested every variation I can think of and it resolves the way I would expect. – Captain Ford Mar 05 '13 at 06:18
0

First thing: String.split() uses the regex to find the separators, not the substrings.
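
As a rough illustration of that point (the class and method names are made up for this sketch), splitting on a bare comma treats every comma as a separator, so the quoted value is cut in two and the quotes and backslashes are left in place:

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // split() only locates separators; it knows nothing about quotes
    // and never alters the pieces it returns.
    static List<String> naiveSplit(String line) {
        return Arrays.asList(line.split(","));
    }

    public static void main(String[] args) {
        String line = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";
        for (String piece : naiveSplit(line)) {
            System.out.println(piece);
        }
    }
}
```

With the sample line this gives six pieces rather than five, because `"dog,cat"` is split at its inner comma.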

Edit: I'm not sure if this can be done with String.split(). I think the only way you could deal with the quotes while only matching the comma would be with lookahead and lookbehind, and that's going to break in quite a lot of cases.

Edit 2: I'm pretty sure it can be done with a regular expression. And I'm sure this one case could be solved with String.split(), but a general solution wouldn't be simple.
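
For what it's worth, a matcher-based approach along the lines of the regex Qtax suggests in the comments on the question might look like the sketch below. It is an adaptation, not his exact pattern: the separator is captured in a second group so the loop can tell when it has reached the end of the line, and the quantifiers inside the quotes are made possessive to avoid heavy backtracking on an unterminated quote. The class name, the method name, and the unescaping step are assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCsvSketch {
    // Group 1: an unquoted field, or a quoted field with backslash escapes.
    // Group 2: the separator, either a comma or the (empty) end of the line.
    private static final Pattern FIELD =
            Pattern.compile("\\G([^\",]*|\"(?:[^\"\\\\]++|\\\\.)*+\")(,|$)");

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        boolean sawEnd = false;
        while (!sawEnd && m.find()) {
            String raw = m.group(1);
            if (raw.startsWith("\"")) {
                // Drop the surrounding quotes, then turn \x into x.
                raw = raw.substring(1, raw.length() - 1).replaceAll("\\\\(.)", "$1");
            }
            fields.add(raw);
            // An empty separator means the match ended at the end of the line.
            sawEnd = m.group(2).isEmpty();
        }
        if (!sawEnd) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";
        for (String field : parse(line)) {
            System.out.println(field);
        }
    }
}
```

This sketch works on one line at a time and does not handle doubled quotes (`""`) or values that span several lines.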

Basically, you're looking for anything that isn't a comma ([^,]) as input, and you can handle quotes as a separate case. I've gotten most of the way there myself. I'm getting this as output:

apple

boy

cat


dog

cat



time\" after\"noon

But I'm not sure why it has so many blank lines.

My complete code is:

String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";

Pattern pattern =
        Pattern.compile("(\\s|[^,\"\\\\]|(\\\\.)||(\".*\"))*");
Matcher m = pattern.matcher(input);

while(m.find()){
    System.out.println(m.group());
}
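
As for the blank lines, my best reading is that they are empty matches. The outer `*` lets the whole pattern match nothing, and the empty alternative between the two `|` characters is tried before the quoted-string alternative, so at every comma and every quote `find()` succeeds with a zero-length match. A small sketch that separates the two kinds of match (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Same pattern as in the snippet above.
    static final Pattern PATTERN =
            Pattern.compile("(\\s|[^,\"\\\\]|(\\\\.)||(\".*\"))*");

    // Returns only the matches that consumed at least one character.
    static List<String> nonEmptyMatches(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            if (m.end() > m.start()) {
                result.add(m.group());
            }
        }
        return result;
    }

    // Counts the zero-length matches, each of which prints as a blank line.
    static int countEmptyMatches(String input) {
        int count = 0;
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            if (m.end() == m.start()) {
                count++;
            }
        }
        return count;
    }
}
```

Skipping the empty matches removes the blank lines, but `dog` and `cat` still come out as separate values and the backslashes stay in place, so the pattern itself would need more work.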

I guess I'm almost there, though. It's spitting out ... oh hey, I see what's going on here. I think I can fix that.

But I'll echo the commenter above: if there's no requirement to use a regular expression, it's probably simpler and safer to read one character at a time and implement the logic manually. If your regex isn't picture-perfect, it could cause all kinds of unpredictable weirdness down the line.

Captain Ford
  • I'm working with CSV files that contain a lot of data. The problem is having a , (comma) as part of a string value. Such values are wrapped in double quotes, so cat,boy is written as "cat,boy", and to include a " inside such a value it is escaped. Thanks in advance, and thanks for the replies and suggestions. – user1272855 Mar 02 '13 at 22:00
  • I'm going to add an alternate solution below, since there's no regex requirement. I'm not going to be able to find a good regex for this. Handling the backslashes is turning out to be really difficult. – Captain Ford Mar 02 '13 at 22:19
  • Thanks a lot for the replies, without regexp it worked fine for me. thanks for the help and suggestions :) – user1272855 Mar 02 '13 at 23:12
0

I am not really sure about this, but you could have a go at Pattern.compile("[\\\\\"]");

\ is an escape character in both Java string literals and regular expressions, so to match a single literal \ in the input, \\\\ is needed in the Java source.
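
To make the escaping layers concrete, here is a small sketch (the class and method names are invented for the example). The Java literal `"[\\\\\"]"` becomes the regex `[\\"]`, which matches one backslash or one double quote:

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // Java source "[\\\\\"]" -> regex [\\"] -> one backslash or one double quote.
    static final Pattern BACKSLASH_OR_QUOTE = Pattern.compile("[\\\\\"]");

    // Removes every backslash and every double quote from the value.
    static String stripBackslashesAndQuotes(String s) {
        return BACKSLASH_OR_QUOTE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String field = "\"time\\\" after\\\"noon\"";
        System.out.println(stripBackslashesAndQuotes(field));
    }
}
```

Note that this also removes the escaped quotes the question wants to keep, so it only demonstrates how to match the characters, not a full solution.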

A similar thing worked for me in another context and I hope it solves your problem too.

Community
Swayam
  • I am afraid OP's problem lies deeper than the \ literal in the regex. – Pshemo Mar 02 '13 at 23:12
  • Yes, I realize that his problem is more of a parsing error. It would be of little help to him. But I guess it would at least help him to detect the `\\` in the expression. – Swayam Mar 02 '13 at 23:14