
I have an instruction like:

db.insert( {
    _id:3,
    cost:{_0:11},
    description:"This is a description.\nCool, isn\'t it?"
});

The Eclipse plugin I am using, MonjaDB, splits the input on newlines, so each line is treated as a separate instruction, which is bad. I fixed that by splitting on ;(\r|\n)+ instead, which keeps each instruction whole. However, when I then strip the newlines between the parts of the JSON, the \n and \r inside the JSON strings are stripped as well.

How do I avoid removing \t, \r, and \n inside JSON strings, which are, of course, delimited by "" or ''?

Discipol
  • You have several layers of difficulty: **1-** you want to split instead of matching **2-** Java supports only finite lookbehinds **3-** What if there were escaped double or single quotes in your strings? You're basically doomed IMO. You need a proper parser. – HamZa Aug 13 '13 at 08:11
  • @HamZa *not* replacing something inside strings is actually a bit simpler and doesn't really need unbounded lookbehind. You just match both whitespace and strings like `\s|(stringRegex)` and replace with `$1`. – Martin Ender Aug 13 '13 at 08:17
  • @m.buettner But he's splitting, not replacing. – HamZa Aug 13 '13 at 08:19
  • @HamZa to me it sounded like the sanitisation is a separate process from the splitting. – Martin Ender Aug 13 '13 at 08:30
  • Yes, first step is splitting the DB commands, I can guarantee a ; and at least one newline between them. Hey, could I replace the stuff in the json string with #@$, then split \r\n, then revert the #@$ back! But how do I detect the \r \n \t within the json strings? – Discipol Aug 13 '13 at 08:30

1 Answer


You need to ignore whitespace when it appears within quotes. As suggested by one of the commenters:

\s+ | ( "  (?: [^"\\]  |  \\ . ) * " )              // White-space inserted for readability

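This matches either a run of Java whitespace or a double-quoted string, where a string consists of an opening ", then any number of items that are each either a character other than a backslash or a quote, or a backslash followed by any character, and then a closing ". Whitespace inside a string is consumed as part of the string match, so it is never matched on its own.

Replace each match with $1 when $1 is not null, and with nothing otherwise: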

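    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
    // COMMENTS lets the pattern contain whitespace for readability; DOTALL lets '.' match newlines.
    Pattern clean = Pattern.compile(" \\s+ | ( \" (?: [^\"\\\\] | \\\\ . ) * \" ) ", Pattern.COMMENTS | Pattern.DOTALL);

    StringBuffer sb = new StringBuffer();
    Matcher m = clean.matcher( json );
    while (m.find()) {
        // Drop the match itself (either whitespace or a quoted string).
        m.appendReplacement(sb, "" );
        // Don't pass m.group(1) to appendReplacement: any $ or \ it contains would be
        // interpreted as a group reference or an escape.
        if ( m.group(1) != null )
            sb.append( m.group(1) );   // put the quoted string back untouched
    }
    m.appendTail(sb);

    String cleanJson = sb.toString();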

This is totally off the top of my head but I'm pretty sure it's close to what you want.
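The commenter's suggestion of replacing with `$1` can also be written as a single `replaceAll` call. What follows is a minimal sketch, not the code from this answer: the class name `WhitespaceStripper` and the method `strip` are made up for illustration. It relies on two behaviours of `java.util.regex`: a `$1` reference to a group that did not take part in the match contributes nothing, and the text of a captured group is copied literally, so a `$` inside the quoted string is not reinterpreted.

```java
import java.util.regex.Pattern;

public class WhitespaceStripper {
    // Same pattern as above, written without COMMENTS mode:
    // a run of whitespace, or a complete double-quoted string captured in group 1.
    private static final Pattern CLEAN =
            Pattern.compile("\\s+|(\"(?:[^\"\\\\]|\\\\.)*\")", Pattern.DOTALL);

    // Hypothetical helper: removes whitespace outside double-quoted strings.
    static String strip(String input) {
        // "$1" is the quoted string when one was matched, and empty for whitespace.
        return CLEAN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String json = "db.insert( {\n    _id:3,\n    description:\"a  b\\n$1\"\n});";
        System.out.println(strip(json));
    }
}
```

The explicit `appendReplacement` loop and this one-liner should give the same result; the loop simply makes the "keep the string, drop the whitespace" decision visible.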

Edit: I've just got access to a Java IDE and tried out my solution. I had made a couple of mistakes in my code, including using \. instead of . in the Pattern. I have fixed those and run it on a variation of your sample:

db.insert( {
    _id:3,
    cost:{_0:11},
    description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"
});

The code:

    String json = "db.insert( {\n" +
            "    _id:3,\n" +
            "    cost:{_0:11},\n" +
            "    description:\"This is a \\\"description\\\" with an embedded newline: \\\"\\n\\\".\\nCool, isn\\'t it?\"\n" +
            "});";

    // insert the code above here

    System.out.println(cleanJson);

This produces:

db.insert({_id:3,cost:{_0:11},description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"});

This is the same JSON expression, with all whitespace outside quoted strings removed and all whitespace and newline escapes inside quoted strings retained.
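The pattern above only recognises double-quoted strings, while the question also mentions strings delimited by ''. The same idea extends to both quote styles, and to the earlier step of splitting the script on ; followed by line breaks. The sketch below is one possible way to do that, under stated assumptions: the class `CommandCleaner` and the methods `clean` and `splitCommands` are hypothetical names, nothing here is MonjaDB's API, and every quote outside a string is assumed to open a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommandCleaner {
    // A quoted string, "..." or '...', with backslash escapes allowed inside.
    private static final String STRING =
            "\"(?:[^\"\\\\]|\\\\.)*\"|'(?:[^'\\\\]|\\\\.)*'";

    // Whitespace outside strings, or a whole string captured in group 1.
    private static final Pattern CLEAN =
            Pattern.compile("\\s+|(" + STRING + ")", Pattern.DOTALL);

    // A whole string (group 1), or a terminator: ';' followed by line breaks.
    private static final Pattern SPLIT =
            Pattern.compile("(" + STRING + ")|;[\\r\\n]+", Pattern.DOTALL);

    // Removes whitespace outside quoted strings; strings are copied unchanged.
    static String clean(String command) {
        return CLEAN.matcher(command).replaceAll("$1");
    }

    // Splits a script into commands at ';' + line break, ignoring any
    // such sequence that occurs inside a quoted string.
    static List<String> splitCommands(String script) {
        List<String> commands = new ArrayList<>();
        Matcher m = SPLIT.matcher(script);
        int start = 0;
        while (m.find()) {
            if (m.group(1) == null) {               // a terminator, not a string
                commands.add(script.substring(start, m.start() + 1)); // keep the ';'
                start = m.end();
            }
        }
        String rest = script.substring(start);
        if (!rest.trim().isEmpty()) {
            commands.add(rest);                     // last command without a trailing newline
        }
        return commands;
    }

    public static void main(String[] args) {
        String script = "db.insert( { a:'x;\n y' } );\ndb.insert( { b:\"z\" } );";
        for (String command : splitCommands(script)) {
            System.out.println(clean(command));
        }
    }
}
```

Matching strings as whole tokens in both patterns is what keeps their contents out of reach of the whitespace and terminator alternatives. An apostrophe in unquoted text would still be read as the start of a string, which is one reason a real parser remains the more robust choice.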

Adrian Pronk