7

I want to split a string using whitespace as the delimiter, but it should handle quoted strings intelligently. E.g. for a string like

"John Smith" Ted Barry 

It should return three strings John Smith, Ted and Barry.

fastcodejava
  • 2
    You probably need to split out the quote enclosed strings first, then split the rest of the string by whitespace. There must be some questions around here about how to do the first step. The second step is trivial. – jahroy May 22 '12 at 02:50
  • 1
    And what have you tried? – Basilio German May 22 '12 at 02:51
  • 2
    A decent CSV parser library would work well for you. Most will allow selection of delimiter and will respect and avoid splitting quoted text. – Hovercraft Full Of Eels May 22 '12 at 02:51
  • 4
    You will run into trouble when you have an odd number of quotes. What would you want to do if this happens? – Basilio German May 22 '12 at 02:54
  • 1
    I wrote some (really) shitty code for this a long time ago. I cannot remember whether it works for everything or not, but it should have gone through quite a lot of bad inputs. I don't have time to clean up the code, so please ignore anything to do with cmdId: http://pastebin.com/aZngu65y – nhahtdh May 22 '12 at 03:08

5 Answers

10

After messing around with it, you can use Regex for this. Run the equivalent of "match all" on:

((?<=("))[\w ]*(?=("(\s|$))))|((?<!")\w+(?!"))

A Java Example:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Test
{ 
    public static void main(String[] args)
    {
        String someString = "\"Multiple quote test\" not in quotes \"inside quote\" \"A work in progress\"";
        Pattern p = Pattern.compile("((?<=(\"))[\\w ]*(?=(\"(\\s|$))))|((?<!\")\\w+(?!\"))");
        Matcher m = p.matcher(someString);

        while(m.find()) {
            System.out.println("'" + m.group() + "'");
        }
    }
}

Output:

'Multiple quote test'
'not'
'in'
'quotes'
'inside quote'
'A work in progress'

The regular expression breakdown with the example used above can be viewed here:

http://regex101.com/r/wM6yT9


With all that said, regular expressions should not be the go-to solution for everything - I was just having fun. This example leaves a lot of edge cases open, such as the handling of Unicode characters, symbols, etc. You would be better off using a tried-and-true library for this sort of task. Take a look at the other answers before using this one.
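
For the Unicode and symbol edge cases mentioned above, here is a rough sketch of a simpler pattern. It is not the regex from this answer: it swaps the lookarounds for two capture groups and uses `[^"]` and `\S` instead of `\w`, so it is not limited to ASCII word characters. The class and method names (`QuotedSplit`, `split`) are made up for illustration. It assumes a quote only starts a token at the beginning of a word, and an unbalanced quote is simply left attached to its word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedSplit {
    // Group 1: text between a pair of double quotes.
    // Group 2: a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Hypothetical helper: returns the tokens with surrounding quotes removed.
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups takes part in each match.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [John Smith, Ted, Barry]
        System.out.println(split("\"John Smith\" Ted Barry"));
        // extra spaces do not produce empty strings: prints [a, b c, d]
        System.out.println(split("  a   \"b c\"   d  "));
    }
}
```

Because the unquoted alternative is `\S+`, input like `hello"John Smith"` is split at the space only, so this sketch is only suitable when quoted sections are separated from other words by whitespace.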

Jay
  • I am not sure if the input contains Unicode or not, but your code will not be able to handle it. – nhahtdh May 22 '12 at 03:26
  • This is a good example, +1. Why don't you add an if to check whether m.group() returns a blank space? That way you don't have to output the blank spaces. – Basilio German May 22 '12 at 03:48
  • Nope, it does not work properly when there are 2 quoted strings. The Unicode problem still persists (and the u flag is Unicode case-sensitive, nothing to do with Unicode matching). – nhahtdh May 22 '12 at 05:28
  • 1
    The (?u) is not necessary, from my understanding of the doc. Instead of \w, check out \p{L}, which will match any Unicode **letter**. – nhahtdh May 23 '12 at 01:01
  • Matt's answer with the Apache commons-lang library is much cleaner and much safer. – Zoltán Dec 03 '13 at 15:13
  • @Zoltán I agree. This is a community wiki answer but I'll clean it up a little and make a note that regex isn't the only solution for those who visit this question. – Jay Dec 03 '13 at 15:16
4

Try this ugly bit of code.

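    // Requires java.util.ArrayList and java.util.List.
    String str = "hello my dear \"John Smith\" where is Ted Barry";
    List<String> resultList = new ArrayList<String>();
    StringBuilder builder = new StringBuilder();
    boolean inQuote = false; // true while inside a quoted phrase
    for (String s : str.split("\\s")) {
        boolean opens = !inQuote && s.startsWith("\"");
        boolean closes = s.endsWith("\"") && (inQuote || s.length() > 1);
        if (opens) {
            s = s.substring(1);                  // drop the opening quote
        }
        if (closes) {
            s = s.substring(0, s.length() - 1);  // drop the closing quote
        }
        builder.append(s);
        inQuote = (inQuote || opens) && !closes;
        if (inQuote) {
            builder.append(" ");                 // the phrase continues with the next word
        } else {
            resultList.add(builder.toString());  // word or phrase is complete
            builder.delete(0, builder.length());
        }
    }
    System.out.println(resultList);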
Adeel Ansari
  • Excessive blank space will cause the program to generate empty strings. – nhahtdh May 22 '12 at 03:58
  • @nhahtdh: O'yeah. I just provided a hint, actually. Not a 100% working solution. Trevor Senior, nailed it down well. That also has a same issue of blank spaces, though. But that's not a real issue and can be fixed easily. – Adeel Ansari May 22 '12 at 04:00
  • His actually has a problem with Unicode, and excessive blank space will also generate empty strings. – nhahtdh May 22 '12 at 04:02
  • 1
    +1 learned a bit of regex there in your answer. Fixed my issue with blank space and unicode support - it all came down to a silly regex mistake. `*` vs `+`. – Jay May 22 '12 at 05:15
  • @TrevorSenior: Actually, I don't know why I came up with that stupid regex. Otherwise, only `\\s` would have sufficed. Fixed that already. – Adeel Ansari May 22 '12 at 05:57
  • Ah ok. I was wondering what the `&&` bit did but you removed that as well. – Jay May 22 '12 at 13:37
3

Well, I made a small snippet that does what you want, plus a bit more. Since you did not specify further conditions, I did not go to the trouble of handling them. I know this is a dirty way to do it, and you can probably get better results with something that already exists, but for the fun of programming, here is the example:

    String example = "hello\"John Smith\" Ted Barry lol\"Basi German\"hello";
    int wordQuoteStartIndex=0;
    int wordQuoteEndIndex=0;

    int wordSpaceStartIndex = 0;
    int wordSpaceEndIndex = 0;

    boolean foundQuote = false;
    for(int index=0;index<example.length();index++) {
        if(example.charAt(index)=='\"') {
            if(foundQuote==true) {
                wordQuoteEndIndex=index+1;
                //Print the quoted word
                System.out.println(example.substring(wordQuoteStartIndex, wordQuoteEndIndex));//here you can remove quotes by changing to (wordQuoteStartIndex+1, wordQuoteEndIndex-1)
                foundQuote=false;
                if(index+1<example.length()) {
                    wordSpaceStartIndex = index+1;
                }
            }else {
                wordSpaceEndIndex=index;
                if(wordSpaceStartIndex!=wordSpaceEndIndex) {
                    //print the word in spaces
                    System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
                }
                wordQuoteStartIndex=index;
                foundQuote = true;
            }
        }

        if(foundQuote==false) {
            if(example.charAt(index)==' ') {
                wordSpaceEndIndex = index;
                if(wordSpaceStartIndex!=wordSpaceEndIndex) {
                    //print the word in spaces
                    System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
                }
                wordSpaceStartIndex = index+1;
            }

            if(index==example.length()-1) {
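                // the start < length check avoids printing an empty string after a trailing space
                if(example.charAt(index)!='\"' && wordSpaceStartIndex<example.length()) {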
                    //print the word in spaces
                    System.out.println(example.substring(wordSpaceStartIndex, example.length()));
                }
            }
        }
    }

This also handles words that are not separated from a quoted section by a space, such as the "hello" before "John Smith" and the one after "Basi German".

When the string is changed to "John Smith" Ted Barry, the output is three strings: 1) "John Smith" 2) Ted 3) Barry

The string in the example is hello"John Smith" Ted Barry lol"Basi German"hello, which prints: 1) hello 2) "John Smith" 3) Ted 4) Barry 5) lol 6) "Basi German" 7) hello
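
If you would rather get the tokens back as a list with the quotes already removed, the same character-by-character scan can be written more compactly. This is only a sketch of the idea, not a drop-in replacement for the code above: the names `QuoteScanner` and `scan` are made up, only the space character counts as a separator (as in the original), and I have assumed that the text after an unterminated quote should be returned as a final token.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteScanner {
    // Hypothetical helper: splits on spaces, keeps quoted sections together,
    // and treats a quote as a word boundary even without a space next to it.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // A quote ends whatever token is in progress, quoted or not.
                if (inQuote || current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                inQuote = !inQuote;
            } else if (c == ' ' && !inQuote) {
                // A space outside quotes ends the current word, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the last word (or the tail of an unterminated quote).
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, John Smith, Ted, Barry, lol, Basi German, hello]
        System.out.println(scan("hello\"John Smith\" Ted Barry lol\"Basi German\"hello"));
    }
}
```

Collecting into a `StringBuilder` instead of tracking start and end indexes makes it easy to drop the quotes, at the cost of copying each character once.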

Hope it helps

Basilio German
  • 1
    This is the best code among all these. It can take care of Unicode input and does not generate empty strings when there are excessive spaces. It will keep everything inside quote intact (well, this can be a plus or minus). I think the code can be modified a bit to remove the quotes. Further expansion can be: add support for escaped quote. – nhahtdh May 22 '12 at 04:06
  • Sure, the quotes can be removed. I kept them on purpose, and I've added comments on where to remove them. – Basilio German May 22 '12 at 06:36
1

commons-lang has a StrTokenizer class that does this for you, and there is also the java-csv library.

Example with StrTokenizer:

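// commons-lang 2.x: org.apache.commons.lang.text.StrTokenizer (commons-lang3: org.apache.commons.lang3.text.StrTokenizer)
String params = "\"John Smith\" Ted Barry";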
// Initialize tokenizer with input string, delimiter character, quote character
StrTokenizer tokenizer = new StrTokenizer(params, ' ', '"');
for (String token : tokenizer.getTokenArray()) {
   System.out.println(token);
}

Output:

John Smith
Ted
Barry
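
If adding a dependency is not an option, something similar can be sketched with the JDK's own java.io.StreamTokenizer. Note that this is a different class from StrTokenizer and behaves differently in places: for example, it interprets backslash escapes inside quoted text, and a quoted section also ends at a line break. The class and method names below (`StreamSplit`, `split`) are made up for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StreamSplit {
    // Hypothetical helper built on java.io.StreamTokenizer.
    static List<String> split(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.resetSyntax();            // drop the defaults (number parsing, comment characters)
        st.wordChars(0x21, 0xFF);    // printable characters are part of a word
                                     // (OpenJDK already treats chars above 0xFF as word characters)
        st.whitespaceChars(0, ' ');  // space and control characters separate tokens
        st.quoteChar('"');           // a double-quoted run comes back as a single token

        List<String> tokens = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            // ttype is TT_WORD for plain words and the quote character for quoted text.
            if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '"') {
                tokens.add(st.sval);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // prints [John Smith, Ted, Barry]
        System.out.println(split("\"John Smith\" Ted Barry"));
    }
}
```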
Zoltán
Matt
1

This is my own version, cleaned up from http://pastebin.com/aZngu65y (posted in the comments). It can handle Unicode. It collapses all excessive spaces (even inside quotes) - this can be good or bad depending on the need. There is no support for escaped quotes.

// Requires: import java.util.Arrays;
private static String[] parse(String param) {
  String[] output;

  param = param.replaceAll("\"", " \" ").trim();
  String[] fragments = param.split("\\s+");

  int curr = 0;
  boolean matched = fragments[curr].matches("[^\"]*");
  if (matched) curr++;

  for (int i = 1; i < fragments.length; i++) {
    if (!matched)
      fragments[curr] = fragments[curr] + " " + fragments[i];

    if (!fragments[curr].matches("(\"[^\"]*\"|[^\"]*)"))
      matched = false;
    else {
      matched = true;

      if (fragments[curr].matches("\"[^\"]*\""))
        fragments[curr] = fragments[curr].substring(1, fragments[curr].length() - 1).trim();

      if (fragments[curr].length() != 0)
        curr++;

      if (i + 1 < fragments.length)
        fragments[curr] = fragments[i + 1];
    }
  }

  if (matched) { 
    return Arrays.copyOf(fragments, curr);
  }

  return null; // Parameter failure (double-quotes do not match up properly).
}

Sample input for comparison:

"sdfskjf" sdfjkhsd "hfrif ehref" "fksdfj sdkfj fkdsjf" sdf sfssd


asjdhj    sdf ffhj "fdsf   fsdjh"
日本語 中文 "Tiếng Việt" "English"
    dsfsd    
   sdf     " s dfs    fsd f   "  sd f   fs df  fdssf  "日本語 中文"
""   ""     ""
"   sdfsfds "   "f fsdf

(The 2nd line is empty, the 3rd line contains only spaces, and the last line is malformed.) Please judge against your own expected output, since it may vary, but the baseline is that the 1st case should return [sdfskjf, sdfjkhsd, hfrif ehref, fksdfj sdkfj fkdsjf, sdf, sfssd].

nhahtdh