Java parsing a string with lots of whitespace

Question

I have a string with multiple spaces, but when I use the tokenizer it breaks it apart at all of those spaces. I need the tokens to contain those spaces. How can I utilize the StringTokenizer to return the values with the tokens I am splitting on?

You should be fine if you're not using space-delimited data. If you are, good luck! Btw, it'd help if you gave us an example. — Edwin, Feb 14 '12 at 21:33
Please give an example of the string you're trying to tokenize and how you want the result to look. — matt freake, Feb 14 '12 at 21:37

Brian Roach · Answer 1 · 2012-02-14T22:35:18.007

2

You'll note that the docs for StringTokenizer recommend against using it in new code, and that String.split(regex) is what you want:

String foo = "this is      some  data      in   a string";
String[] bar = foo.split("\\s+");

Edit to add: Or, if you have greater needs than a simple split, then use the Pattern and Matcher classes for more complex regular expression matching and extracting.
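As a rough illustration of the Pattern and Matcher route, here is a minimal sketch that returns both the words and the whitespace runs between them. The class name `WhitespaceTokens`, the method `tokenize`, and the pattern `\s+|\S+` are assumptions chosen for this example, not something from the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceTokens {
    // Hypothetical pattern: a run of whitespace OR a run of non-whitespace.
    private static final Pattern RUN = Pattern.compile("\\s+|\\S+");

    // Returns every run in order, so the whitespace between words is kept.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Brackets make the width of each whitespace token visible.
        for (String t : tokenize("this is      some  data      in   a string")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

Because the matcher only looks at whitespace versus non-whitespace, punctuation stays attached to the word it belongs to.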

Edit again: If you want to preserve your space, actually knowing a bit about regular expressions really helps:

String[] bar = foo.split("\\b");

This will split on word boundaries, preserving the space between each word as a String:

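public class SplitOnWordBoundary
{
    public static void main( String[] args )
    {
        String foo = "this is      some  data      in   a string";
        // \b is zero-width, so no characters are consumed by the split:
        // each run of spaces comes back as its own array element.
        String[] bar = foo.split("\\b");
        for (String s : bar)
        {
            System.out.print(s);
            // Annotate the elements that consist only of whitespace.
            if (s.matches("^\\s+$"))
            {
                System.out.println("\t<< " + s.length() + " spaces");
            }
            else
            {
                System.out.println();
            }
        }
    }
}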

Output:

this
        << 1 spaces
is
        << 6 spaces
some
        << 2 spaces
data
        << 6 spaces
in
        << 3 spaces
a
        << 1 spaces
string

edited Feb 14 '12 at 22:35

answered Feb 14 '12 at 21:37

Brian Roach


This splits the string, but does *not* preserve whitespace. – Travis J Feb 14 '12 at 21:56
@TravisJ - the OP's question does not provide enough detail to provide a precise solution for his problem; I have no idea if he wants N strings with some of them being all the space between the words, or if he has "empty" columns represented by some amount of the space between words, etc. Also, see section marked "edited to add". – Brian Roach Feb 14 '12 at 21:59
If you cannot post an answer then perhaps you should abstain. I will provide a proper regex solution in an edited section. – Travis J Feb 14 '12 at 22:03
@TravisJ - Oh no, thank you; you encouraged me to provide the OP with an answer that was actually efficient and correct if that was his actual need. – Brian Roach Feb 16 '12 at 07:51
@Brian Roach - You may want to use efficient, and moreover correct, with more caution here. Using `\b` to separate the string on boundaries can have unintended effects when there are non-word characters present such as periods, dollar signs, accented letters, apostrophes, etc. Putting all these back together with logic would be very inefficient. – Travis J Feb 16 '12 at 22:22
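One possible way around the punctuation concern raised in the comment above is to split only where whitespace meets non-whitespace, rather than on `\b`. The following is a sketch under that assumption; the class name `SplitOnWhitespaceEdges` and the lookaround pattern are illustrative and do not come from either commenter.

```java
import java.util.Arrays;

class SplitOnWhitespaceEdges {
    // Zero-width split points between a non-whitespace character and a
    // whitespace character, in either order. Nothing is consumed.
    private static final String EDGE = "(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S)";

    static String[] tokenize(String input) {
        return input.split(EDGE);
    }

    public static void main(String[] args) {
        // Apostrophes, dollar signs and periods stay inside their token;
        // each run of spaces is returned as a single element.
        System.out.println(Arrays.toString(tokenize("it's   $5.00 today")));
    }
}
```

Since the pattern never looks at word characters, no reassembly step is needed for tokens such as `it's` or `$5.00`.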

score 1 · Answer 2 · edited Sep 17 '14 at 08:33

I think it would be good to first use the replaceAll function to replace each run of multiple spaces with a single space, and then tokenize using the split function.
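A minimal sketch of that two-step approach might look like the following. The class and method names are hypothetical, and the `trim()` call and the `\s+` pattern (any whitespace, not only spaces) are assumptions added for the example. Note that this discards the width of each whitespace run, so it only fits if the spacing itself does not need to be preserved.

```java
class CollapseThenSplit {
    static String[] tokens(String input) {
        // trim() removes leading/trailing whitespace so that split()
        // does not produce an empty first element.
        String collapsed = input.trim().replaceAll("\\s+", " ");
        // After collapsing, every separator is exactly one space.
        return collapsed.split(" ");
    }

    public static void main(String[] args) {
        for (String t : tokens("this is      some  data      in   a string")) {
            System.out.println(t);
        }
    }
}
```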

edited Sep 17 '14 at 08:33

Regent


answered Nov 04 '12 at 17:34

Sangeeta


score 1 · Answer 3 · answered Feb 14 '12 at 21:33

Sounds like you may need to use regular expressions (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/package-summary.html) instead of StringTokenizer.

answered Feb 14 '12 at 21:33

Jim Kiley


score 1 · Answer 4 · 2012-02-14T21:58:41.943

Use String.split("\\s+") instead of StringTokenizer.

Note that this extracts only the non-whitespace tokens, which must be separated by at least one whitespace character. If you want leading or trailing whitespace included with the non-whitespace characters, that calls for a completely different solution!

This requirement isn't clear from your original question, and there is an edit pending that tries to clarify it.
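If the requirement does turn out to be "keep the trailing whitespace attached to each token", one possible sketch is to split only at the point where whitespace ends and the next token begins. The class name `KeepTrailingWhitespace` and the lookaround pattern are assumptions made for illustration.

```java
class KeepTrailingWhitespace {
    // Split at the zero-width point where a whitespace character is
    // followed by a non-whitespace character. Nothing is consumed, so
    // each token keeps the whitespace that follows it.
    static String[] tokenize(String input) {
        return input.split("(?<=\\s)(?=\\S)");
    }

    public static void main(String[] args) {
        for (String t : tokenize("this is      some  data      in   a string")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

With this pattern, any whitespace at the very start of the input comes back as a separate first element rather than being attached to a word.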

StringTokenizer in almost every non-contrived case is the wrong tool for the job.
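For completeness, StringTokenizer can hand back the delimiters it splits on: the three-argument constructor takes a `returnDelims` flag, and when it is true each delimiter character is returned as a token of length one. The sketch below regroups those single spaces into runs; the class name, the method name, and the choice of a single space as the only delimiter are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerWithDelims {
    static List<String> tokenize(String input) {
        // returnDelims = true: every space comes back as its own token.
        StringTokenizer st = new StringTokenizer(input, " ", true);
        List<String> tokens = new ArrayList<>();
        StringBuilder spaces = new StringBuilder();
        while (st.hasMoreTokens()) {
            String t = st.nextToken();
            if (t.equals(" ")) {
                spaces.append(t);               // extend the current run
            } else {
                if (spaces.length() > 0) {      // flush the finished run
                    tokens.add(spaces.toString());
                    spaces.setLength(0);
                }
                tokens.add(t);
            }
        }
        if (spaces.length() > 0) {              // trailing spaces, if any
            tokens.add(spaces.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String t : tokenize("this is      some  data")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

The extra regrouping loop is the price of staying with StringTokenizer; the regex-based approaches get the same result with less code.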
