42

I'm processing a string which is tab delimited. I'm accomplishing this using the split function, and it works in most situations. The problem occurs when a field is missing, so instead of getting null in that field I get the next value. I'm storing the parsed values in a string array.

String[] columnDetail = new String[11];
columnDetail = column.split("\t");

Any help would be appreciated. If possible I'd like to store the parsed strings into a string array so that I can easily access the parsed data.

Cœur
lakhaman
  • So `field1\tfield2\t\tfield4` gives you field1,field2,field4 instead of field1,field2,[null],field4 ? – o.k.w Oct 28 '09 at 08:08
  • 3
    http://stackoverflow.com/questions/1630092/token-parsing-in-java/1630110 duplicate? This is what happens when you DON'T understand the answers and just copy the code. – Filip Ekberg Oct 28 '09 at 08:10
  • 2
    You don't need to allocate a new string array. `String.split` allocates a new one anyway. – Joey Oct 28 '09 at 08:10
  • @o.k.w Yes, actually I have an XML file which contains a tag, and I have to read its tab-separated value. – lakhaman Oct 28 '09 at 08:13
  • You need to understand What you are looking for and Why. Giving you working-code for your problem wont teach you anything, you will just end up asking the same question over and over again in different scenarios. – Filip Ekberg Oct 28 '09 at 10:56

7 Answers

92

String.split uses regular expressions. Also, you don't need to allocate an extra array for your split.

The split method returns an array. The problem is that you try to pre-define how many tab-separated fields there are, but how would you really know that? Try using the Scanner or StringTokenizer and just learn how splitting strings works.

Let me explain how \t and \\t behave in split, and why you need \\\\ to split on a backslash.

Okay, so when you use split, it actually takes a regex (regular expression). That means the string you write goes through two layers of escaping: first the Java compiler processes the string literal, then the regex processor processes the result. If you write "\t" in Java source, the compiler turns it into a single tab character, and the regex processor matches that tab literally. So splitting by "\t" does split on tabs.

You can also write:

\\t

Here the first \ escapes the second one for the compiler, so the regex processor receives the two characters \t and interprets them itself as a tab. Both forms give the same result.

Now let's say that you are looking to split on \ itself.

In a regex, \ means "escape the next character", so to match a literal backslash the regex processor must receive \\. Each of those two backslashes must in turn be escaped in the Java string literal, and therefore you need to write \\\\.

I really hope the examples above help you understand how escaping works in split and how to conquer other cases!
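
As a hedged check of how the two escaping layers interact, the sketch below splits the same tab-delimited string with "\t" and with "\\t" and compares the results, then splits on a literal backslash with "\\\\". The class name, method names and sample data are made up for illustration.

```java
import java.util.Arrays;

public class TabEscapeDemo {
    // Java escape: the compiler puts a real tab character into the pattern.
    static String[] splitJavaEscape(String line) {
        return line.split("\t");
    }

    // Regex escape: the regex engine receives backslash + 't' and reads it as a tab.
    static String[] splitRegexEscape(String line) {
        return line.split("\\t");
    }

    // A literal backslash needs escaping at both layers, hence four backslashes.
    static String[] splitOnBackslash(String text) {
        return text.split("\\\\");
    }

    public static void main(String[] args) {
        String line = "field1\tfield2\tfield4";
        // Both patterns produce the same array.
        System.out.println(Arrays.equals(splitJavaEscape(line), splitRegexEscape(line)));
        // The input below is the five characters a\b\c.
        System.out.println(Arrays.toString(splitOnBackslash("a\\b\\c")));
    }
}
```

Both tab patterns should give identical arrays, so the choice between them is a matter of taste.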

Now, I've given you this answer before, maybe you should start looking at them now.

OTHER METHODS

StringTokenizer

You should look into the StringTokenizer, it's a very handy tool for this type of work.

Example

 StringTokenizer st = new StringTokenizer("this is a test");
 while (st.hasMoreTokens()) {
     System.out.println(st.nextToken());
 }

This will output

 this
 is
 a
 test

You use the Second Constructor for StringTokenizer to set the delimiter:

StringTokenizer(String str, String delim)
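
A minimal sketch of that two-argument constructor with a tab delimiter. Be aware that StringTokenizer treats consecutive delimiters as one and never returns empty tokens, so a missing field silently disappears from the result. The class name, method name and sample line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    // Collects all tab-separated tokens of a line into a list.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "\t");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // The third field is empty; the tokenizer skips it entirely.
        System.out.println(tokens("field1\tfield2\t\tfield4"));
        // prints [field1, field2, field4]
    }
}
```

If you need to know which position was empty, this behaviour rules StringTokenizer out.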

Scanner

You could also use a Scanner, as one of the commenters said. It could look somewhat like this:

Example

 String input = "1 fish 2 fish red fish blue fish";

 Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");

 System.out.println(s.nextInt());
 System.out.println(s.nextInt());
 System.out.println(s.next());
 System.out.println(s.next());

 s.close(); 

The output would be

 1
 2
 red
 blue 

Meaning that it will cut out the word "fish" and give you the rest, using "fish" as the delimiter.

examples taken from the Java API

Community
Filip Ekberg
  • 3
    Regular expressions shouldn't bite you when splitting at tab, though. – Joey Oct 28 '09 at 08:11
  • 1
    Probably not, but if the OP would just try to read answers and understand them, he would already know the answer to this, because this is similar to what he posted yesterday. I would say that if he had used my method yesterday and today, he wouldn't have gotten this problem. – Filip Ekberg Oct 28 '09 at 08:13
  • I've added some more to clarify why it doesn't work to split by \t. hth. – Filip Ekberg Oct 28 '09 at 08:25
  • @Filip I have to parse an XML file which has a common header field and then multiple data fields, so if I use StringTokenizer I can't determine which field is null. Yesterday I raised a problem for a text file, while today it is for an XML file. That's why I have to use the split function. – lakhaman Oct 28 '09 at 10:15
  • 1
    You are looking at the problem totally wrong, or you are asking the wrong type of question. I would suggest that instead of involving parsers and stuff to read the XML, you just start simple. Please provide us with an example, and if there is no way for you to use the information provided by me (which I find doubtful), well, then there's not much I can do for you. – Filip Ekberg Oct 28 '09 at 10:55
  • 6
    The output is the same, if you use "\t" or "\\t", and I'm not sure why you went into using StringTokenizer and Scanner. Also, String.split is a lot simpler than the other two and per documentation "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code." – Turing Sep 24 '11 at 15:36
  • 2
    -1 - wrong info on "\t" or "\\t" (http://stackoverflow.com/a/3762377/281545) - please edit – Mr_and_Mrs_D Dec 05 '12 at 11:47
  • @Mr_and_Mrs_D, Care to be a bit more specific? – Filip Ekberg Dec 05 '12 at 21:00
  • @Mr_and_Mrs_D, Can you be more specific please? Maybe something changed since I first answered the question since it's been 3 years. So please, be more specific. The answer you linked doesn't directly indicate that my answer here is wrong. – Filip Ekberg Dec 06 '12 at 17:18
  • I was talking about the part of your answer that says that there is a difference between `\t` and `\\t` in `split` - sorry if I was unclear :) – Mr_and_Mrs_D Dec 06 '12 at 18:15
  • @Mr_and_Mrs_D, Haven't done Java in a while and I don't have an IDE available so I'll just have to take your word for it. It seems it was the correct answer at the time though. Feel free to edit the answer if you're certain that it's an error/out dated information in the answer – Filip Ekberg Dec 06 '12 at 19:18
  • they wont let me edit (just spelling edits) - just delete the part from `Let me explain Why \t does not work` up to `conquer other ones!` – Mr_and_Mrs_D Dec 06 '12 at 22:26
  • @Mr_and_Mrs_D, I just find it weird that it got 27 upvotes when I originally answered it. I would need to do some exploring before just removing a big portion of the answer. Anyways, the people coming here will now see the comments as well. – Filip Ekberg Dec 07 '12 at 06:48
  • I wonder if you thought that OP was trying to split on the String "\t" (a backslash followed by 't'), rather than the tab character. If "no", then the first section is wrong and I wonder if it ever was true. You don't need to apply double escapes for the tab character, a single one is fine. The regex itself doesn't need to see the String `\t` (which would explain the need for `\\t`); the actual tab char (after `\t` has been replaced by its corresponding char, byte 9) is enough. Letting the regex handle `\t`, thus providing two backslashes, works as well, but is not required. – Tom Jan 12 '18 at 10:06
24

Try this:

String[] columnDetail = column.split("\t", -1);

Read the Javadoc on String.split(java.lang.String, int) for an explanation about the limit parameter of split function:

split

public String[] split(String regex, int limit)
Splits this string around matches of the given regular expression.
The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

The string "boo:and:foo", for example, yields the following results with these parameters:

Regex   Limit   Result
:   2   { "boo", "and:foo" }
:   5   { "boo", "and", "foo" }
:   -2  { "boo", "and", "foo" }
o   5   { "b", "", ":and:f", "", "" }
o   -2  { "b", "", ":and:f", "", "" }
o   0   { "b", "", ":and:f" }

When the last few fields are missing (I guess that's your situation), you will get the column like this:

field1\tfield2\tfield3\t\t

If no limit is passed to split(), the limit defaults to 0, which means that "trailing empty strings will be discarded". So you get just 3 fields, {"field1", "field2", "field3"}.

When limit is set to -1, a non-positive value, trailing empty strings will not be discarded. So you can get 5 fields with the last two being empty string, {"field1", "field2", "field3", "", ""}.
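
The difference can be checked with a small sketch like the one below, which uses the same sample line as above. The class and method names are made up for illustration.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Default split: equivalent to limit 0, trailing empty strings are dropped.
    static String[] splitDefault(String line) {
        return line.split("\t");
    }

    // Negative limit: every field is kept, including trailing empty ones.
    static String[] splitKeepTrailing(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String column = "field1\tfield2\tfield3\t\t";
        System.out.println(splitDefault(column).length);      // prints 3
        System.out.println(splitKeepTrailing(column).length); // prints 5
        System.out.println(Arrays.toString(splitKeepTrailing(column)));
    }
}
```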

Parker
Happy3
  • @Happy3: you gave link to java1.4 doc. shouldn't we refer to more latest version? :) – nir Feb 18 '15 at 22:19
7

Well, nobody answered, which is in part the fault of the question: the input string contains eleven fields (this much can be inferred), but how many tabs? Most likely exactly 10. Then the answer is

String s = "\t2\t\t4\t5\t6\t\t8\t\t10\t";
String[] fields = s.split("\t", -1);  // in your case s.split("\t", 11) might also do
for (int i = 0; i < fields.length; ++i) {
    if ("".equals(fields[i])) fields[i] = null;
}
System.out.println(Arrays.asList(fields));
// [null, 2, null, 4, 5, 6, null, 8, null, 10, null]
// with s.split("\t") : [null, 2, null, 4, 5, 6, null, 8, null, 10]

If the fields happen to contain tabs this won't work as expected, of course.
The -1 means: apply the pattern as many times as needed, so trailing fields (the 11th) will be preserved. Absent fields come back as empty strings (""), which need to be turned into null explicitly.

If, on the other hand, there are no tabs for the missing fields (so "5\t6" is a valid input string containing only the fields 5 and 6), there is no way to map them to the right positions in fields[] via split.

Mr_and_Mrs_D
5

String.split implementations will have serious limitations if the data in a tab-delimited field itself contains newline, tab and possibly " characters.

TAB-delimited formats have been around for donkey's years, but the format is not standardised and varies. Many implementations don't escape characters (newlines and tabs) appearing within a field. Rather, they follow CSV conventions and wrap any non-trivial fields in "double quotes". Then they escape only double quotes. So a single record could extend over multiple lines.

Reading around I heard "just reuse apache tools", which sounds like good advice.

In the end I personally chose opencsv. I found it light-weight, and since it provides options for escape and quote characters it should cover most popular comma- and tab- delimited data formats.

Example:

CSVReader tabFormatReader = new CSVReader(new FileReader("yourfile.tsv"), '\t');
Luke Usherwood
2

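You can use yourstring.split("\\x09"); where \x09 is the regex hex escape for the tab character. The backslash has to be doubled because "\x09" on its own is not a valid escape in a Java string literal. I tested it, and it works.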

RickeyShao
1

I just had the same question and noticed the answer in some kind of tutorial. In general you need to use the second form of the split method, using the

split(regex, limit)

Here is the full tutorial http://www.rgagnon.com/javadetails/java-0438.html

If you set a negative number for the limit parameter, you will get empty strings in the array where the actual values are missing, including at the end of the line. For this to work, your initial string must still contain the delimiter for each missing value, i.e. you should have \t\t where a value is missing.

Hope this helps :)

Ivan Marinov
0
String[] columnDetail = column.split("\t", -1); // unlimited
OR
String[] columnDetail = column.split("\t", 11); // if you are sure about the limit

From the Javadoc of String.split:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.