1

I have a file with lines like:

string1 (tab) string2 (tab) string3 (tab) string4

I want to get string3 from every line. All I know about the lines is that string3 lies between the second and the third tab character. Is it possible to extract it with a pattern like

Pattern pat = Pattern.compile(".\t.\t.\t.");
tequilaras
  • 1
    What about tokenizer? You can read the line and then use a tokenizer to parse it from the tabs. Like StringTokenizer line = new StringTokenizer(myString, "\t"); line.nextToken() – Matt Nov 22 '11 at 12:44

3 Answers

6
String string3 = tempValue.split("\\t")[2];
erimerturk
  • 1
    I would *strongly* advise you to explicitly check the result of `split` rather than just blindly assuming there are at least three values. If you're definitely expecting *exactly* four values, I'd probably still want to see an error if there were more than four, and if there were fewer than three I'd definitely prefer an explicit exception which included information such as the line in question rather than just `ArrayIndexOutOfBoundsException`. – Jon Skeet Nov 22 '11 at 12:58
  • 1
    I know about this potential exception, but if he is sure about the input then of course he can use it. – erimerturk Nov 22 '11 at 13:02
5

It sounds like you just want:

for (String line : lines) {
    String[] bits = line.split("\t");
    if (bits.length != 4) {
        // Handle appropriately, probably throwing an exception
        // or at least logging and then ignoring the line (using a continue
        // statement)
    }
    String third = bits[2];
    // Use...
}

(You can write the pattern as "\\t" so that the regex engine itself parses the backslash-t as a tab, but you don't have to. With "\t" the pattern contains a literal tab character, which works fine.)
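As a rough illustration of the "handle appropriately" branch, here is a minimal sketch that throws an exception carrying the line number and the offending line. The class name `ThirdFieldExtractor`, the method `thirdField` and the sample data are made up for this example. It also passes a limit of -1 to `split`, which is an addition not in the code above: without it, `String.split` drops trailing empty strings, so a line whose fourth field is empty would be reported as having only three fields.

```java
import java.util.Arrays;
import java.util.List;

class ThirdFieldExtractor {
    private static final int EXPECTED_FIELDS = 4;

    /** Returns the third tab-separated field, or throws with the offending line included. */
    static String thirdField(String line, int lineNumber) {
        // Limit -1 keeps trailing empty fields, so "w\tx\ty\t" still yields 4 parts.
        String[] bits = line.split("\t", -1);
        if (bits.length != EXPECTED_FIELDS) {
            throw new IllegalArgumentException(
                "Line " + lineNumber + ": expected " + EXPECTED_FIELDS
                + " tab-separated fields but found " + bits.length
                + ": \"" + line + "\"");
        }
        return bits[2];
    }

    public static void main(String[] args) {
        // Hypothetical sample input; the second line has an empty fourth field.
        List<String> lines = Arrays.asList("a\tb\tc\td", "w\tx\ty\t");
        int lineNumber = 0;
        for (String line : lines) {
            lineNumber++;
            System.out.println(thirdField(line, lineNumber)); // prints "c", then "y"
        }
    }
}
```

Whether to throw, or to log and skip the line, depends on how much you trust the input. The point is only that the failure message names the line instead of surfacing as a bare `ArrayIndexOutOfBoundsException`.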

Another alternative to the built-in String.split method using a regex is the Guava Splitter class. Probably not necessary here, but worth being aware of.

EDIT: As noted in comments, if you're going to repeatedly use the same pattern, it's more efficient to compile a single Pattern and use Pattern.split:

private static final Pattern TAB_SPLITTER = Pattern.compile("\t");

...

String[] bits = TAB_SPLITTER.split(line);
Jon Skeet
  • You should first look at what Java's built-in libraries have to offer before going so low level on your own. Java's stuff is more optimised and more readable. In this case you have at least 3 or 4 one-liners that you can use to achieve this. – Shivan Dragon Nov 22 '11 at 12:53
  • 1
    @AndreiBodnarescu: In what way have I gone "so low level"? I've used `String.split` just like everyone else - the difference is that I've explicitly shown handling multiple lines *and* I've validated that there are the expected number of parts rather than just blindly taking the third element. The explicit validation allows for a more useful exception than `ArrayIndexOutOfBoundsException` - for example, it could include the line. Out of interest, did you check what my code was actually doing, or just assume that because it wasn't a single line, it was useless? – Jon Skeet Nov 22 '11 at 12:55
  • Well, OK, I feel you've kind of ganged up on me there. The way I see it, the person asking this question already knows about array sizes and what happens when you access an index outside of the array. He also knows about regex and how it works in Java. The only thing he was missing was the existence of the split(String regex) method on the String class, so I thought simply pointing that out would be OK. I also feel that in the (never demanded, in my opinion) extra code you've added, you go into too much detail to assert some things. Again, it's just my opinion. – Shivan Dragon Nov 22 '11 at 13:10
  • @AndreiBodnarescu: But my code is certainly not more "low level" than yours, and it clearly *does* use "what Java's built in libraries have to offer" - do you still believe your first comment stands, having looked at the code? And no, I haven't "ganged up" with anyone. Personally I don't think that validating data is going into too much detail - it's an important point which can easily be missed out if someone just follows answers which do it all in one statement with no explanation, warning, suggestions etc. – Jon Skeet Nov 22 '11 at 13:12
  • First off, I've already explained why I find your answer too long and too detailed/low-level for the question asked. If you don't agree with my comment/answer, feel free to flag me; that's why the button is there. Second, it's not you I was referring to when I said "ganged up" (unless you're a gang of one :) ), it's Erick Robertson, who seems to think it's his duty to keep the person asking away from the answer that person deems useful. – Shivan Dragon Nov 22 '11 at 13:17
  • 2
    @AndreiBodnarescu: Two people acting independently aren't really a gang, are they? And no, you still haven't explained why you feel that I should "first look at what Java's built in libraries have to offer" when I'm using the same method you are (`split`). What exactly were you suggesting I should look for that I wasn't obviously aware of? Which bit of Java is "optimized and more readable"? Your first comment *looks* like you hadn't spotted that I was already using `split`. If you're saying you *had* read my answer properly, it's a very confusing comment which I simply don't understand. – Jon Skeet Nov 22 '11 at 13:19
  • 1
    If you are executing the split multiple times (e.g. parsing a file line by line in a loop), then you should create a Pattern once and then use [Pattern.split](http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#split%28java.lang.CharSequence%29) rather than [String.split](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29). – ewan.chalmers Nov 23 '11 at 10:45
3

If you want a regex which captures the third field only and nothing else, you could use the following:

String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
Pattern pattern = Pattern.compile(regex);

Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
  System.err.println(matcher.group(1));
}

I don't know whether this would perform any better than split("\\t") for parsing a large file.

UPDATE

I was curious to see how the simple split versus the more explicit regex would perform, so I tested three different parser implementations.

/** Simple split parser */
static class SplitParser implements Parser {
    public String parse(String line) {
        String[] fields = line.split("\\t");
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Split parser, but with compiled pattern */
static class CompiledSplitParser implements Parser {
    private static final String regex = "\\t";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        String[] fields = pattern.split(line);
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Regex group parser */
static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        Matcher m = pattern.matcher(line);
        if (m.matches()) {
            return m.group(1);
        }
        return null;
    }
}

I ran each parser ten times against the same million-line file. Here are the average results:

  • split: 2768.8 ms
  • compiled split: 1041.5 ms
  • group regex: 1015.5 ms

The clear conclusion is that it is important to compile your pattern, rather than rely on String.split, if you are going to use it repeatedly.

The result on compiled split versus group regex is not conclusive based on this testing. And probably the regex could be tweaked further for performance.
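For anyone wanting to repeat a comparison like this, here is a minimal sketch of a timing harness. It is not the code used to produce the numbers above: the `Parser` interface is an assumed definition (the answer does not show it), and `ParserTiming`, `timeMillis` and the synthetic input are hypothetical. A dedicated benchmark tool such as JMH would likely give more trustworthy numbers, since a hand-rolled loop like this is sensitive to JIT warm-up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Assumed shape of the Parser interface; the original answer does not show it. */
interface Parser {
    String parse(String line);
}

class ParserTiming {
    /** Same logic as the compiled-split parser shown above. */
    static class CompiledSplitParser implements Parser {
        private static final Pattern TAB = Pattern.compile("\\t");

        public String parse(String line) {
            String[] fields = TAB.split(line);
            return fields.length == 4 ? fields[2] : null;
        }
    }

    /** Runs the parser over every line and returns the elapsed wall-clock time in ms. */
    static long timeMillis(Parser parser, List<String> lines) {
        long start = System.nanoTime();
        int parsed = 0;
        for (String line : lines) {
            if (parser.parse(line) != null) {
                parsed++; // use the result so the call is not trivially dead code
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(parsed + " lines parsed in " + elapsedMs + " ms");
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Synthetic input standing in for the real file.
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < 100_000; i++) {
            lines.add("a" + i + "\tb" + i + "\tc" + i + "\td" + i);
        }
        Parser parser = new CompiledSplitParser();
        timeMillis(parser, lines); // warm-up run, result ignored
        long total = 0;
        int runs = 10;
        for (int i = 0; i < runs; i++) {
            total += timeMillis(parser, lines);
        }
        System.out.println("average: " + (total / (double) runs) + " ms");
    }
}
```

Swapping in the other two parser classes for `CompiledSplitParser` gives the three-way comparison.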

UPDATE

A further simple optimization is to re-use the Matcher rather than create one per loop iteration.

static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    // Matcher is not thread-safe...
    private Matcher matcher = pattern.matcher("");

    // ... so this method is no longer thread-safe
    public String parse(String line) {
        matcher = matcher.reset(line);
        if (matcher.matches()) {
            return matcher.group(1);
        }
        return null;
    }
}
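If the parser does need to be called from several threads, one possible way to keep the Matcher re-use is to give each thread its own instance through a `ThreadLocal`. This is only a sketch of that idea, not something I have benchmarked. The class name `ThreadLocalRegexParser` is made up, and the regex is the same one used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadLocalRegexParser {
    private static final Pattern PATTERN = Pattern.compile(
        "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)");

    // Each thread lazily gets its own Matcher, so no instance is shared between threads.
    private static final ThreadLocal<Matcher> MATCHER = new ThreadLocal<Matcher>() {
        @Override
        protected Matcher initialValue() {
            return PATTERN.matcher("");
        }
    };

    /** Returns the third field if the line has exactly four tab-separated fields, else null. */
    static String parse(String line) {
        Matcher m = MATCHER.get().reset(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("a\tb\tc\td")); // prints "c"
    }
}
```

The `ThreadLocal` lookup has its own cost, so whether this beats simply creating a new Matcher per call would need measuring.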
ewan.chalmers