String tokenization in Java (LARGE text)

Question

I have this large text (read LARGE). I need to tokenize every word, delimit on every non-letter. I used StringTokenizer to read one word at a time. However, as I was researching how to write the delimiter string ("every non-letter") instead of doing something like:

new StringTokenizer(text, "\" ();,.'[]{}!?:”“…\n\r0123456789 [etc etc]");

I found that everyone basically hates StringTokenizer (why?).

So, what can I use instead? Don't suggest String.split, as it will duplicate my large text. I need to go through the text word by word and delimit on every non-letter. Is it easier to build something on my own, or is there a best-practice way to approach this problem?

Thanks in advance!

How large is your text, really? Does it fit in memory? – Basile Starynkevitch, Apr 07 '12 at 08:12

score 3 · Answer 1 · answered Apr 07 '12 at 08:19

As per the docs: "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead." That pretty much sums up the StringTokenizer hate.
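For the java.util.regex route that the docs mention, here is a minimal sketch, assuming the text is already in memory as a CharSequence. Instead of splitting on delimiters it matches runs of letters with \p{L}+ and hands each word to a callback, so no array of all tokens is built. The class name RegexWords and the method forEachWord are made up for illustration.

```java
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexWords {
    // \p{L}+ matches a run of one or more Unicode letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Hypothetical helper: calls action once per word, in order of appearance.
    static void forEachWord(CharSequence text, Consumer<String> action) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // group() copies only the current word, not the whole text.
            action.accept(m.group());
        }
    }

    public static void main(String[] args) {
        forEachWord("Hello, world! It's 2012 (really).", System.out::println);
    }
}
```

Note that an apostrophe is a non-letter, so "It's" comes out as the two words "It" and "s", which is what "delimit on every non-letter" asks for.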

If memory is really a concern, you can just iterate over the string character-by-character and substring between delimiters, do your processing, then move on.
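A rough sketch of that character-by-character loop, assuming "letter" means whatever Character.isLetter accepts. The names CharScanTokenizer and forEachWord are invented for this example, and the processing step is passed in as a callback.

```java
import java.util.function.Consumer;

public class CharScanTokenizer {
    // Hypothetical helper: walks the text once and emits each run of letters.
    // Works on UTF-16 chars, so letters outside the Basic Multilingual Plane
    // (stored as surrogate pairs) are treated as non-letters here.
    static void forEachWord(String text, Consumer<String> action) {
        int start = -1; // index where the current word began, or -1 if outside a word
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) {
                if (start < 0) {
                    start = i; // first letter of a new word
                }
            } else if (start >= 0) {
                action.accept(text.substring(start, i)); // word ended at a non-letter
                start = -1;
            }
        }
        if (start >= 0) {
            action.accept(text.substring(start)); // text ended while inside a word
        }
    }

    public static void main(String[] args) {
        forEachWord("ab,cd 12ef", System.out::println);
    }
}
```

Only one word is materialized at a time, so the extra memory is bounded by the longest word rather than by the size of the text.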

That is, build something on my own. Yeah, I guess that's what I'll have to do. – jelgh, Apr 07 '12 at 08:21

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

You can use the flexible Splitter class from Google's Guava library.

If you need something more powerful, have a look at StandardTokenizer from Apache Lucene. From the docs:

This should be a good tokenizer for most European-language documents:

Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.

Recognizes email addresses and internet hostnames as one token.

score 1 · Answer 3 · answered Apr 07 '12 at 08:36

If your grammar is complex and your file is large, you can consider using JavaCC.

When I'm in your situation I use it.

dash1e

score 0 · Answer 4 · answered Apr 07 '12 at 15:52

Scanner reads word by word (or line by line), and it can be used on a large file (or an input stream).

A regex Pattern can match whitespace and many other things (look at § where you can find something like \p{..}).
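As a sketch of how that could look, assuming every non-letter should act as a delimiter: \P{L}+ (capital P, the negation of \p{L}) is set as the Scanner delimiter, and the Scanner pulls from a Reader so the whole text never has to be held as one String. The class ScannerWords and the method forEachWord are names made up for this example.

```java
import java.io.StringReader;
import java.util.Scanner;
import java.util.function.Consumer;

public class ScannerWords {
    // Hypothetical helper: reads words from any Readable (file reader, stream reader, ...).
    static void forEachWord(Readable source, Consumer<String> action) {
        // Closing the Scanner also closes the source if it is Closeable.
        try (Scanner sc = new Scanner(source)) {
            // One or more non-letters in a row count as a single delimiter.
            sc.useDelimiter("\\P{L}+");
            while (sc.hasNext()) {
                String word = sc.next();
                if (!word.isEmpty()) { // defensive guard against empty tokens
                    action.accept(word);
                }
            }
        }
    }

    public static void main(String[] args) {
        // A StringReader stands in for a large file here.
        forEachWord(new StringReader("Hello, world! 42 times"), System.out::println);
    }
}
```

For a real file, a BufferedReader from java.nio.file.Files.newBufferedReader could be passed in place of the StringReader.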

cl-r

score -1 · Answer 5 · answered Apr 07 '12 at 08:27

I was never a fan of regex, but I can't see anything wrong with just using "[^a-zA-Z]" for the StringTokenizer.

josephus

The delimiter string in StringTokenizer isn't compiled as a regex. So it wouldn't work. – jelgh Apr 07 '12 at 08:30
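A small demo of that point, with the class name DelimiterSetDemo and the helper tokens made up for illustration. StringTokenizer treats each character of the delimiter string as a literal delimiter, so "[^a-zA-Z]" means the eight characters [ ^ a - z A Z ] and nothing else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterSetDemo {
    // Hypothetical helper: collects all tokens so they are easy to inspect.
    static List<String> tokens(String text, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // The space is not in the delimiter set, so nothing is split.
        System.out.println(tokens("one two", "[^a-zA-Z]"));      // prints [one two]
        // 'a' and '-' are in the delimiter set, so the word is cut at every 'a'.
        System.out.println(tokens("banana-split", "[^a-zA-Z]")); // prints [b, n, n, split]
    }
}
```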
