31

I have this text file that I read into a Java application and then count the words in it line by line. Right now I am splitting the lines into words with

String.split("[\\p{Punct}\\s+]")

But I know I am missing out on some words from the text file. For example, the word "can't" should be divided into two words "can" and "t".

Commas and other punctuation should be completely ignored and considered as whitespace. I have been trying to understand how to form a more precise Regular Expression to do this but I am a novice when it comes to this so I need some help.

What could be a better regex for the purpose I have described?

stema
Snorkelfarsan
  • I don't think you can easily do that using a regex. While you can solve the `can't` problem, you will face other problems, soon. See some interesting answers here (not really a duplicate of your question): http://stackoverflow.com/questions/6848869/how-i-count-the-words-and-expressions-in-a-text – Lukas Eder Sep 12 '11 at 08:18

5 Answers

30

You have one small mistake in your regex. Try this:

String[] Res = Text.split("[\\p{Punct}\\s]+");

In [\\p{Punct}\\s]+ the + is moved from inside the character class to the outside. Otherwise the + is just one more literal character in the class, every delimiter character is split on separately, and delimiters in a row are not combined into one split point.

So I get for this code

String Text = "But I know. For example, the word \"can\'t\" should";

String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s:Res){
    System.out.println(s);
}

this result

10
But
I
know
For
example
the
word
can
t
should

Which should meet your requirement.
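To make the effect of moving the + concrete, here is a minimal sketch that runs both patterns on the same short string. The class name QuantifierPlacementDemo, the method names, and the sample input are made up for illustration.

```java
import java.util.Arrays;

class QuantifierPlacementDemo {
    // + inside the class: every single delimiter character is its own split point
    static String[] splitPlusInside(String text) {
        return text.split("[\\p{Punct}\\s+]");
    }

    // + outside the class: a run of delimiter characters is one split point
    static String[] splitPlusOutside(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String text = "example, the word";
        System.out.println(Arrays.toString(splitPlusInside(text)));
        System.out.println(Arrays.toString(splitPlusOutside(text)));
    }
}
```

With the + inside the class, the comma and the following space are two separate split points, so an empty string appears between them and the array has four entries instead of three. In a word counter that empty string would be counted as a word.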

As an alternative you can use

String[] Res = Text.split("\\P{L}+");

\\P{L} matches any Unicode code point that does not have the property "Letter".
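The two patterns do differ once the text contains digits, because a digit is neither punctuation nor whitespace, but it is also not a letter. A small sketch of that difference follows; the class name LetterSplitDemo and the sample string are hypothetical.

```java
import java.util.Arrays;

class LetterSplitDemo {
    // Splits on runs of ASCII punctuation and whitespace; digits stay inside tokens
    static String[] splitOnPunctAndSpace(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    // Splits on runs of anything that is not a Unicode letter; digits act as delimiters
    static String[] splitOnNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        String text = "room42b is open";
        System.out.println(Arrays.toString(splitOnPunctAndSpace(text)));
        System.out.println(Arrays.toString(splitOnNonLetters(text)));
    }
}
```

The first call keeps room42b as a single token, while the second breaks it into room and b. Which behaviour is wanted depends on whether numbers should count as words in the text being processed.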

stema
  • P{L} gave me the same ouput as your previous suggestion. thanks though – Snorkelfarsan Sep 12 '11 at 09:02
  • @Snorkelfarsan Yes, for my test string it's also giving the same result. Maybe there are some corner cases where it covers characters other than whitespace and punctuation. At the moment I can't think of such a condition. – stema Sep 12 '11 at 09:10
  • I'm getting spaces before the word a few times in each sentence. Could you assist me with it? – Vitali Pom Nov 06 '13 at 11:29
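Regarding the last comment: no input was posted, so this is only a guess, but one possible cause of stray blank entries is a line that begins with a delimiter. String.split keeps a leading empty string in that case (trailing empty strings are dropped). A minimal sketch that filters such entries out, with a hypothetical class NonEmptyTokensDemo:

```java
import java.util.ArrayList;
import java.util.List;

class NonEmptyTokensDemo {
    // Splits on punctuation/whitespace runs and skips any empty tokens
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String token : line.split("[\\p{Punct}\\s]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The line starts with spaces and a quote, so split() yields a leading ""
        System.out.println(words("  \"Hello\" world"));
    }
}
```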
16

There's a predefined non-word character class, \W; see Pattern.

String line = "Hello! this is a line. It can't be hard to split into \"words\", can it?";
String[] words = line.split("\\W+");
for (String word : words) System.out.println(word);

gives

Hello
this
is
a
line
It
can
t
be
hard
to
split
into
words
can
it
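One caveat worth knowing: by default, \w in Java covers only ASCII letters, digits, and the underscore, so \W treats accented letters as delimiters. The sketch below compares the default behaviour with the Pattern.UNICODE_CHARACTER_CLASS flag; the class name UnicodeWordSplitDemo and the sample text are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class UnicodeWordSplitDemo {
    // With this flag, \w and \W follow Unicode definitions instead of ASCII-only
    private static final Pattern NON_WORD_UNICODE =
            Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    // Default: \W is anything outside [a-zA-Z_0-9]
    static String[] splitAscii(String line) {
        return line.split("\\W+");
    }

    static String[] splitUnicode(String line) {
        return NON_WORD_UNICODE.split(line);
    }

    public static void main(String[] args) {
        String line = "caf\u00e9 au lait"; // "café au lait"
        System.out.println(Arrays.toString(splitAscii(line)));
        System.out.println(Arrays.toString(splitUnicode(line)));
    }
}
```

With the default pattern the first token is cut down to caf, whereas the Unicode-aware pattern keeps café intact. For plain English text the two behave the same.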
Qwerky
0

If you come here from Kotlin: sentence.split(Regex("[\\p{Punct}\\s]+"))

0

Well, seeing that you want to count can't as two words, try

split("\\b\\w+?\\b")

http://www.regular-expressions.info/wordboundaries.html
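Note that String.split returns the text between the matches, so a pattern that matches the words themselves hands back the separators rather than the words. If the intent is to match words, one option is to iterate over the matches with Matcher.find instead. A plain greedy \w+ is enough for that, since a run of word characters already ends at a word boundary. This is a sketch, not the answer's original code; the class WordMatcherDemo and the method findWords are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcherDemo {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Collects every maximal run of word characters in the line
    static List<String> findWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("It can't be hard"));
    }
}
```

For the sample line this yields It, can, t, be, hard, so can't is counted as two words as requested.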

amal
  • Thank you for the timely reply. Does the regex denote something in the style of: split words within the boundary of a word with one or more surrounding punctuations? – Snorkelfarsan Sep 12 '11 at 08:29
  • Well, the regex translates to something like: one or more word characters surrounded by word boundaries. The ? denotes that a non-greedy (lazy) match is applied. – amal Sep 12 '11 at 08:35
0

Try:

line.split("[\\.,\\s!;?:\"]+");
or         "[\\.,\\s!;?:\"']+"

This is an "or" match on any one of these characters: . , ! ; ? : " ' and whitespace (note that whitespace is included, but / and \ are not). The + causes several of these characters in a row to be treated as one delimiter.

That should give you mostly sufficient accuracy. More precise regexes would need more information about the type of text you need to parse, because ' can be a word delimiter as well. Most punctuation that delimits words sits next to whitespace, so matching on [\\s]+ would be a close approximation as well (but it gives the wrong count on short quotations like: She said:"no".).
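The quotation example can be checked directly. The following sketch counts tokens for that sentence with a whitespace-only split and with the explicit character class from above; the class name DelimiterChoiceDemo and its method names are made up for illustration.

```java
class DelimiterChoiceDemo {
    // Whitespace only: punctuation stays glued to the neighbouring words
    static int countWhitespaceOnly(String line) {
        return line.split("\\s+").length;
    }

    // Explicit punctuation list plus whitespace, including the apostrophe
    static int countWithPunctuation(String line) {
        return line.split("[\\.,\\s!;?:\"']+").length;
    }

    public static void main(String[] args) {
        String line = "She said:\"no\".";
        System.out.println(countWhitespaceOnly(line));  // said:"no". stays one token
        System.out.println(countWithPunctuation(line)); // She, said, no
    }
}
```

The whitespace-only split reports two tokens for this sentence, while the punctuation-aware split reports three.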

Angelo Fuchs
  • Unfortunately that gave me even fewer results than [\\p{Punct}\\s]+ – Snorkelfarsan Sep 12 '11 at 08:54
  • After re-reading your initial post: I misread it and thought you wanted to read "can't" as one word instead of two. Try: "[\\.,\\s!;?:\"']+". – Angelo Fuchs Sep 12 '11 at 09:08
  • Follow-up question: in your initial post you use [\\p{Punct}\\s+], but here you write the + after the ]. Could you clarify what you expect for some lines, please? (e.g. I can't. She said:"No". He's[sic] the matter!) – Angelo Fuchs Sep 12 '11 at 09:17
  • I changed it to the "+" after "]". Anyhow, the main issue was not a lack of proper regex syntax, but a missing string.toLowerCase() call... The word counter didn't see I and i or The and the as the same word until I made all the input lowercase. Problem solved. Thank you! – Snorkelfarsan Sep 12 '11 at 09:32
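For readers with the same problem, the fix described in the last comment might look roughly like the sketch below: lowercase the input before splitting so that The and the land in the same bucket. The asker's actual code was not posted, so the class WordCountDemo, the method countWords, and the use of a TreeMap are assumptions for illustration only.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCountDemo {
    // Counts case-insensitive word frequencies; punctuation and whitespace are delimiters
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        String normalized = text.toLowerCase(Locale.ROOT);
        for (String word : normalized.split("[\\p{Punct}\\s]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("The cat and the hat. I think I can't."));
    }
}
```

Passing Locale.ROOT to toLowerCase avoids locale-specific surprises (for example with the Turkish dotless i); whether that matters depends on the environment the counter runs in.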