Splitting strings through regular expressions by punctuation and whitespace etc in java

Question

I have this text file that I read into a Java application and then count the words in it line by line. Right now I am splitting the lines into words by a

String.split("[\\p{Punct}\\s+]")

But I know I am missing out on some words from the text file. For example, the word "can't" should be divided into two words "can" and "t".

Commas and other punctuation should be completely ignored and considered as whitespace. I have been trying to understand how to form a more precise Regular Expression to do this but I am a novice when it comes to this so I need some help.

What could be a better regex for the purpose I have described?

I don't think you can easily do that using a regex. While you can solve the `can't` problem, you will face other problems, soon. See some interesting answers here (not really a duplicate of your question): http://stackoverflow.com/questions/6848869/how-i-count-the-words-and-expressions-in-a-text — Lukas Eder, Sep 12 '11 at 08:18

stema · Accepted Answer · 2013-06-12T08:59:18.383

30

You have one small mistake in your regex. Try this:

String[] Res = Text.split("[\\p{Punct}\\s]+");

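Move the + from inside the character class to the outside: [\\p{Punct}\\s]+. Inside the class, + is only a literal plus sign, so every delimiter character is matched on its own and consecutive delimiters (such as a comma followed by a space) are not combined into one split, which leaves empty strings in the result.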

So I get for this code

String Text = "But I know. For example, the word \"can\'t\" should";

String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s:Res){
    System.out.println(s);
}

this result

10
But
I
know
For
example
the
word
can
t
should

Which should meet your requirement.
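To see what the position of the + changes, here is a minimal sketch that runs both patterns on the same input. The class name QuantifierPlacement, its method names and the sample sentence are made up for illustration; only String.split and the two patterns come from this thread.

```java
import java.util.Arrays;

class QuantifierPlacement {
    // Pattern from the question: the + is inside the class, so it is a literal '+'
    // and each delimiter character produces its own split.
    static String[] splitPlusInside(String text) {
        return text.split("[\\p{Punct}\\s+]");
    }

    // Corrected pattern: the + quantifies the whole class, so a run of
    // delimiters (", " for example) is treated as a single separator.
    static String[] splitPlusOutside(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String text = "For example, the word";
        System.out.println(Arrays.toString(splitPlusInside(text)));
        System.out.println(Arrays.toString(splitPlusOutside(text)));
    }
}
```

The first call returns five elements, one of them the empty string between the comma and the space; the second returns the four words only.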

As an alternative you can use

String[] Res = Text.split("\\P{L}+");

\\P{L} matches any Unicode code point that does not have the property "Letter".
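One case where the two patterns behave differently is digits: neither \p{Punct} nor \s matches them, but \P{L} does, so numbers are swallowed by the separator. A small sketch under that assumption, with an invented class name and sample sentence:

```java
import java.util.Arrays;

class LetterSplit {
    // Splits on runs of ASCII punctuation and whitespace; digits stay in the tokens.
    static String[] byPunctAndSpace(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    // Splits on runs of anything that is not a Unicode letter; digits are separators here.
    static String[] byNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        String text = "Room 42, second floor";
        System.out.println(Arrays.toString(byPunctAndSpace(text)));
        System.out.println(Arrays.toString(byNonLetters(text)));
    }
}
```

With this input the first pattern keeps 42 as a token and the second drops it, so which one is "right" depends on whether numbers should count as words.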

edited Jun 12 '13 at 08:59

answered Sep 12 '11 at 08:31

stema


\\P{L} gave me the same output as your previous suggestion. Thanks though – Snorkelfarsan Sep 12 '11 at 09:02
@Snorkelfarsan Yes, for my test string it also gives the same result. Maybe there are some corner cases where it covers characters other than whitespace and punctuation. At the moment I can't think of such a case. – stema Sep 12 '11 at 09:10
I get spaces before the word a few times per sentence. Could you assist me with it? – Vitali Pom Nov 06 '13 at 11:29
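A possible cause of stray entries at the start of a sentence: when a line begins with a delimiter (an opening quote or leading indentation, for instance), String.split returns an empty string as the first element. Trailing empty strings are dropped by default, leading ones are not. The sketch below is one way to filter them out; the class and method names are invented for this example, and it assumes the stray entries really are empty strings.

```java
import java.util.ArrayList;
import java.util.List;

class NonEmptyTokens {
    // Splits on punctuation/whitespace runs and skips the empty token that
    // split() produces when the line starts with a delimiter.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String token : line.split("[\\p{Punct}\\s]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "\"Well,\" she said.";
        // The raw split has four elements, the first one empty.
        System.out.println(line.split("[\\p{Punct}\\s]+").length);
        System.out.println(words(line));
    }
}
```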

score 16 · Answer 2 · answered Sep 12 '11 at 08:25

There's a predefined character class for non-word characters, \W; see Pattern.

String line = "Hello! this is a line. It can't be hard to split into \"words\", can it?";
String[] words = line.split("\\W+");
for (String word : words) System.out.println(word);

gives

Hello
this
is
a
line
It
can
t
be
hard
to
split
into
words
can
it
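One caveat worth checking against your input: by default \w in Java covers only [a-zA-Z_0-9], so \W also splits at accented letters. The embedded flag (?U), which corresponds to Pattern.UNICODE_CHARACTER_CLASS, switches \w and \W to their Unicode definitions. A small sketch with an invented class name; the sample text uses Unicode escapes to stay independent of the source file encoding.

```java
import java.util.Arrays;

class NonWordSplit {
    // Default \W: everything outside [a-zA-Z_0-9] is a separator.
    static String[] asciiWords(String text) {
        return text.split("\\W+");
    }

    // (?U) enables UNICODE_CHARACTER_CLASS, so letters such as \u00E9 count as word characters.
    static String[] unicodeWords(String text) {
        return text.split("(?U)\\W+");
    }

    public static void main(String[] args) {
        String text = "na\u00EFve caf\u00E9"; // "naive cafe" with i-diaeresis and e-acute
        System.out.println(Arrays.toString(asciiWords(text)));
        System.out.println(Arrays.toString(unicodeWords(text)));
    }
}
```

The default pattern breaks the two words into na, ve and caf, while the (?U) variant keeps them whole.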

score 0 · Answer 3 · answered Sep 22 '21 at 02:26

If you come here from Kotlin: sentence.split(Regex("[\\p{Punct}\\s]+"))

answered Sep 22 '21 at 02:26

Jonathan Garcia Rey


score 0 · Answer 4 · answered Sep 12 '11 at 08:16

Well, seeing that you want to count can't as two words, try

split("\\b\\w+?\\b")

http://www.regular-expressions.info/wordboundaries.html
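Note that String.split treats its argument as the separator, so a pattern that matches the words themselves returns the text between the words rather than the words. A word-matching pattern like this one is normally applied with Matcher.find instead. A minimal sketch of that approach; the class name WordMatcher and the sample sentence are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcher {
    // One or more word characters between two word boundaries (lazy, as in the answer).
    private static final Pattern WORD = Pattern.compile("\\b\\w+?\\b");

    // Collects every match of the word pattern instead of splitting on it.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("I can't.")); // "can't" yields two matches: can and t
    }
}
```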

answered Sep 12 '11 at 08:16

amal


Thank you for the timely reply. Does the regex denote something in the style of: split words within the boundary of a word with one or more surrounding punctuations? – Snorkelfarsan Sep 12 '11 at 08:29
Well, the regex translates to something like: one or more word characters surrounded by word boundaries. The ? denotes that a non-greedy (lazy) match is applied. – amal Sep 12 '11 at 08:35

Angelo Fuchs · Answer 5 · 2014-05-19T17:57:08.213

0

Try:

line.split("[\\.,\\s!;?:\"]+");

or, to split on the apostrophe as well:

line.split("[\\.,\\s!;?:\"']+");

This character class matches any one of these characters: . , ! ; ? : " ' or whitespace (note that \s covers the space, and that neither / nor \ is in the class). The + causes several delimiter characters in a row to be counted as one.

That should give you sufficient accuracy for most text. More precise regexes would need more information about the type of text you need to parse, because ' can be a word delimiter as well. Most punctuation that delimits words sits next to whitespace, so matching on [\\s]+ would be a close approximation as well (but it gives the wrong count on short quotations like: She said:"no".).
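The short-quotation case can be checked directly. The sketch below compares the whitespace-only pattern with the listed-punctuation pattern on that example; the class and method names are made up for the comparison.

```java
import java.util.Arrays;

class DelimiterComparison {
    // Whitespace only: punctuation stays glued to the neighbouring word.
    static String[] byWhitespace(String text) {
        return text.split("\\s+");
    }

    // Listed punctuation plus whitespace, as in the second pattern above.
    static String[] byListedPunctuation(String text) {
        return text.split("[\\.,\\s!;?:\"']+");
    }

    public static void main(String[] args) {
        String text = "She said:\"no\".";
        System.out.println(Arrays.toString(byWhitespace(text)));
        System.out.println(Arrays.toString(byListedPunctuation(text)));
    }
}
```

The whitespace split finds two tokens, the second being said:"no". in one piece, while the punctuation class finds the three words.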

edited May 19 '14 at 17:57

answered Sep 12 '11 at 08:18

Angelo Fuchs


Unfortunately that gave me even fewer results than [\\p{Punct}\\s]+ – Snorkelfarsan Sep 12 '11 at 08:54
After re-reading your initial post: I misread it and thought you wanted to read "can't" as one word instead of two. Try: "[\\.,\\s!;?:\"']+". – Angelo Fuchs Sep 12 '11 at 09:08
Follow-up question: in your initial post you use [\\p{Punct}\\s+]; here you write the + after the ]. Could you clarify what you expect for some lines, please? (e.g. I can't. She said:"No". He's[sic] the matter!) – Angelo Fuchs Sep 12 '11 at 09:17
I changed it to the "+" after "]". Anyhow, the main issue was not a lack of proper regex syntax, but a missing string.toLowerCase() call... The word counter didn't see I and i or The and the as the same word until I made all the input lowercase. Problem solved. Thank you! – Snorkelfarsan Sep 12 '11 at 09:32
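For later readers, a rough sketch of how the lowercase step and the split can fit together in a line-by-line word counter. This is not the asker's actual code: the class name WordCounter, the use of a TreeMap and the sample lines are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCounter {
    // Counts words across all lines, treating "The" and "the" as the same word.
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Lowercase first, then split on runs of punctuation and whitespace.
            for (String token : line.toLowerCase(Locale.ROOT).split("[\\p{Punct}\\s]+")) {
                if (token.isEmpty()) {
                    continue; // leading delimiter or blank line
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("The cat can't see the dog.", "I think I can.");
        System.out.println(count(lines));
    }
}
```

Locale.ROOT is used so the lowercasing does not depend on the machine's default locale.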
