
I need to extract all the words from a given text, treating an apostrophe inside a word as part of that word (so can't is a single word).

String Para = "'hello' world '";

I am splitting the text using

String[] splits = Para.split("[^a-zA-Z']");

Expected output:

hello world

But it is giving:

'hello' world '

Everything else works, but the standalone apostrophe (') and the apostrophes around 'hello' are not removed by the above regex.

How can I filter out these two cases?

Hari Chaudhary
  • You want to split a `String` into words? Use `\\b` - this is the regex shortcut for "word boundary". It might well do the trick depending on your exact requirements. – Boris the Spider Feb 10 '14 at 08:23
  • @BoristheSpider: It can't take care of the case of `can't`. And by the way, what is your original string? I think the result should be correct with your code here. – nhahtdh Feb 10 '14 at 08:23
  • possible duplicate of [How do you use the Java word boundary with apostrophes?](http://stackoverflow.com/questions/4769652/how-do-you-use-the-java-word-boundary-with-apostrophes) – Boris the Spider Feb 10 '14 at 08:24
  • @BoristheSpider: Special characters are not getting filtered by this. – Hari Chaudhary Feb 10 '14 at 08:25
  • @HariChaudhary yes, I forgot about the broken nature of `\\b` in Java. See the duplicate I linked. – Boris the Spider Feb 10 '14 at 08:28
  • @nhahtdh "'hello' world" , my code gives 'hello' world – Hari Chaudhary Feb 10 '14 at 08:28
  • @HariChaudhary: What is your expected result, then? (Again, please edit your question with a clear example, the expected result and what you actually get). – nhahtdh Feb 10 '14 at 08:29
  • @nhahtdh presumably if the apostrophe is _inside_ a word it should be treated as part of that word, and if it is _outside_ it should be treated as a boundary. – Boris the Spider Feb 10 '14 at 08:30
  • @BoristheSpider: Then the duplicate question most probably doesn't have an answer, then. – nhahtdh Feb 10 '14 at 08:32
  • @nhahtdh It in fact does. – Boris the Spider Feb 10 '14 at 08:33
  • @BoristheSpider: The problem is completely different in fact. The other one tries to search a string. This question is about splitting. (And by the way `\b` in Java is no longer broken when used with `(?U)` flag in Java 7, but by definition, it will split at `'`). – nhahtdh Feb 10 '14 at 08:36

3 Answers


As far as I can tell, you're looking for a ' where either the next or previous character is not a letter.

The regex I came up with to do this, contained in some test code:

// requires: import java.util.Arrays;
String str = "bob can't do 'well'";
String[] splits = str.split("(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");
System.out.println(Arrays.toString(splits)); // prints [bob, can't, do, well]

Explanation:

(?<=^|[^a-zA-Z])' - matches a ' where the previous character is not a letter, or we're at the start of the string.
'(?=[^a-zA-Z]|$) - matches a ' where the next character is not a letter, or we're at the end of the string.
[^a-zA-Z'] - not a letter or '.
(?:...)+ - one or more of any of the above (the ?: is just to make it a non-capturing group).

See this for more on regex lookaround ((?<=...) and (?=...)).
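One thing the test string above does not exercise is an input that starts with an apostrophe, as in the question. A minimal sketch, assuming you want no empty entries in the result: the class name `ApostropheSplit` and the helper `splitWords` are hypothetical names chosen for illustration. Because the separator match at the start of the string has non-zero width, `split` returns a leading empty string there, so the helper filters empty tokens out.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class ApostropheSplit {
    // Separator: an apostrophe that lacks a letter on at least one side,
    // or any character that is neither a letter nor an apostrophe,
    // repeated one or more times.
    private static final Pattern SEPARATOR =
            Pattern.compile("(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");

    // Hypothetical helper: splits on the separator and drops empty tokens,
    // such as the leading one produced when the text starts with a separator.
    static String[] splitWords(String text) {
        return Arrays.stream(SEPARATOR.split(text))
                .filter(token -> !token.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("bob can't do 'well'"))); // [bob, can't, do, well]
        System.out.println(Arrays.toString(splitWords("'hello' world '")));     // [hello, world]
    }
}
```

Without the filter, the second call would return an empty string as its first element, followed by hello and world.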

Simplification:

The regex can be simplified to the below by using negative lookaround:

"(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+"
Bernhard Barker

A Unicode version, without lookarounds:

String TestInput = "This voilà München is the test' 'sentence' that I'm willing to split";

String[] splits = TestInput.split("'?[^\\p{L}']+'?");

for (String t : splits) {
    System.out.println(t);
}

\p{L} matches any character with the Unicode property "Letter".

This splits on a sequence of characters that are neither letters nor ', and also consumes a ' directly before or after that sequence.

Output:

This
voilà
München
is
the
test
sentence
that
I'm
willing
to
split

To handle a ' at the very start or end of the input, add those two cases as alternatives:

TestInput.split("'?[^\\p{L}']+'?|^'|'$")
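As the comments on this answer point out, the extended pattern can leave empty entries in the array. A minimal sketch of one way to deal with that, assuming empty tokens are never wanted: the class `UnicodeWordSplit` and the method `words` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class UnicodeWordSplit {
    // Hypothetical helper: splits with the extended pattern from this answer
    // and skips the empty tokens that appear when the input starts with '
    // or when two separator matches are directly adjacent.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.split("'?[^\\p{L}']+'?|^'|'$")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("'hello' world ' can't")); // [hello, world, can't]
    }
}
```

For this input the raw `split` result contains two empty strings: one before hello, caused by the leading ', and one between world and can't, where the standalone ' sits between two spaces.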
stema
  • It fails for "'hello' world ' can't" (Leading ' in the sentence)? – aNish Feb 10 '14 at 09:28
  • @aNish, yes it doesn't handle those cases. I added a suggestion to my answer. This will add an empty entry in the resulting array. – stema Feb 10 '14 at 09:41

If you define a word as a sequence that:

  • Must start and end with an English letter (a-zA-Z)
  • May contain apostrophes (') in between.

Then you can use the following regex in a Matcher.find() loop to extract the matches:

[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?

Sample code:

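// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String inputString = "'hello' world '"; // sample input from the question
Pattern p = Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");
Matcher m = p.matcher(inputString);

// Each successful find() yields one word: hello, then world
while (m.find()) {
    System.out.println(m.group());
}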

Demo1

1 The demo uses the PCRE regex flavor, but the result should be the same in Java for this regex.
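If you need the words as a collection rather than printed output, the loop can be wrapped in a small method. This is only a sketch: `WordExtractor` and `extractWords` are hypothetical names, and the pattern is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // A word starts and ends with an English letter and may contain
    // apostrophes in between.
    private static final Pattern WORD =
            Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");

    // Hypothetical helper: collects every match of WORD, in order.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("'hello' world ' can't")); // [hello, world, can't]
    }
}
```

Because this approach matches the words themselves instead of the separators, it never produces empty entries, whatever the input starts or ends with.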

nhahtdh