
I need clean text containing only words, with all digits, extra spaces, dashes, commas, dots, brackets, etc. removed. It will be used for a word generation algorithm (taken from Gamasutra). I suppose a regular expression can help here. How can I do this with String.split?

UPD:

Input: I have 1337 such a string with different stuff in it: commas, many spaces, digits - 2 3 4, dashes. How can I remove all stuff?

Output: I have such a string with different stuff in it commas many spaces digits dashes How can I remove all stuff

vladfau
  • possible duplicate of [Splitting strings through regular expressions by punctuation and whitespace etc in java](http://stackoverflow.com/questions/7384791/splitting-strings-through-regular-expressions-by-punctuation-and-whitespace-etc) – xlecoustillier Jun 12 '13 at 08:36
  • Please add an example with input text and expected output text. – pepuch Jun 12 '13 at 08:38

3 Answers


In two steps you could do:

String s = "asd asd   asd.asd, asd";
String clean = s.replaceAll("[\\d[^\\w\\s]]+", " ").replaceAll("(\\s{2,})", " ");
System.out.println(clean);

The first step replaces every run of characters that are neither letters nor whitespace with a single space (underscores are kept, because \w matches them). The second step collapses each run of two or more whitespace characters into a single space.

Output:

asd asd asd asd asd
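As a quick check, here is a minimal sketch that runs the same two replacements on the input from the question. The class name CleanText and the helper clean are made up for illustration, and the trailing trim() is an addition that drops the space left behind by the final question mark.

```java
public class CleanText {
    // Hypothetical helper: same two replaceAll steps as above, plus trim().
    static String clean(String s) {
        return s.replaceAll("[\\d[^\\w\\s]]+", " ")  // digits and punctuation -> space
                .replaceAll("\\s{2,}", " ")          // collapse whitespace runs
                .trim();                             // drop leading/trailing space
    }

    public static void main(String[] args) {
        String input = "I have 1337 such a string with different stuff in it: "
                + "commas, many spaces, digits - 2 3 4, dashes. How can I remove all stuff?";
        // Prints: I have such a string with different stuff in it commas many spaces digits dashes How can I remove all stuff
        System.out.println(clean(input));
    }
}
```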


If all you need is an array containing the words, then this would be enough:

String[] words = s.trim().split("[\\W\\d]+");
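
One thing to watch for: trim() only strips whitespace, so if the text starts with a digit or punctuation, split returns an empty string as the first element. A small sketch of one way to guard against that, assuming Java 8 streams are available (SplitWords and words are illustrative names, not part of any library):

```java
import java.util.Arrays;

public class SplitWords {
    // Split on runs of non-word characters and digits, then drop empty tokens.
    static String[] words(String s) {
        return Arrays.stream(s.split("[\\W\\d]+"))
                .filter(w -> !w.isEmpty())   // removes the empty leading token, if any
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // The leading "1337 " would otherwise produce an empty first element.
        System.out.println(Arrays.toString(words("1337 leet, words!"))); // [leet, words]
    }
}
```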
assylias

If you care about Unicode (you should), then use Unicode properties.

String[] result = s.split("\\P{L}+");

\p{L} is the Unicode property for a letter in any language.

\P{L} is the negation of \p{L}, meaning it matches everything that is not a letter. (As I understand it, that is what you want.)
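
To illustrate the difference, here is a small sketch comparing this split with an ASCII-based one. By default \w and \W in Java only know ASCII letters, so accented or Cyrillic letters are treated as separators. The class name UnicodeWords, the helper letters, and the sample string are assumptions for the example; the non-ASCII letters are written as \u escapes so the source compiles regardless of file encoding.

```java
import java.util.Arrays;

public class UnicodeWords {
    // Split on runs of non-letters (Unicode-aware) and drop empty tokens.
    static String[] letters(String s) {
        return Arrays.stream(s.split("\\P{L}+"))
                .filter(w -> !w.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // "café 42, naïve мир!" with the non-ASCII letters escaped
        String s = "caf\u00E9 42, na\u00EFve \u043C\u0438\u0440!";

        // Unicode-aware: keeps the three words intact
        System.out.println(Arrays.toString(letters(s)));

        // ASCII-based: cuts words at the accented letters and drops the Cyrillic word
        System.out.println(Arrays.toString(s.split("[\\W\\d]+"))); // [caf, na, ve]
    }
}
```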

stema
  • This perfectly fits in our scenario where we required an accurate length of strings coming from WordPress (via GraphQL). They were showing a different length (`string.length`, usually +1 than the real length) due to the presence of non-Unicode characters and this helped to purge those. – KeshavDulal Dec 15 '21 at 06:31

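I would do it this way:

    // Remove punctuation and digits first; collapsing whitespace before this
    // step would leave double spaces wherever a digit or dash stood alone.
    str = str.replaceAll("\\p{Punct}|\\d", "");
    str = str.replaceAll("\\s+", " ").trim();
    String[] words = str.split(" ");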
Evgeniy Dorofeev