Find all keywords of a text file that have at least one letter by using regular expression

Question

I want to write a regular expression that removes all tokens of a text file that do not contain at least one letter. I use the OpenNLP tokenizer to extract the tokens from my text file. For instance, the tokens 90-87, 65@7, ---, 8/0, and ? should be removed from the text.

I tried to follow these pages (1, 2 and 3), but I could not find the expression that I want. For example, the following code also removes tokens such as anti-age and mid-november.

String[] tokens = t.getTokens(sen);

for (String word : tokens) 
    if((!isstopWord(word)) && word.matches("[a-zA-Z]+"))
          bufferedw.append(word+"\n");

However, I do not know how to keep tokens like anti-age from being removed.

Where is the problem?

When you say *"at least one character"* do you mean "at least one **letter**"? Because `9`, `-`, `@`, and `/` are all Unicode characters too. — Andreas, Feb 18 '16 at 18:35
Instead of an example, please specify what kind of tokens you need to keep, i.e. formulate the requirements. BTW, perhaps, you are looking for `word.matches("\\S*\\pL+\\S*")`. — Wiktor Stribiżew, Feb 18 '16 at 18:37
@WiktorStribiżew I want tokens which have at least one letter. — Suri, Feb 18 '16 at 18:40
@WiktorStribiżew do you mean I should change my regular expression to this one: `matches("\\S*\\pL+\\S*")`? — Suri, Feb 18 '16 at 18:44
Well, `matches("\\S*\\pL\\S*")` is enough. `\pL` matches any Unicode letter. `\S` matches a non-whitespace character. — Wiktor Stribiżew, Feb 18 '16 at 18:44
@Arkadiy it is three tokens: a, +, b. I use OpenNLP for tokenizing my text. — Suri, Feb 18 '16 at 18:45

score 2 · Accepted Answer · answered Feb 18 '16 at 18:46

The [a-zA-Z]+ expression matches a string that only consists of one or more ASCII letters. It does not allow hyphens, apostrophes, etc.
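The reason is that String.matches only succeeds when the entire string matches the pattern, so a single hyphen is enough to reject the token. A minimal sketch of that behavior follows; the class name FullMatchDemo, the helper onlyAsciiLetters, and the sample tokens are made up for illustration and are not part of the question's code:

```java
class FullMatchDemo {
    // String.matches is implicitly anchored: the whole token must match,
    // so any character outside [a-zA-Z] makes the result false.
    static boolean onlyAsciiLetters(String token) {
        return token.matches("[a-zA-Z]+");
    }

    public static void main(String[] args) {
        String[] samples = {"age", "anti-age", "mid-november", "90-87"};
        for (String token : samples) {
            System.out.println(token + " -> " + onlyAsciiLetters(token));
        }
        // prints:
        // age -> true
        // anti-age -> false
        // mid-november -> false
        // 90-87 -> false
    }
}
```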

To match a string containing no spaces and at least one letter, you can use

word.matches("\\S*\\pL\\S*")

See IDEONE demo

The \S* pattern matches zero or more non-whitespace characters and \pL matches any Unicode letter.
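As a rough sketch of how this pattern could be applied to a token array, the example below filters the tokens mentioned in the question. The class LetterTokenFilter and its methods are hypothetical names chosen for illustration, the token array is hard-coded instead of coming from OpenNLP, and the stop-word check from the question is left out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LetterTokenFilter {
    // Zero or more non-whitespace chars, one Unicode letter,
    // then zero or more non-whitespace chars.
    private static final Pattern HAS_LETTER = Pattern.compile("\\S*\\pL\\S*");

    // True if the token has no whitespace and contains at least one letter.
    static boolean hasLetter(String token) {
        return HAS_LETTER.matcher(token).matches();
    }

    // Returns the tokens that contain at least one letter, in original order.
    static List<String> keepTokensWithLetters(String[] tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (hasLetter(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample tokens taken from the question, plus one non-ASCII word
        // (written with Unicode escapes) to exercise \pL.
        String[] tokens = {
            "anti-age", "mid-november", "90-87", "65@7", "---", "8/0", "?", "caf\u00e9"
        };
        for (String token : keepTokensWithLetters(tokens)) {
            System.out.println(token);
        }
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which is what word.matches(...) does internally; for a small file either form should be fine.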
