I want to write a regular expression that removes every token of a text file that does not contain at least one letter. I used the OpenNLP tokenizer to extract the tokens of my text file. For instance, the tokens 90-87, 65@7, ---, 8/0 and ? should be removed from the text.

I tried to follow these pages 1, 2 and 3, but I could not find the expression that I want. For example, the following code also removes tokens such as anti-age and mid-november.

String[] tokens = t.getTokens(sen);

for (String word : tokens) 
    if((!isstopWord(word)) && word.matches("[a-zA-Z]+"))
          bufferedw.append(word+"\n");

However, I do not know how to keep tokens like anti-age from being removed.

Where is the problem?

Suri

1 Answer

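String.matches() requires the whole string to match the pattern, so the [a-zA-Z]+ expression only accepts a string that consists entirely of one or more ASCII letters. It does not allow hyphens, apostrophes, etc., which is why anti-age is rejected.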
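A small sketch of that behavior, for illustration only. The class and method names (MatchesDemo, lettersOnly, lettersOrHyphens) are made up for this example. It also shows why simply adding the hyphen to the character class is probably not what you want: a token made only of hyphens would then pass.

```java
class MatchesDemo {
    // The check from the question: the whole token must be ASCII letters.
    static boolean lettersOnly(String token) {
        return token.matches("[a-zA-Z]+");
    }

    // Hypothetical "fix" that merely widens the class with a hyphen.
    static boolean lettersOrHyphens(String token) {
        return token.matches("[a-zA-Z-]+");
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("antiage"));       // true
        System.out.println(lettersOnly("anti-age"));      // false: '-' is not a letter
        System.out.println(lettersOrHyphens("anti-age")); // true
        System.out.println(lettersOrHyphens("---"));      // true: no letter required
    }
}
```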

To match a string containing no spaces and at least one letter, you can use

word.matches("\\S*\\pL\\S*")

See IDEONE demo

The \S* pattern matches zero or more non-whitespace characters and \pL matches any Unicode letter.
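Applied to the sample tokens from the question, the check could be sketched as follows. This is a minimal example, not a drop-in replacement for the original loop: the TokenFilter class and its method names are assumptions, the stop-word test is left out, and the pattern is precompiled only to avoid recompiling it for every token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TokenFilter {
    // Same pattern as above, compiled once and reused for every token.
    private static final Pattern HAS_LETTER = Pattern.compile("\\S*\\pL\\S*");

    // True if the token has no whitespace and at least one Unicode letter.
    static boolean hasLetter(String token) {
        return HAS_LETTER.matcher(token).matches();
    }

    // Returns only the tokens that contain at least one letter.
    static List<String> keepTokensWithLetters(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (hasLetter(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "anti-age", "mid-november", "90-87", "65@7", "---", "8/0", "?");
        System.out.println(keepTokensWithLetters(tokens)); // [anti-age, mid-november]
    }
}
```

In the original loop, the same check would replace word.matches("[a-zA-Z]+") in the if condition.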

Wiktor Stribiżew