regex program alteration excluding whitespace

Question

I have a regex which finds strings that contain exactly one occurrence of a character, say P. This works when matching against a string that contains no whitespace

e.g.

APAXA

The regex being ^[^P]*P[^P]*$

It picks this string out fine. However, what if I have a string such as

XPA  DREP EDS

What would be the regex to identify all strings in one line that match the condition (strings are always separated by some kind of whitespace: tab, space, etc.)?

e.g. how would I highlight XPA and DREP?

I am using while(m.find()) to loop multiple times and System.out.println(m.group())

so m.group has to contain the entire string.

What type of data is this? Just uppercase ASCII letters and ASCII spaces only? — tchrist, Jan 20 '11 at 14:39

score 2 · Accepted Answer · answered Jan 20 '11 at 13:59

Split it by whitespace and then check each token against your existing regex.
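
A minimal sketch of this approach, assuming the question's original pattern is kept unchanged and applied to each token separately. The class and method names are illustrative only and do not come from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitThenMatch {
    // The question's pattern: the whole token must contain exactly one 'P'.
    private static final Pattern ONE_P = Pattern.compile("^[^P]*P[^P]*$");

    // Splits the line on runs of whitespace and keeps the tokens that match.
    static List<String> tokensWithOneP(String line) {
        List<String> hits = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (ONE_P.matcher(token).matches()) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(tokensWithOneP("XPA  DREP EDS")); // prints [XPA, DREP]
    }
}
```

Because each token is matched on its own, the ^ and $ anchors keep working and no whitespace handling is needed inside the regex.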

jzd

That won't find `DREP` as the whitespace is part of the match condition. – Tim Pietzcker Jan 20 '11 at 14:07

score 1 · Answer 2 · answered Jan 20 '11 at 14:37

Why must it be an overly complicated regex?

String string = "XPA  DREP EDS";
String[] s = string.split("\\s+");
for( String str: s){
  if ( str.contains("P") ){
     System.out.println( str );
  }
}
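
Note that contains("P") accepts any token with at least one P. If the requirement is exactly one P, one possible variant (a sketch, with a hypothetical helper name) compares the first and last index of the character:

```java
public class ExactlyOneP {
    // True when c occurs in word exactly once:
    // it must be present, and its first and last positions must coincide.
    static boolean hasExactlyOne(String word, char c) {
        int first = word.indexOf(c);
        return first >= 0 && first == word.lastIndexOf(c);
    }

    public static void main(String[] args) {
        for (String str : "XPA  DREP EDS APBPC".split("\\s+")) {
            if (hasExactlyOne(str, 'P')) {
                System.out.println(str); // prints XPA, then DREP
            }
        }
    }
}
```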

ghostdog74

score 0 · Answer 3 · answered Jan 20 '11 at 14:03

You can try using the \s pattern (matches whitespace). Look at this regexp page for Java.

hellatan

You mean match ASCII whitespace, as opposed to Unicode whitespace. – tchrist Jan 20 '11 at 14:37

score 0 · Answer 4 · edited May 23 '17 at 12:26

\b[^P\s]*P[^P\s]*\b

will match all words that contain exactly one P. Don't forget to double the backslashes when constructing your regex from a Java string.

Explanation:

\b      # Assert position at start/end of a word
[^P\s]* # Match any number of characters except P and whitespace
P       # Match a P
[^P\s]* # Match any number of characters except P and whitespace
\b      # Assert position at start/end of a word
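
As a rough illustration of how this fits the while(m.find()) loop from the question, here is a small sketch with the backslashes doubled for a Java string literal. It assumes plain ASCII input as in the question's example; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Same regex as above; every backslash is doubled inside the string literal.
    private static final Pattern ONE_P_WORD =
            Pattern.compile("\\b[^P\\s]*P[^P\\s]*\\b");

    // Collects every match so that m.group() is always a whole word.
    static List<String> findWords(String line) {
        List<String> hits = new ArrayList<>();
        Matcher m = ONE_P_WORD.matcher(line);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findWords("XPA  DREP EDS")); // prints [XPA, DREP]
    }
}
```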

Please note that \b doesn't match all word boundaries correctly when dealing with Unicode strings (thanks tchrist for reminding me). If that is the case for you, you might want to replace the \bs with (don't look):

(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

(taken from this question's winning answer)

answered Jan 20 '11 at 14:08

Tim Pietzcker

It does here. It *is* supposed to match that word, isn't it? – Tim Pietzcker Jan 20 '11 at 14:15
Ah - you want the entire word to contain only one P? I get it. – Tim Pietzcker Jan 20 '11 at 14:25
No, it's supposed to match any word containing a single P. Sorry, maybe I didn't mention that... :S – dr85 Jan 20 '11 at 14:28
That’s another ASCII-only pattern. It does not work “properly” on full Unicode, Java’s native character set. I strongly suggest some sort of commenting about that restriction. – tchrist Jan 20 '11 at 14:35

score 0 · Answer 5 · answered Jan 20 '11 at 14:13

The regex being ^[^P]*P[^P]*$

Such a regex finds only strings containing exactly one P, which may or may not be what you want. I suppose you want .*P.* instead.

For finding all words containing at least one P you can use \\S*P\\S*, where \S stands for a non-whitespace character. (With + instead of *, a word that starts or ends with P, such as DREP, would not match.) You may consider \w instead.

For finding all words containing exactly one P you can use (?<!\\S)[^\\sP]*P[^\\sP]*(?!\\S), which is more complicated. Here, \s stands for a whitespace character, [^abc] matches everything except a, b, and c, and (?<!...) and (?!...) are negative lookbehind and negative lookahead. Without the lookarounds, you'd find in "APBPC" two "words": "APB" and "PC".
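
One way to keep an "exactly one P" match from starting or ending in the middle of a token is to guard both ends with negative lookarounds on \S. The following is a sketch under that assumption, not necessarily the pattern the answer intended; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // (?<!\S): no non-whitespace character directly before the match.
    // (?!\S):  no non-whitespace character directly after the match.
    private static final Pattern ONE_P_TOKEN =
            Pattern.compile("(?<!\\S)[^\\sP]*P[^\\sP]*(?!\\S)");

    static List<String> findTokens(String line) {
        List<String> hits = new ArrayList<>();
        Matcher m = ONE_P_TOKEN.matcher(line);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // APBPC has two P's, so no part of it is reported.
        System.out.println(findTokens("XPA  DREP EDS APBPC")); // prints [XPA, DREP]
    }
}
```

Unlike a lookahead for \s, the negative form (?!\S) also succeeds at the end of the input, so the last word on a line can still match.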

You're wrong, or do you really mean the following is ASCII? `final String s = "Příliš žluťoučký kůň úpěl ďábelské ódy"; final Pattern p = Pattern.compile("\\S+l\\S+"); final Matcher m = p.matcher(s); while (m.find()) System.out.println(m.group());` — maaartinus, Jan 20 '11 at 14:44

score 0 · Answer 6 · edited Jan 20 '11 at 14:56

Try adding whitespace characters (\s) in your negated character classes, and you'll also want to remove the ^ and $ anchors:

[^P\s]*P[^P\s]*

or as a Java String literal:

"[^P\\s]*P[^P\\s]*"

Note that the above does not work on Unicode, only ASCII (as tchrist mentioned in the comments).
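
To make the ASCII restriction concrete: by default, Java's \s covers only the ASCII whitespace characters, so a separator such as a no-break space (U+00A0) is not treated as a delimiter. The sketch below compares the default behaviour with the Pattern.UNICODE_CHARACTER_CLASS flag, which is available from Java 7 onwards. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWhitespaceDemo {
    private static final String REGEX = "[^P\\s]*P[^P\\s]*";

    private static List<String> findAll(Pattern p, String line) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // Default flags: \s matches ASCII whitespace only.
    static List<String> findDefault(String line) {
        return findAll(Pattern.compile(REGEX), line);
    }

    // With the flag, \s follows the Unicode White_Space property.
    static List<String> findUnicode(String line) {
        return findAll(Pattern.compile(REGEX, Pattern.UNICODE_CHARACTER_CLASS), line);
    }

    public static void main(String[] args) {
        String line = "XPA\u00A0DREP"; // words separated by a no-break space
        System.out.println(findDefault(line));
        System.out.println(findUnicode(line));
    }
}
```

With the flag set, the two words should come out separately as XPA and DREP; without it, the no-break space is absorbed into the first match.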

answered Jan 20 '11 at 14:24

Bart Kiers

With the proviso that that’s only going to work on ASCII characters, not non-ASCII Unicode characters. – tchrist Jan 20 '11 at 14:38
