Simple Stuff
Despite the inaccurate protestation that this is impossible with a regex, it certainly is possible.
While @cjm justly states that it is a lot easier to negate a positive match than it is to express a negative one as a single pattern, the model for doing so is sufficiently well-known that it becomes a mere matter of plugging things into that model. Given that:
/X/
matches something, then the way to express the condition
! /X/
in a single, positively-matching pattern is to write it as
/\A (?: (?! X ) . ) * \z /sx
Therefore, given that the positive pattern is
/ (\pL) .* \1 /sxi
the corresponding negative needs must be
/\A (?: (?! (\pL) .* \1 ) . ) * \z /sxi
by way of simple substitution for X.
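As a quick sanity check of that substitution, here is a minimal sketch that runs both patterns over a few sample words. The helper names `has_duplicate_letter` and `has_no_duplicate_letter` are made up for illustration; the two patterns are the ones given above.

```perl
use strict;
use warnings;

# Positive pattern: some letter occurs again later in the string.
my $has_dup = qr/ (\pL) .* \1 /sxi;

# Negative pattern: the positive one plugged into \A (?: (?! X ) . )* \z
my $no_dup = qr/\A (?: (?! (\pL) .* \1 ) . )* \z /sxi;

# Hypothetical helpers wrapping the two patterns.
sub has_duplicate_letter    { return $_[0] =~ $has_dup ? 1 : 0 }
sub has_no_duplicate_letter { return $_[0] =~ $no_dup  ? 1 : 0 }

for my $word (qw(letter Anna copyright subdermatoglyphic)) {
    printf "%-18s positive=%d negative=%d\n",
        $word, has_duplicate_letter($word), has_no_duplicate_letter($word);
}
```

For every input the two helpers should disagree, since the second pattern is meant to match exactly when the first one does not.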
Real-World Concerns
That said, there are extenuating concerns that may sometimes require more work. For example, while \pL
describes any code point having the GeneralCategory=Letter property, it does not consider what to do with words like red‐violet–colored, ’Tisn’t, or fiancée — the latter of which is different in otherwise-equivalent NFD vs NFC forms.
You therefore must first run the string through full decomposition, so that given a string like "r\x{E9}sume\x{301}"
the pattern correctly detects the duplicate “letter é’s” (that is, all canonically equivalent grapheme cluster units).
To account for such as these, you must at a bare minimum first run your string through an NFD decomposition, and then afterwards also use grapheme clusters via \X
instead of arbitrary code points via the dot metacharacter (.).
So for English, you would want something that followed along these lines for the positive match, with the corresponding negative match per the substitution given above:
NFD($string) =~ m{
    (?<ELEMENT>
        (?= [\p{Alphabetic}\p{Dash}\p{Quotation_Mark}] ) \X
    )
    \X *
    \k<ELEMENT>
}xi
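To see why the decomposition step matters, here is a small sketch that applies the same positive pattern to the mixed-form string mentioned earlier, once as-is and once after NFD. It assumes the core Unicode::Normalize module; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# The positive pattern from above, stored for reuse.
my $repeated_element = qr{
    (?<ELEMENT>
        (?= [\p{Alphabetic}\p{Dash}\p{Quotation_Mark}] ) \X
    )
    \X *
    \k<ELEMENT>
}xi;

# One precomposed e-acute (U+00E9) and one decomposed one (e + U+0301).
my $string = "r\x{E9}sume\x{301}";

my $raw_hit = $string      =~ $repeated_element ? 1 : 0;
my $nfd_hit = NFD($string) =~ $repeated_element ? 1 : 0;

print "without NFD: ", ($raw_hit ? "duplicate found" : "no duplicate found"), "\n";
print "with NFD:    ", ($nfd_hit ? "duplicate found" : "no duplicate found"), "\n";
```

Without NFD the two forms of the accented letter are different code point sequences, so the backreference has nothing to match; after NFD both are the same two-code-point cluster.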
Even with that, certain issues remain unresolved, such as whether \N{EN DASH}
and \N{HYPHEN}
should be considered equivalent elements or different ones.
That’s because properly written, hyphenating two elements like red‐violet and colored to form the single compound word red‐violet–colored, where at least one of the pair already contains a hyphen, requires that one employ an EN DASH as the separator instead of a mere HYPHEN.
Normally the EN DASH is reserved for compounds of like nature, such as a time–space trade‐off. People using typewriter‐English don’t even do that, though, using that super‐massively overloaded legacy code point, HYPHEN-MINUS, for both: red-violet-colored.
It just depends whether your text came from some 19th‐century manual typewriter — or whether it represents English text properly rendered under modern typesetting rules. :)
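If you decide that all dashes should count as the same element, one possible approach is to fold them to a single character before matching. The sketch below does that with a plain substitution on \p{Dash}; that folding step is an assumption made for illustration, not something the pattern above does for you.

```perl
use strict;
use warnings;

# Same positive pattern as above, stored for reuse.
my $repeated_element = qr{
    (?<ELEMENT>
        (?= [\p{Alphabetic}\p{Dash}\p{Quotation_Mark}] ) \X
    )
    \X *
    \k<ELEMENT>
}xi;

# HYPHEN (U+2010) and EN DASH (U+2013) between otherwise distinct letters.
my $string = "a\x{2010}b\x{2013}c";

# Assumption for illustration: treat every Dash character as HYPHEN-MINUS.
(my $folded = $string) =~ s/\p{Dash}/-/g;

my $hit_distinct = $string =~ $repeated_element ? 1 : 0;
my $hit_folded   = $folded =~ $repeated_element ? 1 : 0;

print "dashes kept distinct: repeated element? $hit_distinct\n";
print "dashes folded:        repeated element? $hit_folded\n";
```

Whether folding is the right call depends entirely on which kind of text you are dealing with, as described above.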
Conscientious Case Insensitivity
You will note I am here considering letters that differ in case alone to be the same one. That’s because I use the /i
regex switch, ᴀᴋᴀ the (?i)
pattern modifier.
That’s rather like saying that they are the same at collation strength 1, but not quite, because Perl uses only case folding (albeit full case folding, not simple) for its case-insensitive matches, not some higher collation strength than the tertiary level as might be preferred.
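To make the distinction concrete, here is a minimal sketch using the `fc` built-in (available from Perl 5.16), which applies full case folding. The sample strings are only illustrative.

```perl
use v5.16;      # enables strict, plus the fc and unicode_strings features
use warnings;

# Full case folding maps sharp s (U+00DF) to "ss"; plain lowercasing does not.
my $fold_equal  = fc("Stra\x{DF}e") eq fc("STRASSE") ? 1 : 0;
my $lower_equal = lc("Stra\x{DF}e") eq lc("STRASSE") ? 1 : 0;

# Case folding leaves accents alone, unlike a primary-strength comparison.
my $accent_equal = fc("\x{E9}") eq fc("e") ? 1 : 0;

print "fc equal:      $fold_equal\n";
print "lc equal:      $lower_equal\n";
print "accents equal: $accent_equal\n";
```

So case folding equates more than simple lowercasing does, but it still treats an accented letter and its unaccented base as different.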
Full equivalence at the primary collation strength is a significantly stronger statement, but one that may well be needed to fully solve the problem in the general case. However, it takes a lot more work, and it is overkill for many of the specific cases that actually arise, no matter how much it might be needed for the hypothetical general case.
This is made even more difficult because, although you can for example do this:
use utf8;                      # this source contains literal non-ASCII characters
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(
    level         => 1,
    locale        => "de__phonebook",
    normalization => undef,
);
if ($collator->cmp("müß", "MUESS") == 0) { ... }
and expect to get the right answer — and you do, hurray! — this sort of robust string comparison is not easily extended to regex matches.
Yet. :)
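In the meantime, one workaround is to step outside the regex and compare collation sort keys one grapheme cluster at a time. The following is only a sketch under stated assumptions: it uses the default Unicode::Collate table rather than a locale tailoring, it looks only at clusters that begin with a letter, and `has_primary_duplicate` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Primary strength: differences in case and accents are ignored.
my $collator = Unicode::Collate->new(level => 1);

# Hypothetical helper: true if two letter-initial grapheme clusters
# in $string are equal at primary strength.
sub has_primary_duplicate {
    my ($string) = @_;
    my %seen;
    for my $cluster ($string =~ /\X/g) {
        next unless $cluster =~ /\A\pL/;               # skip non-letters
        my $key = $collator->getSortKey($cluster);
        return 1 if $seen{$key}++;                     # key already seen
    }
    return 0;
}

for my $word ("caf\x{E9}", "\x{C9}t\x{E9}", "r\x{E9}sume\x{301}") {
    printf "%d code points: %s\n", length $word,
        has_primary_duplicate($word) ? "duplicate" : "no duplicate";
}
```

This trades the single-pattern approach for a loop, which is the price of getting collation-level equivalence today.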
Summary
The choice of whether to under‐engineer — or to over‐engineer — a solution will vary according to individual circumstances, which no one can decide for you.
I like CJM’s solution that negates a positive match, myself, although it’s somewhat cavalier about what it considers a duplicate letter. Notice:
while ("de__phonebook" =~ /(?=((\w).*?\2))/g) {
    print "The letter <$2> is duplicated in the substring <$1>.\n";
}
produces:
The letter <e> is duplicated in the substring <e__phone>.
The letter <_> is duplicated in the substring <__>.
The letter <o> is duplicated in the substring <onebo>.
The letter <o> is duplicated in the substring <oo>.
That shows why when you need to match a letter, you should always use \pL
ᴀᴋᴀ \p{Letter}
instead of \w
, which actually matches [\p{alpha}\p{GC=Mark}\p{NT=De}\p{GC=Pc}]
.
Of course, when you need to match an alphabetic, you need to use \p{alpha}
ᴀᴋᴀ \p{Alphabetic}
, which isn’t at all the same as a mere letter — contrary to popular misunderstanding. :)
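A small sketch makes the difference between the three classes visible. The sample characters are chosen for illustration, and `classes_for` is a hypothetical helper that just reports which of the three classes a single character matches.

```perl
use strict;
use warnings;

# Hypothetical helper: list which classes a single character belongs to.
sub classes_for {
    my ($char) = @_;
    my @classes;
    push @classes, '\w'             if $char =~ /\A\w\z/;
    push @classes, '\pL'            if $char =~ /\A\pL\z/;
    push @classes, '\p{Alphabetic}' if $char =~ /\A\p{Alphabetic}\z/;
    return join " ", @classes;
}

my @samples = (
    [ "LOW LINE",             "_"        ],
    [ "DIGIT FIVE",           "5"        ],
    [ "ROMAN NUMERAL ONE",    "\x{2160}" ],   # General_Category=Nl
    [ "LATIN SMALL LETTER A", "a"        ],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-22s %s\n", $name, classes_for($char);
}
```

The underscore and the digit are word characters but neither letters nor alphabetics, while the Roman numeral is an alphabetic and a word character without being a letter.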