What perl regex matches a "word" in the following filename?
I have a series of file names in which some words appear more than once:
john_smith_on_alaska_trip_john_smith_0001.jpg
His wife's name is Olga, with an umlaut over the o, and there are a few other names with diacritics; all lower case, in my situation, but not simply English a-z. The .jpg has been temporarily stripped off for other reasons, and may be ignored for this discussion.
I want to remove the duplicate names/words. Something like this works fine in emacs:
s/(\b\w{3,}\b)(.*)(\b\1\b)/\1\2/
Run it once, the above turns to: john_smith_on_alaska_trip__smith_0001.jpg
Again: john_smith_on_alaska_trip___0001.jpg
In Perl, this does not work because \w
includes the _
as a word character.
Worse yet - the anchor, \b
is anything other than those characters, and therefore doesn't separate on _
.
My current solution is to replace all _
with , do the deed, and revert. But, this seems such a fundamental requirement, I feel I must be missing something.
Thank you.