How do I recognize a "text" word via regex?

Question

What perl regex matches a "word" in the following filename?

I have a series of file names in which some words appear more than once:

john_smith_on_alaska_trip_john_smith_0001.jpg

His wife's name is Olga, with an umlaut over the o, and there are a few other names with diacritics; all lower case, in my situation, but not simply English a-z. The .jpg has been temporarily stripped off for other reasons, and may be ignored for this discussion.

I want to remove the duplicate names/words. Something like this works fine in emacs:

s/(\b\w{3,}\b)(.*)(\b\1\b)/\1\2/

Run it once, the above turns to: john_smith_on_alaska_trip__smith_0001.jpg

Again: john_smith_on_alaska_trip___0001.jpg

In Perl, this does not work because \w includes the _ as a word character. Worse yet - the anchor, \b is anything other than those characters, and therefore doesn't separate on _.

My current solution is to replace all _ with , do the deed, and revert. But, this seems such a fundamental requirement, I feel I must be missing something.

Thank you.

possible duplicate of [What is the best way to match only letters in a regex?](http://stackoverflow.com/questions/3754097/what-is-the-best-way-to-match-only-letters-in-a-regex) — Barmar, Oct 25 '14 at 02:09
This question isn't a duplicate, he wants to include `_` in his match sometimes. — hmatt1, Oct 25 '14 at 02:12
1. Already found that comment. 2. It has so many false starts, it does not have the value it could. I imagine most people who find that page walk away in frustration. 3. For key useful pages, perhaps like that one, a knowledgeable person with a few minutes of cleanup time might save a lot of time in the long run (if people understand it, they may not ask a new one). This is only my second day with Perl, so I won't take the chance of wrecking it by [try] to fix it... :-) — Matthew Miner, Oct 25 '14 at 04:09

Miller · Accepted Answer · 2014-10-25T03:28:12.143

2

Use the Character Class \p{Alpha} and Lookbehind and Lookahead assertions in place of word boundaries to ensure that the each word is a whole word instead of a substring:

use strict;
use warnings;

my $file = "john_smith_on_alaska_trip_john_smith_0001_johnsmith.jpg";

1 while $file =~ s{
    (?<!\p{Alpha}) ( \p{Alpha}++ )     # Word surrounded by non-word chars
    .* \K                              # Keep everything before this point
    (?<!\p{Alpha}) \1 (?!\p{Alpha})    # Strip duplicate word 
}{}x;

print "$file\n";

Outputs:

john_smith_on_alaska_trip___0001_johnsmith.jpg

Live Demo

edited Oct 25 '14 at 03:28

answered Oct 25 '14 at 02:13

Miller

34,962
4
39
60

Welcome to your day 2 with Perl. Note, there are quite a few regex concepts I used in this solution, so I would recommend a general reading of [perlretut](http://perldoc.perl.org/perlretut.html) and [perlre](http://perldoc.perl.org/perlre.html) whenever you're ready to start picking up more re tools. – Miller Oct 25 '14 at 04:22
100% on the money. And, learned a lot - \K, ++, (? - all new (to me :-). I'd vote it up, but don't have the rep. You've been so helpful, I hate to ask, what does the '1' mean (to the left of 'while'), haven't found that yet. – Matthew Miner Oct 25 '14 at 04:29
Because we're modifying the string back to front, we can't utilize the `/g` Modifier. Instead, I just put the regex in a loop with the regex as the condition, and a dummy expression as the BLOCK. It's equivalent to `while ($file =~ s/.../.../) { }` except using the statement modifier form of `while`. – Miller Oct 25 '14 at 05:02
The statement is in the form: 1 while ( $f =~ s/.../.../ ) {} x; It's the '1' in the beginning and 'x' at the end that have me confused. They seem unnecessary. – Matthew Miner Oct 25 '14 at 05:23
The [`/x` Modifier](http://perldoc.perl.org/perlre.html#%2fx) enables arbitrary whitespace and comments within the regex. Additionally, I'm using braces `s{}{}` as an alternative regex delimiter instead of `s///`. This is because I consider it easier to read when the regex spans lines. Finally, for the alternative `while` format, read [perlsyn - Statement Modifiers](http://perldoc.perl.org/perlsyn.html#Statement-Modifiers). – Miller Oct 25 '14 at 05:31
Got it - you were using the form: while ; Since I didn't recognize "1" as a statement, the structure made no sense to me. Again, thanks for your time - the \K was a totally new concept. – Matthew Miner Oct 25 '14 at 06:00

ThisSuitIsBlackNot · Answer 2 · 2014-10-25T04:31:36.153

You can use split to separate your string into its constituent pieces and then check for duplicates using a hash:

use strict;
use warnings;

my $string = 'john_smith_on_alaska_trip_john_smith_0001.jpg';
my @words = split /_/, $string;

my %count;
foreach my $word (@words) {
    $word = '' if ++$count{$word} > 1;
}

print join('_', @words), "\n";

Output:

john_smith_on_alaska_trip___0001.jpg

Alternatively, you could use uniq from List::MoreUtils to get the unique words, although this will change your output slightly by eliminating the consecutive underscores after trip:

use strict;
use warnings;

use List::MoreUtils 'uniq';

my $string = 'john_smith_on_alaska_trip_john_smith_0001.jpg';
my @words = split /_/, $string;

print join('_', uniq @words), "\n";

Output:

john_smith_on_alaska_trip_0001.jpg

How do I recognize a "text" word via regex?

2 Answers2

Output:

Output: