I'll show the solutions in Perl, as it is probably the most flexible tool for text processing, especially when it comes to regular expressions.
Detecting Duplicates
perl -ne 'print if m{\b(\S+)\b.*?(\b\1\b)}g' file
where
-n
causes Perl to execute the expression passed via -e for each input line;
\b
matches a word boundary;
\S+
matches one or more non-whitespace characters;
.*?
is a non-greedy match for zero or more characters;
\1
is a backreference to the first group, i.e. the word matched by \S+;
g
matches the pattern repeatedly throughout the string.
Removing Duplicates
perl -pe '1 while (s/\b(\S+)\b.*?\K(\s\1\b)//g)' file
where
-p
causes Perl to print each line ($_), like sed;
1 while
keeps running the loop as long as the substitution replaces something;
\K
excludes everything matched up to that point from the text being replaced, so only the duplicate is removed;
duplicate words (\s\1\b) are replaced with the empty string (//g).
Why Perl?
Perl regular expressions are known to be very flexible; in fact, they go well beyond classical regular expressions. For example, you can embed Perl code in the substitution using the /e modifier. The /x modifier allows you to write regular expressions in a more readable format, and even to use Perl comments inside them, e.g.:
perl -pe '1 while (
  s/            # Begin the substitution; the replacement is empty
                # (note: a literal slash in a comment would end the pattern)
    \b (\S+) \b # A word
    .*?         # Non-greedy match for any number of characters
    \K          # Keep everything matched by the previous patterns
    (           # Group for the duplicate word:
      \s        #   - whitespace
      \1        #   - backreference to the word
      \b        #   - word boundary
    )
  //xg
)' file
As you may have noticed, the \K escape is very convenient, but it is not available in many popular tools, including awk, bash, and sed.
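Where \K is unavailable, a capture-and-replace workaround can approximate the removal; this sketch assumes GNU sed (for \b and -E) and, unlike the Perl one-liner, only collapses adjacent repeats:

```shell
# GNU sed: match a word followed by one or more copies of itself
# and keep a single copy of it.
printf 'this is is a test\n' | sed -E 's/\b([^ ]+)( \1\b)+/\1/g'
# prints: this is a test
```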