I'll show the solutions in Perl, as it is probably the most flexible tool for text processing, especially when it comes to regular expressions.
Detecting Duplicates
perl -ne 'print if m{\b(\S+)\b.*?(\b\1\b)}g' file
where
-n
causes Perl to execute the expression passed via -e for each input line;
\b
matches a word boundary;
\S+
matches one or more non-whitespace characters;
.*?
is a non-greedy match for zero or more characters;
\1
is a backreference to the first group, i.e. the word matched by \S+;
g
matches the pattern repeatedly throughout the string.
Removing Duplicates
perl -pe '1 while (s/\b(\S+)\b.*?\K(\s\1\b)//g)' file
where
-p
causes Perl to print each line ($_), like sed;
1 while
keeps running the loop as long as the substitution replaces something;
\K
excludes everything matched up to that point from the text being replaced, so only the duplicate is removed;
duplicate words (\s\1\b) are replaced with the empty string (//g).
Why Perl?
Perl regular expressions are known to be very flexible; in fact, they go well beyond classical regular expressions. For example, you can embed Perl code in the substitution using the /e modifier. The /x modifier allows you to write regular expressions in a more readable format, and even to use Perl comments inside them, e.g.:
perl -pe '1 while (
  s/            # Begin the substitution; the replacement is empty
                # (note: a literal slash in a comment would end the pattern)
    \b (\S+) \b # A word
    .*?         # Non-greedy match for any number of characters
    \K          # Keep everything matched by the previous patterns
    (           # Group for the duplicate word:
      \s        #   - whitespace
      \1        #   - backreference to the word
      \b        #   - word boundary
    )
  //xg
)' file
As you may have noticed, the \K escape is very convenient, but it is not available in many popular tools, including awk, bash, and sed.
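Where \K is unavailable, a capture-and-replace workaround can approximate the removal; this sketch assumes GNU sed (for \b and -E) and, unlike the Perl one-liner, only collapses adjacent repeats:

```shell
# GNU sed: match a word followed by one or more copies of itself
# and keep a single copy of it.
printf 'this is is a test\n' | sed -E 's/\b([^ ]+)( \1\b)+/\1/g'
# prints: this is a test
```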