
I have a plain text file with words separated by commas, for example:

word1, word2, word3, word2, word4, word5, word3, word6, word7, word3

I want to delete the duplicates so that it becomes:

word1, word2, word3, word4, word5, word6, word7

Any ideas? I think egrep can help me, but I'm not sure how to use it exactly.

cupakob
  • Do you want the words to be unique on a line basis, or over the whole file? Also do you want to maintain the original order of the words, or are you happy if the order is changed? – Beano Jun 04 '09 at 18:46
  • I need the unique words in the whole file; the order of the words is not important. – cupakob Jun 04 '09 at 19:15
  • See also: [How can I find repeated words in a file using grep/egrep?](http://stackoverflow.com/q/33396629/562769) – Martin Thoma Jan 12 '17 at 11:13

10 Answers


Assuming that the words are one per line, and the file is already sorted:

uniq filename

If the file's not sorted:

sort filename | uniq

If they're not one per line, and you don't mind them being one per line:

tr -s '[:space:]' '\n' < filename | sort | uniq

That doesn't remove punctuation, though, so maybe you want:

tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq

But that removes the hyphen from hyphenated words. "man tr" for more options.
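
For the comma-separated input in the question specifically, here is a minimal sketch that treats the comma as just another separator (assuming the file is called filename; sort -u is shorthand for sort | uniq):

tr -s '[:space:],' '\n' < filename | sort -u

That gives one word per line; the comment below shows one way to join the words back into a single row.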

Randy Orrison
  • that works for me :) thanks a lot... I only need to put all words back in one row with: cat testfile_out.txt | tr "\n" " " > testfile_out2.txt – cupakob Jun 04 '09 at 19:24

ruby -pi.bak -e '$_ = $_.split(",").uniq.join(",")' filename ?

I'll admit the two kinds of quotations are ugly.

Oliver N.
  • Ruby isn't a Linux command! I presume by Linux command he means regular GNU programs. – Danny Jun 04 '09 at 18:52
  • @Danny, I saw that, and you could do this with some overzealous sed/awk alchemy, but really I think this is a job for a scripting language. – Oliver N. Jun 04 '09 at 19:16
  • +1 as this seems undeniably elegant, and more approachable for mortals compared to Igor Krivokon's Perl one-liner :) – Jonik Jun 04 '09 at 19:54

I had the very same problem today: a word list with 238,000 words, of which about 40,000 were duplicates. I already had them on individual lines by doing

cat filename | tr " " "\n" | sort 

To remove the duplicates I then simply did

cat filename | uniq > newfilename

It worked perfectly with no errors, and my file went from 1.45MB down to 1.01MB.
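
If you don't need the intermediate listing, the two steps can be chained in one pipeline without cat (a sketch, assuming the same file names as above):

tr " " "\n" < filename | sort | uniq > newfilename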

Biffinum
  • Nice, straightforward solution that uses pipes the way the creator intended. :) You can avoid the "useless use of cat" (UUoC—it's so common that it has an abbreviation!) by using redirection, and you can take advantage of `sort`'s built-in deduplication option: `sort -u < filename > newfilename` – Adam Liss May 07 '22 at 18:17

Creating a unique list is pretty easy thanks to uniq, although most Unix commands like one entry per line instead of a comma-separated list, so we have to start by converting it to that:

$ sed 's/, /\n/g' filename | sort | uniq
word1
word2
word3
word4
word5
word6
word7

The harder part is putting this on one line again with commas as separators and not terminators. I used a perl one-liner to do this, but if someone has something more idiomatic, please edit me. :)

$ sed 's/, /\n/g' filename | sort | uniq | perl -e '@a = <>; chomp @a; print((join ", ", @a), "\n")'
word1, word2, word3, word4, word5, word6, word7
Ryan Bright
  • tr " " "\n" might be more efficient than sed in this case – florin Jun 04 '09 at 19:03
  • Putting that on one line is quite simple: `sed 's/, /\n/g' filename | sort | uniq | paste -s -d, | sed 's/,/, /g'`. The command here is paste, a very nice one! –  Jun 04 '09 at 21:21
  • `tr " " "\n"` is different because it doesn't handle the commas, and you can't just ignore the commas because the last word doesn't have one. With the example in the question, you'd end up uniq'ing "word3" and "word3,". Another answer has a tr command that would remove all whitespace and all punctuation, if that's what you're after. I was just being specific. – Ryan Bright Jun 04 '09 at 23:52

Here's an awk script that will leave each line intact, only removing the duplicate words:

BEGIN {
    FS = ", "
}
{
    sep = ""
    for (i = 1; i <= NF; i++) {
        if (!($i in used)) {
            used[$i] = 1
            printf "%s%s", sep, $i
            sep = ", "
        }
    }
    printf "\n"
    split("", used)    # reset the seen-word list for the next line
}
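
A hypothetical invocation, assuming the script above is saved as dedup.awk and the word list is in filename:

awk -f dedup.awk filename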
mamboking
  • That works also, but not perfect ;) the output contains a word with two commas... that is not a big problem :) thanks a lot – cupakob Jun 04 '09 at 19:37

Came across this thread while trying to solve much the same problem. I had concatenated several files containing passwords, so naturally there were a lot of duplicates, and also a lot of non-standard characters. I didn't really need them sorted, but it seemed that was going to be necessary for uniq.

I tried:

sort /Users/me/Documents/file.txt | uniq -u
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'

Tried:

sort -u /Users/me/Documents/file.txt >> /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'.

And even tried passing it through cat first, just so I could see if we were getting a proper input.

cat /Users/me/Documents/file.txt | sort | uniq -u > /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `zon\351s' and `zoologie'.

I'm not sure what's happening. The strings "t\203tonnement" and "t\203tonner" aren't found in the file, though "t\203" and "tonnement" are found, but on separate, non-adjoining lines. Same with "zon\351s".

What finally worked for me was:

awk '!x[$0]++' /Users/me/Documents/file.txt > /Users/me/Documents/file2.txt

It also preserved words whose only difference was case, which is what I wanted. I didn't need the list sorted, so it was fine that it wasn't.
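
For what it's worth, the workaround that sort itself suggests in those error messages would look something like this (a sketch; LC_ALL=C makes sort compare raw bytes instead of using the locale's collation rules):

LC_ALL=C sort -u /Users/me/Documents/file.txt > /Users/me/Documents/file2.txt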

sudon't

I'd think you'll want to replace the spaces with newlines, use the uniq command to find unique lines, then replace the newlines with spaces again.
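
A minimal sketch of that pipeline, assuming the file is called filename (a sort is added because uniq only collapses adjacent duplicate lines):

tr ' ' '\n' < filename | sort | uniq | tr '\n' ' '

Note that this leaves the commas attached to the words; the tr variants in the earlier answer strip punctuation as well.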

Paul Sonier

I presumed you wanted the words to be unique on a single line, rather than throughout the file. If this is the case, then the Perl script below will do the trick.

while (<DATA>)
{
    chomp;
    my %seen = ();
    my @words = split(m!,\s*!);
    @words = grep { $seen{$_} ? 0 : ($seen{$_} = 1) } @words;
    print join(", ", @words), "\n";
}

__DATA__
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3

If you want uniqueness over the whole file, you can just move the %seen hash outside the while (){} loop.

Beano
  • Perl isn't a Linux command! I presume by Linux command he means regular GNU programs. Then again Perl is installed everywhere... heh. – Danny Jun 04 '09 at 18:53
  • Could you please point out what your definition of a "Linux command" is (or rather @rbright's as you seem to know him)? Maybe a command found in Linux distributions? – Beano Jun 04 '09 at 19:08
  • I mean a command which is integrated in the default installation of the most popular distros... for example, something like grep. – cupakob Jun 04 '09 at 19:13
  • +1 for your code. Needed a one liner to "unique"-ing a string of words. Thanks!! – GuruM Sep 11 '11 at 15:31

Open the file with vim (vim filename) and run the sort command with the unique flag (:sort u).
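
If you'd rather not open an interactive session, a sketch of the same thing driven from the shell (assuming the file is called filename):

vim -c 'sort u' -c 'wq' filename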

meysam

And don't forget the -c option for the uniq utility if you're interested in getting a count of the words as well.
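
For example, with the question's words already one per line in filename (a sketch; the exact alignment of the counts varies between implementations):

$ sort filename | uniq -c
   1 word1
   2 word2
   3 word3
   1 word4
   1 word5
   1 word6
   1 word7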

Rob Wells