
I want to remove duplicate lines, but it's a bit more complicated than that.

I have a file named users.txt. Example contents:

users:email@email.com
users1:email@email.com

Due to a bug in my system, people were able to register with an email address that someone else had already used, so I want to remove lines whose email appears more than once. An example of the issue:

user:display:email@email.com
user2:email@email.com
user3:email@email.com
user4:email@email.com

Notice how user, user2, user3, and user4 all have the same email. I want to remove user2, user3, and user4 but keep user, or vice versa (whichever one is picked up first by the request), and remove any other lines containing the same email.

So if

email@email.com is in 20 lines, remove 19
spam@spam.com is in 555 lines, remove 554

and so forth.

user3255841
  • Use the email as the index in an `awk` array. When you're processing each line, if the email isn't in the array, print the line and add it to the array. – Barmar Mar 01 '17 at 23:23
  • See http://stackoverflow.com/questions/2604088/awk-remove-line-if-field-is-duplicate – Barmar Mar 01 '17 at 23:25
  • Can you explain what you mean by "first one to be picked up by request"? What exactly is your criterion to choose which line remains? First username in alphabetical order? First one to appear in the file? – Fred Mar 02 '17 at 00:24

2 Answers


This can be done with awk:

awk '!a["user:display:email@email.com"]++' filename

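++ increments the array entry after its current value has been read, so the entry is 0 (false) the first time and non-zero (true) every time after that. Used as a pattern on its own, it prints every line except the first one counted.

! turns that around: the expression is true only the first time, so the first line is printed and later ones are not. Because the key here is a fixed string rather than a field of the line, every line in the file shares the same entry.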

example:

$ awk 'a["user:display:email@email.com"]++' filename 
user2:email@email.com
user3:email@email.com
user4:email@email.com
line_random1
linerandom_2_

Now with !

$ awk '!a["user:display:email@email.com"]++' filename
user:display:email@email.com
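
To see what awk is actually testing, here is a small sketch that prints the counter value next to each line. It assumes the email is the last colon-separated field and uses it as the array key instead of a fixed string; the three sample lines are taken from the question and are only for illustration.

```shell
# Print the value a[$NF]++ evaluates to for each line.
# -F: splits on ":" and $NF is the last field (assumed to be the email).
counts=$(printf '%s\n' \
  'user:display:email@email.com' \
  'user2:email@email.com' \
  'other:spam@spam.com' |
  awk -F: '{ n = a[$NF]++; print n, $0 }')
printf '%s\n' "$counts"
```

A line whose counter is 0 is the first one seen for that email, which is exactly the line that `!a[$NF]++` would keep.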

So now you just need to decide what to key the awk array on. I don't know how big your file is, but to at least count the entries I would do the following:

$ grep -o 'email@email.com' filename | wc -l
4
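
If you don't want to grep for one address at a time, the counting can probably be done for every email in one pass. This is only a sketch: it assumes the email is always the last colon-separated field, and the sample file (including the made-up address unique@example.com) is there purely for illustration.

```shell
# Build a small sample file; the contents are illustrative only.
sample=$(mktemp)
printf '%s\n' \
  'user:display:email@email.com' \
  'user2:email@email.com' \
  'user3:email@email.com' \
  'user4:email@email.com' \
  'a:spam@spam.com' \
  'b:spam@spam.com' \
  'c:unique@example.com' > "$sample"

# Count lines per email and list only the addresses that occur more than once.
# awk's "for (e in count)" order is unspecified, so sort by count afterwards.
dupes=$(awk -F: '{ count[$NF]++ }
  END { for (e in count) if (count[e] > 1) print count[e], e }' "$sample" | sort -rn)
printf '%s\n' "$dupes"
rm -f "$sample"
```

Each output line is the number of lines followed by the email, so the number of lines to be removed for an address is that count minus one.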

Once you know what to key on, write the result to a new file, just to be safe.

awk '!a["user:display:email@email.com"]++' filename >> new_filename
rowan

awk to the rescue!

$ awk -F: '!a[$NF]++' file 

user:display:email@email.com
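
For anyone wanting to try this on a copy first, here is a self-contained sketch of the same one-liner. `-F:` makes ":" the field separator and `$NF` is the last field, so it assumes the email is always the last field on the line and that addresses are compared case-sensitively. The temporary directory, the output name users.dedup.txt, and the array name `seen` are my own choices, not part of the answer.

```shell
# Work in a scratch directory so the real users.txt is never touched.
workdir=$(mktemp -d)
cat > "$workdir/users.txt" <<'EOF'
user:display:email@email.com
user2:email@email.com
user3:email@email.com
user4:email@email.com
other:spam@spam.com
other2:spam@spam.com
EOF

# seen[$NF]++ is 0 (false) the first time an email appears,
# so !seen[$NF]++ prints only the first line for each email.
awk -F: '!seen[$NF]++' "$workdir/users.txt" > "$workdir/users.dedup.txt"

deduped=$(cat "$workdir/users.dedup.txt")
printf '%s\n' "$deduped"
rm -rf "$workdir"
```

The first line for each email is kept in its original order, which matches the "keep user, drop user2 to user4" case in the question.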
karakfa