
I am looking for a way to remove duplicate rows from my Notepad++ file. The rows are not exact duplicates, per se. Here's the situation: I have a large file of capitalized company names, each followed by a probability value (the two columns are separated by a tab). The format looks like this:

ATT   .7213
SAMSUNG   .01294
SAMSUNG   .90222

So, I need to remove one of these rows because the first column matches. I don't have a preference for which one is removed, as long as exactly one row remains at the end. I have tried unique sorting with TextFX, but it looks for duplicates of the whole row, not just the first column. If anyone could offer a handy solution I would greatly appreciate it. Bash script answers using awk, sed, or cut are acceptable, as are regular expressions.

Thank you!

Tastybrownies
  • possible duplicate of [Removing duplicate rows in Notepad++](http://stackoverflow.com/questions/3958350/removing-duplicate-rows-in-notepad) – Pankaj Jaju Feb 06 '14 at 18:05
  • @PankajJaju The question itself is different, but it appears one of the answers to the other question could be adapted to solve this. – chepner Feb 06 '14 at 18:24

2 Answers


Using awk, you could say:

awk '!a[$1]++' filename

This keeps the first line for each distinct value of the first field and discards later lines that repeat it.
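For instance, assuming the sample rows from the question are saved in a file named companies.txt (a hypothetical name; substitute your own), the command keeps the first SAMSUNG row and drops the second:

$ awk '!a[$1]++' companies.txt
ATT   .7213
SAMSUNG   .01294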

devnull
  • Since this only keeps those with a unique value, would it remove both SAMSUNG rows? – Jon Senchyna Feb 06 '14 at 18:25
  • It wouldn't remove both `SAMSUNG` rows; it'd rather keep the first one and ignore the rest. – devnull Feb 07 '14 at 07:18
  • If that's the case, then your explanation in the answer is a bit confusing. Saying it would keep *only* the lines having a *unique* value would mean that all instances of a duplicate line would *not* be kept (since neither the first, nor the rest, are unique). – Jon Senchyna Feb 07 '14 at 12:32
  • @Jon The first time a Samsung row is encountered, `a[SAMSUNG]` has the value 0, which negated becomes a non-zero value, indicating the line should be printed. After the line is accepted, `a[SAMSUNG]` is incremented, so that in the future, `a[SAMSUNG]` will always have a non-zero value, which when negated becomes 0, which rejects the line. (An expanded long-form sketch appears after this thread.) – chepner Feb 07 '14 at 17:51
  • On another note, this is more efficient than my answer, as it runs in O(n) time, as opposed to the O(n lg n) that my sorting-based answer requires. – chepner Feb 07 '14 at 17:52
  • @chepner I was referring to the wording of the answer, not the actual results. The wording of the answer is confusing, as the definition of "unique" would rule out even the first of a series of duplicates. – Jon Senchyna Feb 07 '14 at 21:49
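To make chepner's trace concrete, here is a long-form sketch that is equivalent to the one-liner (hypothetical, for illustration only; companies.txt stands in for your file):

awk '{
    if (a[$1] == 0)    # first time this company name is seen, a[$1] is still 0
        print $0       # so the line is kept
    a[$1]++            # from now on a[$1] is non-zero, so later duplicates fail the test
}' companies.txt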

Use sort:

sort -k1,1 -u companies.txt

The output consists of full lines, but only the sort key (the first field) is considered when identifying duplicates.
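For example, run against the hypothetical companies.txt holding the sample rows, the output is sorted with one row per company. Note that when two lines share the same key, which one `sort -u` keeps is not specified, so the surviving SAMSUNG probability could be either value:

$ sort -k1,1 -u companies.txt
ATT   .7213
SAMSUNG   .01294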

chepner
  • Thank you, this worked perfectly and now I have what I need. So sorting and specifying key field 1 with -k1, then what exactly does the 1 after the comma do? I know the -u asks for uniqueness. – Tastybrownies Feb 07 '14 at 17:46
  • `-k1`, by itself, uses fields 1 through the end of the record. For instance, to sort on fields 2 through 5, you might use `-k2,5`. `-k1,1` limits the comparison to the first field and only the first field (since it's a one-element range). (A short demo follows this thread.) – chepner Feb 07 '14 at 17:49
  • Okay, thanks for being nice and explaining that. Good to know how that works now. – Tastybrownies Feb 07 '14 at 17:50
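A quick hypothetical demo of the key ranges chepner describes, again using the sample companies.txt: with -k1 the key runs from field 1 to the end of the line, so the probability column participates in the comparison and both SAMSUNG lines are kept as distinct; with -k1,1 only the first field is compared, so one of them is removed.

$ sort -k1 -u companies.txt
ATT   .7213
SAMSUNG   .01294
SAMSUNG   .90222

$ sort -k1,1 -u companies.txt
ATT   .7213
SAMSUNG   .01294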