
Input:

line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef

Desired output:

line3 c dd
line5 b ef

That is, I want to output a line only if no other line has the same value in column 2. I thought I could do this with a combination of sort (e.g. `sort -k2,2 input`) and uniq, but it appears that uniq can only skip columns from the left (`-f` avoids comparing the first N fields). Surely there's some straightforward way to do this with awk or something.

5heikki
    What have you tried? Most of us here are happy to help you improve your craft, but are less happy acting as short order unpaid programming staff. Show us your work so far in an [MCVE](http://stackoverflow.com/help/mcve), the result you were expecting and the results you got, and we'll help you figure it out. – ghoti Mar 10 '16 at 12:40

4 Answers


You can do this as a two-pass awk script:

awk 'NR==FNR{a[$2]++;next} a[$2]<2' file file

This runs through the file once incrementing a counter in an array whose key is the second field of each line, then runs through a second time printing only those lines whose counter is less than 2.

You'd need multiple reads of the file because at any point during the first read, you can't possibly know whether there will be another instance of the second field of that line later in the file.
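For reference, here's the full round trip with the sample data from the question (the file name `input.txt` is assumed):

```shell
# Recreate the sample input from the question
cat > input.txt <<'EOF'
line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef
EOF

# Pass 1 (NR==FNR is true only while reading the first file) counts
# occurrences of field 2; pass 2 prints lines whose field-2 count is
# below 2, preserving the original input order.
awk 'NR==FNR{a[$2]++;next} a[$2]<2' input.txt input.txt
# line3 c dd
# line5 b ef
```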

ghoti

Here is a one-pass awk solution:

awk '{a1[$2]++;a2[$2]=$0} END{for (a in a1) if (a1[a]==1) print a2[a]}' file

The original order of the file will be lost, however.
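If the original order matters, one sketch (still a single pass over the file) is to also record each key's line number, print survivors prefixed with it, and re-sort on that prefix afterwards:

```shell
# Sample input from the question
cat > input.txt <<'EOF'
line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef
EOF

# Record count, full line, and line number per column-2 key; emit
# unique survivors prefixed with their line number, then restore
# the input order numerically and strip the prefix.
awk '{cnt[$2]++; line[$2]=$0; nr[$2]=NR}
     END{for (k in cnt) if (cnt[k]==1) print nr[k], line[k]}' input.txt |
sort -n | cut -d' ' -f2-
# line3 c dd
# line5 b ef
```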

dawg
  • Not *exactly* one pass, is it? First pass reads the file from disk, second pass reads the file from memory. – Graham Mar 17 '16 at 15:50
  • @Graham: As opposed to `awk '{actions}' file file` that would be the other way to do it. – dawg Mar 17 '16 at 17:12

You can combine awk, grep, sort and uniq for a quick one-liner:

grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d) " input.txt

Edit, escaping the regex metacharacters while leaving `+` and digits alone (since `\+` and `\digit` would become a repetition operator and backreferences):

grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d | sed 's/[^+0-9]/\\&/g') " input.txt
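Putting the one-liner together with the sample input (note this assumes exactly one duplicated value in column 2; with several duplicated values, `uniq -d` emits multiple lines and the command substitution would splice newlines into the grep pattern):

```shell
# Sample input from the question
cat > input.txt <<'EOF'
line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef
EOF

# uniq -d prints the duplicated column-2 value ("a" here); grep -v
# then drops every line whose second field is that value.
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d) " input.txt
# line3 c dd
# line5 b ef
```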

  • I never saw grep being used like that before. What does the ^[^ ]* part do? – 5heikki Mar 10 '16 at 12:42
  • @5heikki The first `^` anchors the regex to the start of the line; inside the brackets `^` negates the class, so `[^ ]*` matches a run of non-space characters (the first field). –  Mar 10 '16 at 12:48
  • So in this case this would be equivalent **grep -v "^[^ ]* $(echo a)" input**, however, even if there's a 4th column consisting of nothing but a's it still works. I just don't understand why.. – 5heikki Mar 10 '16 at 12:54
  • @5heikki Because the regex is specifically targeting the second column, it matches non-white space up to whitespace then the duplicated patterns then a whitespace. That brings it to the second column and `grep -v` inverts the match. –  Mar 10 '16 at 12:58
  • Ah ok, now I get it. The white space before the closing double quotation mark wasn't a typo then. Accepting this answer because it was posted first. ghoti's answer works just as well though.. – 5heikki Mar 10 '16 at 13:04
  • You will find that @ghoti's answer continues to work while this one fails if/when the text you are matching on contains RE metacharacters. That might be fine for your data, idk, but in general an answer that produces the output you expect from a given sample input is the starting point for identifying a solution, not the end point - you have to really THINK about what each answer is actually doing. – Ed Morton Mar 10 '16 at 13:19
  • @Ed Morton Can you give me an example of what you are talking about? –  Mar 10 '16 at 13:24
  • I think he means that if in column 2 there's something like "*" instead of a letter.. – 5heikki Mar 10 '16 at 13:27
  • @5heikki Yea it sounds like it, but for `*` and those cases it works; I've just tried a few to see if I could break my solution and didn't find any (e.g. `line3 [^c]* dd`). –  Mar 10 '16 at 13:30
  • The metachar has to be on a line that the `awk | sort | uniq` would output so it's used by the surrounding `grep`. Try replacing the `a`s in the 2nd column with `.`s and you'll find the script outputs nothing since the grep ends up being `grep -v "^[^ ] . "`. With other metachars you'll get the same or syntax errors or other side-effects. The core difference is that @ghoti's solution is doing a string comparison while this one is doing a regexp comparison. – Ed Morton Mar 10 '16 at 13:36
  • @Ed Morton Thanks. @5heikki If you have those types of characters you can make sure to escape the grep'd column: `grep -v "^[^ ]* $(awk '{print $2}' input.txt|sort|uniq -d|sed 's/./\\&/g') " input.txt` –  Mar 10 '16 at 13:43
  • You're welcome. You can't just escape them though as escaping some characters turns them into metacharacters, e.g. if the input had a `+` that's not a BRE metacharacter so grep will treat it literally but `\+` becomes a metacharacter since escaping it activates it's ERE metachar property of 1-or-more repetitions. You might fare better wrapping each character inside a bracket expression `sed 's/./[&]/g'`, idk. Personally I'd just use awk instead of grep since awk understands strings and fields. – Ed Morton Mar 10 '16 at 13:47
  • @Ed Right, I don't know the list but it should be finite you can add `+` and any other special meta modifier characters to the bracket list of inverted matches: `grep -v "^[^ ]* $(awk '{print $2}' input.txt|sort|uniq -d|sed 's/[^+]/\\&/g') " input.txt` –  Mar 10 '16 at 13:56
  • And to avoid grep backreferences: `grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d | sed 's/[^+0-9]/\\&/g') " input.txt` –  Mar 10 '16 at 14:07
  • See http://stackoverflow.com/q/29613304/1745001 for how to escape metacharacters if you really want to go that route rather than just using string comparisons. That question was about sed but a lot of it applies to grep too. – Ed Morton Mar 10 '16 at 14:13
  • Lol thanks Ed, I actually provided a comment on that post last year. –  Mar 10 '16 at 14:17

An alternative to awk, to demonstrate that this can still be done with sort and uniq (uniq has the `-u` option for exactly this); however, setting up the right format requires some juggling (the decorate/do stuff/undecorate pattern).

$ paste file <(cut -d' ' -f2 file) | sort -k2 | uniq -uf3 | cut -f1

line5 b ef
line3 c dd

As a side effect you lose the original ordering, which can be recovered as well if you add line numbers...
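One sketch of that recovery: decorate with a line number up front via `nl` (which shifts the field offsets for sort and uniq by one), filter as before, then sort numerically to restore the input order:

```shell
# Sample input from the question, named "file" as in the answer
cat > file <<'EOF'
line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef
EOF

# Decorate each line with its line number and a tab-appended copy of
# column 2, keep only lines whose key is unique (uniq -u, comparing
# the appended key after skipping 4 fields), then undecorate.
nl -ba -w1 file | paste - <(cut -d' ' -f2 file) |
sort -k3 | uniq -uf4 | sort -n | cut -f2
# line3 c dd
# line5 b ef
```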

karakfa