
In Python I can sort a list like this...

lines = ["C: fish house bridge chocolate",
         "C: hamster pen flower penny",
         "C: dog park car paper",
         "C: hamster pen bus tank",
         "C: hamster lolly stick shirt"]

lines = sorted(lines)
for line in lines:
    print(line)

Gives...

C: dog park car paper
C: fish house bridge chocolate
C: hamster lolly stick shirt
C: hamster pen bus tank
C: hamster pen flower penny

I can also sort by a particular column...

lines = sorted(lines, key=lambda line: line.split()[1])
for line in lines:
    print(line)

Gives...

C: dog park car paper
C: fish house bridge chocolate
C: hamster pen flower penny
C: hamster pen bus tank
C: hamster lolly stick shirt

How can I remove lines so that the combined 2nd and 3rd columns of each line are unique?

Desired output would be...

C: dog park car paper
C: fish house bridge chocolate
C: hamster pen bus tank
C: hamster lolly stick shirt

In awk I could use something like !seen...

awk '!seen[$1][$2][$3]++'

What is the equivalent in Python?

martineau
Chris
  • What does this have to do with sorting? It seems like you want to sort *then* remove duplicates, no? – wjandrea Apr 11 '21 at 17:09
  • Sounds like you want to know how to sort the lines by some combined columns. Correct? – martineau Apr 11 '21 at 17:13
  • I mean, I need the 2nd plus 3rd columns to be unique. They are treated like 1 column for the purpose of uniqueness. I have no idea how to do this so I can't make an honest attempt. – Chris Apr 11 '21 at 17:13
  • Related: [Removing duplicates in lists](https://stackoverflow.com/q/7961363/4518341) – wjandrea Apr 11 '21 at 17:14
  • Why is `hamster pen` before `hamster lolly` in your desired output? They sort in the opposite order. – wjandrea Apr 11 '21 at 17:18

3 Answers

I would propose the following solution:

lines = ["C: fish house bridge chocolate",
         "C: hamster pen flower penny",
         "C: dog park car paper",
         "C: hamster pen bus tank",
         "C: hamster lolly stick shirt"]

lines = sorted(lines, key=lambda line: line.split()[1])
seen = set()
for line in lines:
    key = tuple(line.split()[1:3])
    if key not in seen:
        print(line)
        seen.add(key)

This prints

C: dog park car paper
C: fish house bridge chocolate
C: hamster pen flower penny
C: hamster lolly stick shirt

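The combined 2nd and 3rd columns of each line are now unique. The result differs from your desired output because the first line seen for each key is kept: the sort is stable and only looks at the second column, so `hamster pen flower penny` stays ahead of `hamster pen bus tank` and is the one that survives.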
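
If it helps, the same idea can be wrapped in a small reusable generator, which is roughly the Python counterpart of awk's `!seen[...]++`. This is only a sketch: `unique_by` is a name made up for illustration, not a standard-library function. Here the whole lines are sorted first (an assumption about what you want), which puts `hamster pen bus tank` ahead of `hamster pen flower penny`, so the bus line is the one kept.

```python
def unique_by(items, key):
    """Yield each item whose key has not been seen before (first occurrence wins)."""
    seen = set()
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item


lines = ["C: fish house bridge chocolate",
         "C: hamster pen flower penny",
         "C: dog park car paper",
         "C: hamster pen bus tank",
         "C: hamster lolly stick shirt"]

# Key on the 2nd and 3rd whitespace-separated columns, treated as one unit.
result = list(unique_by(sorted(lines), key=lambda line: tuple(line.split()[1:3])))
for line in result:
    print(line)
```

This keeps the same four lines as your desired output, with the two `hamster` lines in sorted order (`lolly` before `pen`).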

wuerfelfreak
  • Note that if you do `lines = sorted(lines)`, the right lines are there in the output, but 3 and 4 are out of order compared to OP, which makes more sense IMO. – wjandrea Apr 11 '21 at 17:16
  • Thank you. While I am not using this solution, it inspired my own. ```seen = [] for line in lines: 2 = line.split()[1] 3 = line.split()[2] check = (2) + " " + (3) if check not in seen: print (check) seen.append(check)``` – Chris Apr 11 '21 at 17:29
  • @Chris Please post that as an answer. Code formatting in comments is neutered by design. (And yes, you can answer your own question.) – wjandrea Apr 11 '21 at 17:29

You can take advantage of dictionary keys to deduplicate the lines: assigning to a key that already exists replaces the line stored under it, so each key ends up with exactly one line.

lines = ["C: fish house bridge chocolate",
         "C: hamster pen flower penny",
         "C: dog park car paper",
         "C: hamster pen bus tank",
         "C: hamster lolly stick shirt"]

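# Key on the 2nd and 3rd columns together (slice [1:3]); a tuple avoids accidental
# collisions that joining the words into one string could cause.
d = {tuple(line.split()[1:3]): line for line in sorted(lines, key=lambda v: v.split()[1])}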
for line in d.values():
    print(line)

Output:

C: dog park car paper
C: fish house bridge chocolate
C: hamster pen bus tank
C: hamster lolly stick shirt
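
The reason `hamster pen bus tank` lands in the third position, where `hamster pen flower penny` would otherwise be, is that overwriting a dict value does not move the key. A minimal illustration with made-up keys and values, assuming a Python version where dicts preserve insertion order (3.7+):

```python
d = {}
d["pen"] = "first"
d["lolly"] = "second"
d["pen"] = "third"  # replaces the value; "pen" keeps its original position

items = list(d.items())
print(items)  # [('pen', 'third'), ('lolly', 'second')]
```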
wjandrea
Henry Ecker
  • Note that dicts are only guaranteed to preserve order [in Python 3.7+ and CPython 3.6](https://stackoverflow.com/a/39980744/4518341) – wjandrea Apr 11 '21 at 17:21

Inspired by @wuerfelfreak I decided to do it like this...

seen = []
for line in lines:
    cols = line.split()  # split once and reuse the result
    check = cols[1] + " " + cols[2]  # 2nd and 3rd columns as one key
    if check not in seen:
        print(line)
        seen.append(check)
Chris
  • This is essentially the same as @wuerfelfreak's solution except using a list, which is less performant, i.e. O(n) lookups vs O(1) using a set. – wjandrea Apr 11 '21 at 17:51
  • I got a different output after fixing the variable names. Is `lines` supposed to be sorted first? – wjandrea Apr 11 '21 at 18:02