How to sort a list of lines based on an nth column?

Question

In Python I can sort a list like this...

lines = ["C: fish house bridge chocolate",
         "C: hamster pen flower penny",
         "C: dog park car paper",
         "C: hamster pen bus tank",
         "C: hamster lolly stick shirt"]

lines = sorted(lines)
for line in lines:
    print (line)

Gives...

C: dog park car paper
C: fish house bridge chocolate
C: hamster lolly stick shirt
C: hamster pen bus tank
C: hamster pen flower penny

I can also sort by a particular column...

lines = sorted(lines, key=lambda line: line.split()[1])
for line in lines:
    print (line)

Gives...

C: dog park car paper
C: fish house bridge chocolate
C: hamster pen flower penny
C: hamster pen bus tank
C: hamster lolly stick shirt

How can I remove lines so that the combined 2nd and 3rd columns of each line are unique?

Desired output would be...

C: dog park car paper
C: fish house bridge chocolate
C: hamster pen bus tank
C: hamster lolly stick shirt

In awk I could use something like !seen...

awk '!seen[$1][$2][$3]++'

What about in Python?

What does this have to do with sorting? It seems like you want to sort *then* remove duplicates, no? – wjandrea, Apr 11 '21 at 17:09
Sounds like you want to know how to sort the lines by some combined columns. Correct? – martineau, Apr 11 '21 at 17:13
I mean, I need the 2nd plus 3rd columns to be unique. They are treated like 1 column for the purpose of uniqueness. I have no idea how to do this so I can't make an honest attempt. – Chris, Apr 11 '21 at 17:13
Related: [Removing duplicates in lists](https://stackoverflow.com/q/7961363/4518341) – wjandrea, Apr 11 '21 at 17:14
Why is `hamster pen` before `hamster lolly` in your desired output? They sort in the opposite order. – wjandrea, Apr 11 '21 at 17:18

wuerfelfreak · Answer 1 · 2021-04-11T17:50:50.657

3

I would propose the following solution:

lines = ["C: fish house bridge chocolate",
         "C: hamster pen flower penny",
         "C: dog park car paper",
         "C: hamster pen bus tank",
         "C: hamster lolly stick shirt"]

lines = sorted(lines, key=lambda line: line.split()[1])
seen = set()
for line in lines:
    key = tuple(line.split()[1:3])
    if key not in seen:
        print(line)
        seen.add(key)

This prints

C: dog park car paper
C: fish house bridge chocolate
C: hamster pen flower penny
C: hamster lolly stick shirt

The combined 2nd and 3rd columns of each line are unique, but the result differs from your desired output because the first line seen for each key is kept rather than the last.
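If the same filtering is needed for other columns, the loop can be wrapped in a small generator. This is only a sketch: the name `unique_by_columns` and its `columns` parameter are made up for illustration, and it assumes every line has at least as many whitespace-separated columns as the largest index requested.

```python
def unique_by_columns(lines, columns=(1, 2)):
    """Yield each line whose selected columns have not been seen before.

    `columns` holds zero-based indexes into line.split(); (1, 2) means
    the 2nd and 3rd columns. Hypothetical helper, not from the answer.
    """
    seen = set()
    for line in lines:
        fields = line.split()
        key = tuple(fields[i] for i in columns)
        if key not in seen:
            seen.add(key)
            yield line


lines = ["C: fish house bridge chocolate",
         "C: hamster pen flower penny",
         "C: dog park car paper",
         "C: hamster pen bus tank",
         "C: hamster lolly stick shirt"]

# Sort by the 2nd column first, then keep the first line for each key.
result = list(unique_by_columns(sorted(lines, key=lambda line: line.split()[1])))
for line in result:
    print(line)
```

Because the set lookup is a hash lookup, the filtering pass stays roughly linear in the number of lines; the sort is still the dominant cost.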

edited Apr 11 '21 at 17:50

answered Apr 11 '21 at 17:08

wuerfelfreak


1

Note that if you do `lines = sorted(lines)`, the right lines are there in the output, but 3 and 4 are out of order compared to OP, which makes more sense IMO. – wjandrea Apr 11 '21 at 17:16
Thank you. While I am not using this solution, it inspired my own. ```seen = [] for line in lines: 2 = line.split()[1] 3 = line.split()[2] check = (2) + " " + (3) if check not in seen: print (check) seen.append(check)``` – Chris Apr 11 '21 at 17:29
@Chris Please post that as an answer. Code formatting in comments is neutered by design. (And yes, you can answer your own question.) – wjandrea Apr 11 '21 at 17:29

score 1 · Answer 2 · edited Apr 11 '21 at 17:27

1

Dictionary keys are unique, so you can build a dict keyed on the columns that must be unique. A later line with the same key replaces the earlier one, which leaves one line per key.

lines = ["C: fish house bridge chocolate",
         "C: hamster pen flower penny",
         "C: dog park car paper",
         "C: hamster pen bus tank",
         "C: hamster lolly stick shirt"]

# Key on the 2nd and 3rd columns; a later line with the same key overwrites the earlier one.
d = {tuple(line.split()[1:3]): line for line in sorted(lines, key=lambda v: v.split()[1])}
for line in d.values():
    print(line)

Output:

C: dog park car paper
C: fish house bridge chocolate
C: hamster pen bus tank
C: hamster lolly stick shirt
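The reason `hamster pen bus tank` survives here is that a dict comprehension overwrites the value when a key repeats, while the key keeps the position of its first insertion. If the first line per key should win instead, `dict.setdefault` is one option. The sketch below assumes the key is the tuple of the 2nd and 3rd columns, as the question describes; the names `keep_last` and `keep_first` are only illustrative.

```python
lines = ["C: fish house bridge chocolate",
         "C: hamster pen flower penny",
         "C: dog park car paper",
         "C: hamster pen bus tank",
         "C: hamster lolly stick shirt"]

ordered = sorted(lines, key=lambda line: line.split()[1])

# Later assignments overwrite earlier ones, so the last line per key survives.
keep_last = {tuple(line.split()[1:3]): line for line in ordered}

# setdefault stores a value only when the key is new, so the first line survives.
keep_first = {}
for line in ordered:
    keep_first.setdefault(tuple(line.split()[1:3]), line)

print(list(keep_last.values()))
print(list(keep_first.values()))
```

Both variants rely on dicts preserving insertion order, which is guaranteed from Python 3.7; on older versions `collections.OrderedDict` would be the safer choice.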

edited Apr 11 '21 at 17:27

wjandrea


answered Apr 11 '21 at 17:14

Henry Ecker


1

Note that dicts are only guaranteed to preserve order [in Python 3.7+ and CPython 3.6](https://stackoverflow.com/a/39980744/4518341) – wjandrea Apr 11 '21 at 17:21

Chris · Answer 3 · 2021-04-11T18:08:06.410

0

Inspired by @wuerfelfreak I decided to do it like this...

seen = []
for line in lines:
    columns = line.split()
    check = columns[1] + " " + columns[2]  # combined 2nd and 3rd columns
    if check not in seen:
        print(line)
        seen.append(check)

edited Apr 11 '21 at 18:08

answered Apr 11 '21 at 17:31

Chris


1

This is essentially the same as @wuerfelfreak's solution except using a list, which is less performant, i.e. O(n) lookups vs O(1) using a set. – wjandrea Apr 11 '21 at 17:51
I got a different output after fixing the variable names. Is `lines` supposed to be sorted first? – wjandrea Apr 11 '21 at 18:02
