3

I extract specific names from text using regex and similar tools. The result is a list of tuples containing titles and names, and the tuples may differ in length. lst below shows the possible scenarios. I need to remove duplicate names from the result. For example, ('Lord', 'Justice') should be treated as the same name as ('Lord', 'Justice', 'Smith'), and ('Lady', 'Smiles') as the same name as ('Lady', 'Justice', 'Smiles'), but ('Lord', 'Justice', 'Smith') and ('Lady', 'Justice', 'Smiles') are different names. The desired output for each element of lst is [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles')].

lst = [[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles')],
       [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice')],
       [('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'), ('Lady', 'Justice', 'Smiles')],
       [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice'), ('Lady', 'Justice', 'Smiles')],
       [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lady', 'Smiles')]]

This is what I have right now, but it doesn't yield the desired output. I would really appreciate your help and suggestions.

for l in lst:
    print(l)
    # remove duplicates based on the last index in tuples
    lst_1 = list(dict((v[-1],v) for v in sorted(l, key=lambda l: lst[0])).values())
    print(lst_1)
    # remove duplicates based on the second index [1] in tuples
    lst_2 = list(dict((v[1],v) for v in sorted(lst_1, key=lambda lst_1: lst_1[0])).values())    
    print(lst_2)
    print("\n")

UPDATE:

I was probably too specific in my examples. My real data includes other names, so the solution should also work when other names are present:

lst = [
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lady', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
]

Desired output:

[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
aviss
  • What about `('Justice', 'Lord')` and `('Lord', 'Justice')`? Are those equal? – Thom Wiggers Jul 26 '18 at 13:47
  • This shouldn't appear in the results based on how names are extracted. The titles always come first. – aviss Jul 26 '18 at 13:49
  • If the actual intent here is for comparing version numbers, you should know that's a [solved problem](https://stackoverflow.com/q/1714027/7851115) (e.g. using tuples instead of strings). –  Jul 26 '18 at 13:50
  • Not really, I need a list of unique names I have to use further down in my pipeline. – aviss Jul 26 '18 at 13:54

2 Answers

1

I came up with this solution:

from itertools import chain, groupby

lst = [
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'), ('Lady', 'Justice', 'Smiles')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice'), ('Lady', 'Justice', 'Smiles')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lady', 'Smiles')]
]

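def remove_duplicates(lst):
    rv = []
    # The inner groupby over the sorted list drops exact repeats;
    # the outer groupby then groups the remaining tuples by title (v[0]).
    for _, group in groupby([g for g, _ in groupby(sorted(lst))], key=lambda v: v[0]):
        # Keep the longest tuple found for each title.
        rv.append(max(group, key=len))
    return rv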


for option in lst:
    print(remove_duplicates(option))

Outputs:

[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]
[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]
[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]
[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]
[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]
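A note on the updated question data: because the function groups by title alone, every tuple that shares a title ends up in one group, so unrelated names are dropped as well. The sketch below repeats the function and runs it on the first list from the update; the variable name `updated` is only for illustration.

```python
from itertools import groupby


def remove_duplicates(names):
    rv = []
    # Sort, drop exact repeats, then group by title (the first element).
    for _, group in groupby([g for g, _ in groupby(sorted(names))], key=lambda v: v[0]):
        # Only the longest tuple per title survives.
        rv.append(max(group, key=len))
    return rv


# First list from the UPDATE section of the question.
updated = [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'),
           ('Lord', 'Other'), ('Lady', 'Another'),
           ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]

result = remove_duplicates(updated)
print(result)
# [('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
```

Only one name per title is left, so this approach fits the original two-name example but not the extended lists.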
Andrej Kesely
  • `lst` is a list of different options I might have when extracting names from text. So I need a script which will output the same result for all these options. – aviss Jul 26 '18 at 15:34
  • @aviss I modified my answer – Andrej Kesely Jul 26 '18 at 15:43
  • I was probably too specific in my examples. I had to include other names so the solution should work when there are other names present. I updated my question. – aviss Jul 26 '18 at 18:16
1

You can do this easily using itertools.groupby

from itertools import groupby

lst = [
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lady', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
]
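# groupby only merges consecutive tuples that share a title (x[0]);
# from each run keep the longest tuple (the last one on ties).
res = [[max(reversed(list(v)), key=len) for k,v in groupby(sl, lambda x: x[0])] for sl in lst]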
for l in res:
    print(l)

Output

[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
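
Note that the output above no longer contains ('Lady', 'Another'): it sits next to ('Lady', 'Diana', 'Spencer') in the input, so groupby puts both in one run and only the longer tuple is kept. If the rule is instead read as "a shorter tuple is a duplicate when its items appear, in the same order, inside a longer tuple" (an assumption inferred from the examples in the question, not something the question states), a minimal sketch could look like this. The names `is_subsequence`, `dedupe_names` and `sample` are made up for illustration.

```python
def is_subsequence(short, long):
    """Return True if every item of `short` appears in `long`, in the same order."""
    it = iter(long)
    # `item in it` advances the iterator, so matches must occur in order.
    return all(item in it for item in short)


def dedupe_names(names):
    """Drop exact repeats and tuples that are an ordered subsequence of a longer tuple."""
    unique = list(dict.fromkeys(names))  # removes exact repeats, keeps first-seen order
    return [
        name for name in unique
        if not any(
            len(other) > len(name) and is_subsequence(name, other)
            for other in unique
        )
    ]


# Third list from the UPDATE section of the question.
sample = [('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'), ('Lady', 'Justice', 'Smiles'),
          ('Lord', 'Other'), ('Lady', 'Another'),
          ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]

print(dedupe_names(sample))
# [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'),
#  ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
```

This compares every pair of tuples, which is fine for short lists. It would also merge a short form such as ('Lady', 'Smiles') into whichever longer name happens to contain it, so it may not suit text with two different people who share a title and surname.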
Sunitha