Deleting duplicates in a list of lists using a criteria

Question

I have a list of two-element lists: the first element is the word and the second is the file the word comes from. I need to merge entries that share the same word, so that each word appears once, followed by the names of all the files it occurs in. For example, input([['word1', 'F1.txt'], ['word1', 'F2.txt'], ['word2', 'F1.txt'], ['word2', 'F2.txt'], ['word3', 'F1.txt'], ['word3', 'F2.txt'], ['word4', 'F2.txt']]) should output [['word1', 'F1.txt', 'F2.txt'], ['word2', 'F1.txt', 'F2.txt'], ['word3', 'F1.txt', 'F2.txt'], ['word4', 'F2.txt']]. Can you give me some tips on how to do this?

@MichaelBorne how this new example is different from the first example? — Dani Mesejo, Sep 14 '18 at 19:31
Because now the set contains words only from different files, duplicate words from the same file were removed, the list is still ordered so now I need to make a new list which contains the word only once and also the names of the files that it's been in — Michael Borne, Sep 14 '18 at 19:34
Welcome to Stack Overflow! Please do not vandalize your posts. If you believe your question is not useful or is no longer useful, it should be deleted instead of editing out all of the data that actually makes it a question. By posting on the Stack Exchange network, you've granted a non-revocable right for SE to distribute that content (under the CC BY-SA 3.0 license). By SE policy, any vandalism will be reverted. — Glorfindel, Sep 16 '18 at 11:04

Dani Mesejo · Answer 1 · 2018-09-14T19:27:53.010

4

You could use a set and a defaultdict:

from collections import defaultdict


def remove_dups_pairs(lst):
    s = set(map(tuple, lst))
    d = defaultdict(list)
    for word, file in s:
        d[word].append(file)
    return [[key] + values for key, values in d.items()]


print(remove_dups_pairs([["fire", "elem.txt"], ["fire", "things.txt"], ["water", "elem.txt"], ["water", "elem.txt"], ["water", "nature.txt"]]))

Output

[['fire', 'elem.txt', 'things.txt'], ['water', 'elem.txt', 'nature.txt']]

As @ShmulikA mentioned, a set does not preserve ordering. If you need to preserve the order of the input, you can do it like this:

def remove_dups_pairs(lst):
    d = defaultdict(list)
    seen = set()
    for word, file in lst:
        if (word, file) not in seen:
            d[word].append(file)
            seen.add((word, file))

    return [[key] + values for key, values in d.items()]


print(remove_dups_pairs([["fire", "elem.txt"], ["fire", "things.txt"], ["water", "elem.txt"], ["water", "elem.txt"],
                         ["water", "nature.txt"]]))

Output (on Python 3.7+, where dicts preserve insertion order)

[['fire', 'elem.txt', 'things.txt'], ['water', 'elem.txt', 'nature.txt']]
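
As a quick check, the order-preserving version can be run against the revised input from the question. The sketch below repeats the function so that it runs on its own; the variable names `pairs` and `result` are only illustrative, and it assumes Python 3.7+, where dicts keep insertion order.

```python
from collections import defaultdict


def remove_dups_pairs(lst):
    d = defaultdict(list)
    seen = set()  # (word, file) pairs already recorded
    for word, file in lst:
        if (word, file) not in seen:
            d[word].append(file)
            seen.add((word, file))
    return [[key] + values for key, values in d.items()]


# Revised example input from the question
pairs = [['word1', 'F1.txt'], ['word1', 'F2.txt'], ['word2', 'F1.txt'],
         ['word2', 'F2.txt'], ['word3', 'F1.txt'], ['word3', 'F2.txt'],
         ['word4', 'F2.txt']]
result = remove_dups_pairs(pairs)
print(result)
# [['word1', 'F1.txt', 'F2.txt'], ['word2', 'F1.txt', 'F2.txt'], ['word3', 'F1.txt', 'F2.txt'], ['word4', 'F2.txt']]
```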

edited Sep 14 '18 at 19:27

answered Sep 14 '18 at 18:49


`for word, file in s:` does not guarantee insertion order. `for x in set('abc'): print(x)` outputs `b c a` – ShmulikA Sep 14 '18 at 19:02
I changed the condition a bit, now I have a set consisting of sets of 2 elements, the first element is still the word and the second one is the file from where the word comes from and now I need to append the name of the file to the word if the word is the same E.G. `input([['word1', 'F1.txt'], ['word1', 'F2.txt'], ['word2', 'F1.txt'], ['word2', 'F2.txt'], ['word3', 'F1.txt'], ['word3', 'F2.txt'], ['word4', 'F2.txt']])` should output `[['word1', 'F1.txt', 'F2.txt'], ['word2', 'F1.txt', 'F2.txt'], ['word3', 'F1.txt', 'F2.txt'], ['word4', 'F2.txt']]` Can you give me some tips on how to this? – Michael Borne Sep 14 '18 at 19:15
`def merge_file_dups(data): if len(data) != 0: i = 0 word = [] filenames = [data[0][1]] res = [] while i < len(data) - 1: if data[i][0] == data[i + 1][0]: filenames.append(data[i + 1][1]) else: filenames.append(data[i+1][1]) word.append(data[i][0]) word.append(filenames) res.append(word) word = [] filenames = [] i += 1 return res` – Michael Borne Sep 14 '18 at 19:21
look at this SO answer https://stackoverflow.com/a/45589593/7438048 why using set like this will not preserve the order as the OP requested – ShmulikA Sep 14 '18 at 19:22

score 2 · Accepted Answer · answered Sep 14 '18 at 20:37

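Alternatively, if you would rather not use defaultdict, you can do it as below. This assumes that entries for the same word are adjacent in data (for example, sorted by word):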

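# Example input from the question
data = [['word1', 'F1.txt'], ['word1', 'F2.txt'], ['word2', 'F1.txt'],
        ['word2', 'F2.txt'], ['word3', 'F1.txt'], ['word3', 'F2.txt'],
        ['word4', 'F2.txt']]

inner = [[]]
count = 0


def lookup(data, i, count):
    # Append every not-yet-seen file of later entries that share data[i]'s word
    for j in range(i + 1, len(data)):
        if data[i][0] == data[j][0] and data[j][1] not in inner[count]:
            inner[count].append(data[j][1])
    return inner


for i in range(len(data)):
    if data[i][0] in inner[count]:
        # Same word as the current group: pick up any remaining files
        inner = lookup(data, i, count)
    else:
        # New word: start a new group (the first group already exists)
        if i != 0:
            count += 1
            inner.append([])
        inner[count].append(data[i][0])
        inner[count].append(data[i][1])
        lookup(data, i, count)
print(inner)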
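
When the input is already grouped by word, as the question's comments say it is, the same result can also be sketched with `itertools.groupby` from the standard library. This is only one possible alternative, not part of the original answer; the helper name `merge_sorted_pairs` is hypothetical, and it assumes all pairs for a given word are adjacent.

```python
from itertools import groupby
from operator import itemgetter


def merge_sorted_pairs(data):
    # Hypothetical helper: assumes pairs with the same word are adjacent
    merged = []
    for word, group in groupby(data, key=itemgetter(0)):
        files = []
        for _, file_name in group:
            if file_name not in files:  # drop repeated files for this word
                files.append(file_name)
        merged.append([word] + files)
    return merged


data = [['word1', 'F1.txt'], ['word1', 'F2.txt'], ['word2', 'F1.txt'],
        ['word2', 'F2.txt'], ['word3', 'F1.txt'], ['word3', 'F2.txt'],
        ['word4', 'F2.txt']]
merged = merge_sorted_pairs(data)
print(merged)
# [['word1', 'F1.txt', 'F2.txt'], ['word2', 'F1.txt', 'F2.txt'], ['word3', 'F1.txt', 'F2.txt'], ['word4', 'F2.txt']]
```

If the input is not grouped, `groupby` starts a new group each time the word changes, so the data would need to be sorted by word first.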

ShmulikA · Answer 3 · 2018-09-14T19:25:44.773

Keeping insertion order using a set of seen items (this relies on dicts preserving insertion order, which is guaranteed from Python 3.7):

from collections import defaultdict

def remove_dups_pairs_ordered(lst):
    d = defaultdict(list)

    # stores the (word, file) pairs we have already seen
    seen = set()
    for item in lst:
        word, file = item
        key = (word, file)

        # skip (word, file) pairs we have already seen
        if key in seen:
            continue
        seen.add(key)
        d[word].append(file)

    # convert the dict word -> [f1, f2..] into 
    # a list of lists [[word1, f1,f2, ...], [word2, f1, f2...], ...]
    return [[word] + files for word, files in d.items()]

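lst = [
    ["fire", "elem.txt"], ["fire", "things.txt"],
    ["water", "elem.txt"], ["water", "elem.txt"],
    ["water", "nature.txt"]
]

print(remove_dups_pairs_ordered(lst))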

outputs:

[['fire', 'elem.txt', 'things.txt'], ['water', 'elem.txt', 'nature.txt']]

Without keeping the order, using defaultdict and set:

from collections import defaultdict

def remove_dups_pairs(lst):
    d = defaultdict(set)

    for item in lst:
        d[item[0]].add(item[1])
    return [[word] + list(files) for word, files in d.items()]

lst = [
    ["fire","elem.txt"], ["fire","things.txt"],
    ["water","elem.txt"], ["water","elem.txt"],
    ["water","nature.txt"]
]

print(remove_dups_pairs(lst))

outputs (the order of the file names may vary between runs, since sets are unordered):

[['fire', 'things.txt', 'elem.txt'], ['water', 'nature.txt', 'elem.txt']]
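
If a reproducible result is wanted without tracking insertion order, one option is to sort both the words and each word's files before building the output. This is a minimal sketch of that idea, not part of the original answer; the name `remove_dups_pairs_sorted` is hypothetical.

```python
from collections import defaultdict


def remove_dups_pairs_sorted(lst):
    d = defaultdict(set)
    for word, file in lst:
        d[word].add(file)  # the set drops repeated (word, file) pairs
    # Sort words and files so the result does not depend on set ordering
    return [[word] + sorted(d[word]) for word in sorted(d)]


lst = [
    ["fire", "elem.txt"], ["fire", "things.txt"],
    ["water", "elem.txt"], ["water", "elem.txt"],
    ["water", "nature.txt"]
]
sorted_result = remove_dups_pairs_sorted(lst)
print(sorted_result)
# [['fire', 'elem.txt', 'things.txt'], ['water', 'elem.txt', 'nature.txt']]
```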

I changed the condition a bit, now I have a set consisting of sets of 2 elements, the first element is still the word and the second one is the file from where the word comes from and now I need to append the name of the file to the word if the word is the same E.G. `input([['word1', 'F1.txt'], ['word1', 'F2.txt'], ['word2', 'F1.txt'], ['word2', 'F2.txt'], ['word3', 'F1.txt'], ['word3', 'F2.txt'], ['word4', 'F2.txt']])` should output `[['word1', 'F1.txt', 'F2.txt'], ['word2', 'F1.txt', 'F2.txt'], ['word3', 'F1.txt', 'F2.txt'], ['word4', 'F2.txt']]` Can you give me some tips on how to this? — Michael Borne, Sep 14 '18 at 19:15
I made a code but it's not working quite well `def merge_file_dups(data): if len(data) != 0: i = 0 word = [] filenames = [data[0][1]] res = [] while i < len(data) - 1: if data[i][0] == data[i + 1][0]: filenames.append(data[i + 1][1]) else: filenames.append(data[i+1][1]) word.append(data[i][0]) word.append(filenames) res.append(word) word = [] filenames = [] i += 1 return res` — Michael Borne, Sep 14 '18 at 19:23
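
The adjacent-comparison idea in the comment above can be made to work by comparing each pair with the last group built so far, rather than with the next element. The following is one possible repair, not the original poster's code: it assumes the input is sorted by word and returns the flat `[word, file1, file2, ...]` shape asked for in the question (an empty input gives an empty list).

```python
def merge_file_dups(data):
    res = []
    for word, file_name in data:
        if res and res[-1][0] == word:
            # Same word as the previous group: add the file if it is new
            if file_name not in res[-1][1:]:
                res[-1].append(file_name)
        else:
            # Different word: start a new [word, file] group
            res.append([word, file_name])
    return res


merged_files = merge_file_dups([['word1', 'F1.txt'], ['word1', 'F2.txt'],
                                ['word2', 'F1.txt'], ['word2', 'F2.txt'],
                                ['word3', 'F1.txt'], ['word3', 'F2.txt'],
                                ['word4', 'F2.txt']])
print(merged_files)
# [['word1', 'F1.txt', 'F2.txt'], ['word2', 'F1.txt', 'F2.txt'], ['word3', 'F1.txt', 'F2.txt'], ['word4', 'F2.txt']]
```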

siria · Answer 4 · 2018-09-15T08:04:11.860

It is possible to use an OrderedDict to solve this. It is a dictionary that iterates over its keys in the order in which they were added.

import collections

def remove_dups_pairs(data):
    word_files = collections.OrderedDict()
    for word, file_name in data:
        if word not in word_files:
            word_files[word] = [file_name]
        elif file_name not in word_files[word]:
            word_files[word].append(file_name)
    return [[word] + files for word, files in word_files.items()]


print(remove_dups_pairs([["fire", "elem.txt"], ["fire", "things.txt"],
                         ["water", "elem.txt"], ["water", "elem.txt"],
                         ["water", "nature.txt"]]))
print(remove_dups_pairs([['word1', 'F1.txt'], ['word1', 'F2.txt'],
                         ['word2', 'F1.txt'], ['word2', 'F2.txt'],
                         ['word3', 'F1.txt'], ['word3', 'F2.txt'],
                         ['word4', 'F2.txt']]))

Output:

[['fire', 'elem.txt', 'things.txt'], ['water', 'elem.txt', 'nature.txt']]
[['word1', 'F1.txt', 'F2.txt'], ['word2', 'F1.txt', 'F2.txt'], ['word3', 'F1.txt', 'F2.txt'], ['word4', 'F2.txt']]
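
On Python 3.7 and later, a plain dict also keeps insertion order, so the same approach can be sketched without OrderedDict. This variant is an assumption-labelled alternative rather than the answer's own code; the name `remove_dups_pairs_plain` is hypothetical, and `dict.setdefault` is used to create the file list on first sight of a word.

```python
def remove_dups_pairs_plain(data):
    word_files = {}  # plain dict: insertion-ordered on Python 3.7+
    for word, file_name in data:
        # Fetch the word's file list, creating it the first time the word is seen
        files = word_files.setdefault(word, [])
        if file_name not in files:
            files.append(file_name)
    return [[word] + files for word, files in word_files.items()]


plain_result = remove_dups_pairs_plain([["fire", "elem.txt"], ["fire", "things.txt"],
                                        ["water", "elem.txt"], ["water", "elem.txt"],
                                        ["water", "nature.txt"]])
print(plain_result)
# [['fire', 'elem.txt', 'things.txt'], ['water', 'elem.txt', 'nature.txt']]
```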
