Python: Create set by removing duplicates in text processing?

Question

Let's say I have a text file with two columns, like below:

A "
A "
A l
A "
C r
C "
C l
D a
D "
D "
D "
D d
R "
R "
R "
R " 
S "
S "
S o
D g
D "
D "
D "
D j
A "
A "
A z

I would like to retrieve the information like below:

list1= {A:l}, {C:r,l}, {D:a,d}, {S:o}
final_list= {A:l}, {C:r,l}, {D:a,d}, {R:}, {S:o}

I understand that I have to read the text file line by line and use line.strip().split(), but after that I don't know how to proceed.

I cannot understand the logic. Why was `{D: a}` skipped? What have you done so far? — awesoon, Feb 11 '16 at 13:57
Sorry, I missed it. I have updated the question and will add the script I tried. — Rangooski, Feb 11 '16 at 13:59
Since you want two dictionaries, `list1` and `final_list`, perhaps work on both at the same time? — dwanderson, Feb 11 '16 at 14:01
@dwanderson: yes, `list1` will be compared with some other dictionaries. — Rangooski, Feb 11 '16 at 14:04

dwanderson · Answer 1 · 2016-02-11T14:36:01.483

1

import collections
list1 = collections.defaultdict(set)
final_list = collections.defaultdict(set)
for line in filetext:  # filetext: the open file, or a list of its lines
    key, value = line.strip().split()
    final_list[key].add(value)
    if value != '"':
        list1[key].add(value)

This is slightly different in that final_list will contain the `"` placeholder as an element (not only the letters); this doesn't match what you asked for, so let's alter it a little:

import collections
list1 = collections.defaultdict(set)
final_list = {}
for line in filetext:  # filetext: the open file, or a list of its lines
    key, value = line.strip().split()
    if key not in final_list:
        final_list[key] = set()
    if value != '"':
        list1[key].add(value)
final_list.update(list1)

This should give you what you want: every key exists in final_list, with empty sets for keys like R that have no letter values.
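
For reference, the second snippet can be made self-contained. The following is a minimal sketch, not the exact code above: the function name `collect` and the inlined `SAMPLE` rows (the question's data up to `S o`) are assumptions made so that it runs without a file, and it skips blank or malformed lines, which the original does not.

```python
import collections

# Hypothetical sample: the question's rows up to "S o",
# inlined so the sketch runs without a file.
SAMPLE = '''A "
A "
A l
A "
C r
C "
C l
D a
D "
D "
D "
D d
R "
R "
R "
R "
S "
S "
S o
'''


def collect(lines):
    """Return (list1, final_list) as dicts mapping key -> set of values."""
    list1 = collections.defaultdict(set)
    final_list = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts
        final_list.setdefault(key, set())  # every key exists, even if empty
        if value != '"':                   # '"' is only a placeholder
            list1[key].add(value)
    final_list.update(list1)
    return dict(list1), final_list


list1, final_list = collect(SAMPLE.splitlines())
print(list1)
print(final_list)
```

With these rows, `R` appears only in `final_list`, mapped to an empty set, while `list1` holds only the keys that collected at least one letter.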

edited Feb 11 '16 at 14:36

answered Feb 11 '16 at 14:03

dwanderson


In the second snippet, the second `if` shows an indentation error. – Rangooski Feb 11 '16 at 14:25
Which line has an indentation error? I intentionally put `final_list.update` after all the lines because you only need to do it once, at the end of the file. If it's something else, just let me know and I'll fix it – dwanderson Feb 11 '16 at 14:26
`import collections list1 = collections.defaultdict(set) final_list = {} with open('test.txt', 'r') as f: for line in f: ## assuming youve opened it, read it in key, values = line.strip().split() if key not in final_list: final_list[key] = set() if values: list1[key].add(values) final_list.update(list1) print(list1) print(final_list)` – Rangooski Feb 11 '16 at 14:28
final_list = `{'A': {'"'}, 'C': {'r'}, 'D': {'a'}, 'R': {'"'}, 'S': {'"'}}`, – Rangooski Feb 11 '16 at 14:31
final_list = `{'A': {'"'}, 'C': {'r'}, 'D': {'a'}, 'R': {'"'}, 'S': {'"'}}` & list1 = `defaultdict(set, {'A': {'"'}, 'C': {'r'}, 'D': {'a'}, 'R': {'"'}, 'S': {'"'}})` – Rangooski Feb 11 '16 at 14:33
Oh, instead of `if value:` do `if value != '"':`; I'll fix that – dwanderson Feb 11 '16 at 14:35

But your `final_list` is actually a dictionary :) Don't do that. – Alex Belyaev Feb 11 '16 at 14:42
Oh yeah, better variable names, for sure. Just sticking with OP so it's easier for them to follow, but I agree with you – dwanderson Feb 11 '16 at 14:46
@Alex Belyaev Oh yes, I forgot that. Shall I convert `final_list` to a list afterwards? Is that possible? – Rangooski Feb 11 '16 at 14:51
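
On the last question in this thread: a dict of sets can be converted to a list afterwards. A rough sketch, assuming Python 3.7+ (where a plain dict keeps insertion order); the `final_list` literal and the names `as_list` and `non_empty` are made up for illustration.

```python
# Hypothetical result in the shape produced by the answer above.
final_list = {'A': {'l'}, 'C': {'r', 'l'}, 'D': {'a', 'd'}, 'R': set(), 'S': {'o'}}

# One single-key dict per entry. sorted() turns each set into a list with
# a deterministic (alphabetical) order, since sets themselves are unordered.
as_list = [{key: sorted(values)} for key, values in final_list.items()]

# Only the keys that collected at least one value.
non_empty = [{key: sorted(values)} for key, values in final_list.items() if values]

print(as_list)
print(non_empty)
```

Note that the values come out alphabetically here, not in file order; if file order of the values matters, collecting them in lists rather than sets avoids the problem.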

score 1 · Answer 2 · edited May 23 '17 at 12:23

1

If the order of dicts in final_list DOESN'T matter:

from collections import defaultdict

with open('/home/bwh1te/projects/stackanswers/wordcount/data.txt') as f:
    occurencies = defaultdict(list)
    for line in f:
        key, value = line.strip().split()
        # Looking up occurencies[key] in this condition
        # auto-creates the key in the dict (so R gets an empty list)
        if value not in occurencies[key] and value.isalpha():
            occurencies[key].append(value)

# defaultdict(<class 'list'>, {'C': ['r', 'l'], 'D': ['a', 'd'], 'S': ['o'], 'A': ['l'], 'R': []})
# Use it like a simple dictionary

# If it must be a list, not a dict:
final_list = [{key: value} for key, value in occurencies.items()]
# [{'C': ['r', 'l']}, {'D': ['a', 'd']}, {'S': ['o']}, {'A': ['l']}, {'R': []}]

If the order of dicts in final_list DOES matter:

from collections import OrderedDict

with open(file_path) as f:  # file_path: path to your data file
    occurencies = OrderedDict()
    for line in f:
        key, value = line.strip().split()
        # Create each key anyway
        if key not in occurencies:
            occurencies[key] = []        
        if value.isalpha():
            if value not in occurencies[key]:
                occurencies[key].append(value)

# OrderedDict([('A', ['l']), ('C', ['r', 'l']), ('D', ['a', 'd']), ('R', []), ('S', ['o'])])

# If it must be a list, not a dict:
final_list = [{key: value} for key, value in occurencies.items()]
# [{'A': ['l']}, {'C': ['r', 'l']}, {'D': ['a', 'd']}, {'R': []}, {'S': ['o']}]

list1 = [{key: value} for key, value in occurencies.items() if value]
# [{'A': ['l']}, {'C': ['r', 'l']}, {'D': ['a', 'd']}, {'S': ['o']}]

Or you can implement a hybrid of OrderedDict and defaultdict, as described in: Can I do an ordered, default dict in Python? :)
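
One common way to build such a hybrid is to subclass `OrderedDict` and define `__missing__`. This is only a sketch of that idea, not necessarily what the linked question recommends; the class name `OrderedDefaultDict` and the sample `rows` are hypothetical.

```python
from collections import OrderedDict


class OrderedDefaultDict(OrderedDict):
    """Hypothetical helper: an OrderedDict that creates missing keys on lookup."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        if self.default_factory is None:
            raise KeyError(key)
        value = self[key] = self.default_factory()
        return value


rows = ['A "', 'A l', 'C r', 'C "', 'C l', 'R "', 'S o']  # made-up sample rows
occurrences = OrderedDefaultDict(list)
for row in rows:
    key, value = row.split()
    bucket = occurrences[key]  # creates the key on first lookup
    if value.isalpha() and value not in bucket:
        bucket.append(value)

print(list(occurrences.items()))
```

On Python 3.7 and later, a plain `defaultdict` also keeps insertion order, so a helper like this is mainly useful on older versions or when `OrderedDict`-specific methods such as `move_to_end` are needed.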

edited May 23 '17 at 12:23

Community


answered Feb 11 '16 at 14:37

Alex Belyaev


The order matters here. In `for all in list1`, I will compare `final_list[-1]` & `final_list[1]`. – Rangooski Feb 11 '16 at 14:41
@Rangooski Okay... Should it preserve the order of the file records or sort alphabetically? – Alex Belyaev Feb 11 '16 at 14:45
It should be in the order of the file records, not alphabetical. – Rangooski Feb 11 '16 at 14:46
Thank you so much for the answer. I will try to learn this concept of OrderedDict. – Rangooski Feb 11 '16 at 15:03
`final_list` didn't give the expected result. It gives `final_list = [{'A': ['l']}, {'C': ['r', 'l']}, {'D': ['a', 'd']}, {'S': ['o']}]` – Rangooski Feb 11 '16 at 15:33
@Rangooski emm... what is the difference? Should there be an 'empty' dict for 'R'? – Alex Belyaev Feb 12 '16 at 10:31
But `list1` is not here. `final_list = [{'A': ['l']}, {'C': ['r', 'l']}, {'D': ['a', 'd']}, {'R': []}, {'S': ['o']}]` `occurencies = OrderedDict([('A', ['l']), ('C', ['r', 'l']), ('D', ['a', 'd']), ('R', []), ('S', ['o'])])` – Rangooski Feb 12 '16 at 10:47
Oh, it wasn't clear... I thought that `list1` is some temporary list to create a `final_list` :) And what do you expect to see in `list1`? – Alex Belyaev Feb 12 '16 at 10:50
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/103293/discussion-between-rangooski-and-alex-belyaev). – Rangooski Feb 12 '16 at 12:06
