python: extract items of different lists and put them in one set

Question

I have a file like this:

93.93.203.11|["['vmit.it', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'maurominnella.com']"]
168.144.9.16|["['iipmalumni.com','webdesignhostingindia.com', 'iipmstudents.in', 'iipmclubs.in']"]
195.211.72.88|["['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']"]
129.35.210.118|["['israelinnovation.co.il', 'watec-peru.com', 'bsacimeeting.org', 'wsava2015.com', 'picsmeeting.com']"]

I want to extract the domains from all the lists and add them to one set. Ultimately, I would like to have a file with each unique domain on its own line. Here is the code I have written:

set_d = set()
f = open(file,'r')
for line in f:
    line = line.strip('\n')
    ip,list = line.split('|')
    l = json.loads(list)
    for e in l:
        domain = e.split(',')
        set_d.add(domain)
        print set_d

but it gives the below error:

    set_d.add(domain)
TypeError: unhashable type: 'list'

Can anybody help me out?

Maybe this is helpful: http://stackoverflow.com/questions/1306631/python-add-list-to-set — Meng Wang, Feb 14 '15 at 20:19

score 1 · Answer 1 · answered Feb 14 '15 at 20:23


You should call update instead of add:

set_d.update(domain)

Example:

>>> set_d = {'a', 'b', 'c'}
>>> set_d.update(['c', 'd', 'e'])
>>> print set_d
{'a', 'b', 'c', 'd', 'e'}
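
To make the difference between add and update concrete, here is a minimal sketch. It assumes Python 3 syntax, and the variable names are illustrative only: add expects a single hashable item, while update takes an iterable and adds each of its elements.

```python
# Illustrative sketch: add() takes one hashable item, update() takes an iterable.
domains = set()

try:
    domains.add(['vmit.it', 'telestreet.it'])      # a list is unhashable
except TypeError as exc:
    error_message = str(exc)                        # same TypeError as in the question

domains.update(['vmit.it', 'telestreet.it'])        # adds each element separately
domains.update(['telestreet.it', 'iipmclubs.in'])   # duplicates are ignored

print(sorted(domains))
```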

Ozgur Vatansever

Padraic Cunningham · Accepted Answer · 2015-02-14T21:28:35.957

Use str.translate to clean the text and add to the set using update:

set_d = set()
with open(file,'r') as f:
    for line in f:
        lst = (x.strip() for x in line.split("|")[1].translate(None,"\"'[]").split(","))
        set_d.update(lst)

outputs a unique set of individual domains:

set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'watec-peru.com', 'bsacimeeting.org', 'webdesignhostingindia.com', 'wsava2015.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'iipmalumni.com', 'iipmclubs.in', 'israelinnovation.co.il'])

which you can write to a new file:

set_d = set()
with open(file,'r') as f,open("out.txt","w") as out:
    for line in f:
        lst = (x.strip() for x in line.split("|")[1].translate(None,"\"'[]").split(","))
        set_d.update(lst)
    for line in set_d:
        out.write("{}\n".format(line))

The output:

$ cat out.txt 
vmit.it
tcmpraktijk-jingshen.nl
umbertominnella.it
studioguizzardi.it
telestreet.it
watec-peru.com
bsacimeeting.org
webdesignhostingindia.com
wsava2015.com
iipmstudents.in
maurominnella.com
ellen-siemer.nl
picsmeeting.com
iipmalumni.com
iipmclubs.in
israelinnovation.co.il
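
The two-argument translate(None, chars) form used above only exists on Python 2 str. As a sketch of the same cleanup under Python 3 (an assumption about your interpreter), str.maketrans with a third argument builds a table that deletes those characters. The extract_domains helper name is hypothetical, and the sample lines are copied from the question.

```python
# Python 3 sketch of the translate-based cleanup; extract_domains is a
# hypothetical helper name, and the sample lines are copied from the question.
DELETE_TABLE = str.maketrans("", "", "\"'[]")   # third argument: characters to delete

def extract_domains(lines):
    """Return the set of unique domains found after the '|' on each line."""
    domains = set()
    for line in lines:
        if "|" not in line:
            continue                             # skip blank or malformed lines
        raw = line.split("|", 1)[1].translate(DELETE_TABLE)
        # Strip whitespace and drop empty pieces left by stray commas.
        domains.update(part.strip() for part in raw.split(",") if part.strip())
    return domains

sample = [
    "168.144.9.16|[\"['iipmalumni.com','webdesignhostingindia.com', 'iipmstudents.in', 'iipmclubs.in']\"]\n",
    "195.211.72.88|[\"['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']\"]\n",
]
print(sorted(extract_domains(sample)))
```

Because every quote and bracket is deleted before splitting, the stray quote in the 195.211.72.88 line does not cause a problem with this approach.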

Your code will not separate the text into individual domains, and your json call does not really help. Changing your code to use update will output something like the following:

{" 'maurominnella.com']", " 'wsava2015.com'", "'webdesignhostingindia.com'", " 'iipmclubs.in']", " 'ellen-siemer.nl'']", " 'umbertominnella.it'", " 'picsmeeting.com']", "['israelinnovation.co.il'", "['vmit.it'", " 'iipmstudents.in'", "['tcmpraktijk-jingshen.nl'", " 'studioguizzardi.it'", "['iipmalumni.com'", " 'watec-peru.com'", " 'bsacimeeting.org'", " 'telestreet.it'"}

Also, don't use list as a variable name, as it shadows the built-in Python list.
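
A short sketch of what shadowing means here (the function name is illustrative): once list is rebound to a string, the built-in type is no longer reachable under that name in that scope.

```python
def shadowing_demo():
    list = "['vmit.it', 'telestreet.it']"   # rebinds the name inside this function
    try:
        list("abc")                          # now calls a str, not the built-in type
    except TypeError:
        return "list is shadowed"
    return "list still works"

result = shadowing_demo()
print(result)
```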

Mazdak · Answer 3 · 2015-02-14T21:12:48.510


Because the result of split (domain = e.split(',')) is a list, and lists are unhashable, you can't add it to a set. Instead, you can add its elements to your set with set.update(). You also don't need json here: it doesn't separate your domains and doesn't give you the desired result. Instead, you can use ast.literal_eval to parse your list:

import ast
set_d = set()
f = open(file,'r')
for line in f:
    line = line.strip('\n')
    ip,li = line.split('|')
    l = ast.literal_eval(ast.literal_eval(li)[0])
    for e in l:
        domain = e.split(',')
        set_d.update(domain)
    print set_d

Note: don't use the names of Python built-in functions or types as your variable names!
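
As a sketch of why literal_eval is applied twice, the snippet below parses one line (shortened from the question's first sample line). The parse_line helper name is hypothetical. The outer call yields a list holding a single string, and the inner call turns that string into the actual list of domains.

```python
import ast

# Sample line shortened from the question; parse_line is a hypothetical helper name.
line = "93.93.203.11|[\"['vmit.it', 'umbertominnella.it', 'studioguizzardi.it']\"]"

def parse_line(text):
    """Return the list of domains encoded after the '|' separator."""
    _ip, payload = text.strip().split("|", 1)
    outer = ast.literal_eval(payload)      # -> list holding one string
    return ast.literal_eval(outer[0])      # -> that string parsed into a list

print(parse_line(line))
```

One caveat: the 195.211.72.88 line in the question contains a stray quote ('ellen-siemer.nl''), so the inner literal_eval fails to parse it. Either wrap the call in a try/except or strip the quote characters first.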

As a more efficient alternative, you can use a regex to grab your domains:

f = open(file,'r').read()
import re
print set(re.findall(r'[a-zA-Z\-]+\.[a-zA-Z]+',f))

result:

set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'israelinnovation.co', 'bsacimeeting.org', 'webdesignhostingindia.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'watec-peru.com', 'iipmalumni.com', 'iipmclubs.in'])
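
Note that the result above is missing wsava2015.com and truncates israelinnovation.co.il to israelinnovation.co, because the pattern allows no digits and only a single dot. A hedged sketch of a more permissive pattern follows. DOMAIN_RE and the shortened sample text are illustrative assumptions, written for Python 3, and the pattern is not a full validator for domain names.

```python
import re

# Illustrative pattern: one or more dot-separated labels of letters, digits
# and hyphens, ending in an alphabetic label of at least two characters.
DOMAIN_RE = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")

text = (
    "129.35.210.118|[\"['israelinnovation.co.il', 'watec-peru.com', "
    "'wsava2015.com']\"]\n"
)
found = set(DOMAIN_RE.findall(text))
print(sorted(found))
```

Requiring the final label to be alphabetic keeps the IP address at the start of each line from being picked up as a domain.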

edited Feb 14 '15 at 21:12

answered Feb 14 '15 at 20:22

Mazdak


Thanks Kasra. How can I get rid of the json object notation, "u"? – UserYmY Feb 14 '15 at 20:28
@Mee You need to encode to ascii; check out the edit – Mazdak Feb 14 '15 at 20:35
@PadraicCunningham I just resolved the OP's current issue! If the OP has another problem and tells us about it, I will remove those too! Also, I added another way with regex – Mazdak Feb 14 '15 at 20:52
@PadraicCunningham What's the problem with ignore? The OP just wants to remove `u` – Mazdak Feb 14 '15 at 20:55
@PadraicCunningham OK, fixed! – Mazdak Feb 14 '15 at 21:13
@PadraicCunningham I use literal_eval twice: first I take the first element of the list, which is a string representation of a list, then I convert that to a list again! Also, I think literal_eval is meant for this purpose and is more efficient – Mazdak Feb 14 '15 at 21:17
