Python unique values per column in csv file row

Question

I've been stuck on this for a long time. Is there an easy way, using NumPy or pandas or by fixing my code, to get the unique values within each column of a row, where the values are separated by "|"?

For example, given the data:

"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft|ft|ft","2003|207|212|212|212","qa|admin,co|master|NULL|NULL"
"2","john","doe","htw","2000","dev"

Output should be:

"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft","2003|207|212","qa|admin,co|master|NULL"
"2","john","doe","htw","2000","dev"

My broken code:

import csv
import pprint

your_list = csv.reader(open('out.csv'))
your_list = list(your_list)

#pprint.pprint(your_list)
string = "|"
cols_no=6
for line in your_list:
    i=0
    for col in line:
      if i==cols_no:
        print "\n" 
        i=0
      if string in col:
        values = col.split("|")
        myset = set(values)
        items = list()
        for item in myset:
          items.append(item)
        print items
      else:
        print col+",",
      i=i+1

It outputs:

id, fname, lname, education, gradyear, attributes, 1, john, smith, ['harvard', 'ft', 'mit']
['2003', '212', '207']
['qa', 'admin,co', 'NULL', 'master']
2, john, doe, htw, 2000, dev,

Thanks in advance!

Have a look at http://stackoverflow.com/questions/39504079/take-column-of-string-data-in-pandas-dataframe-and-split-into-separate-columns and http://stackoverflow.com/questions/39500258/pandas-how-to-get-the-unique-values-of-a-column-that-contains-a-list-of-values — danio, Sep 15 '16 at 11:30

Jon Clements · Accepted Answer · 2016-09-15T11:43:41.920

numpy/pandas is a bit overkill for what you can achieve with csv.DictReader and csv.DictWriter together with a collections.OrderedDict, e.g.:

import csv
from collections import OrderedDict

# Python 3. On Python 2.x, drop newline='' and use open('output.csv', 'wb') instead
with open('input.csv', newline='') as fin, open('output.csv', 'w', newline='') as fout:
    csvin = csv.DictReader(fin)
    csvout = csv.DictWriter(fout, fieldnames=csvin.fieldnames, quoting=csv.QUOTE_ALL)
    csvout.writeheader()
    for row in csvin:
        for k, v in row.items():
            row[k] = '|'.join(OrderedDict.fromkeys(v.split('|')))
        csvout.writerow(row)

Gives you:

"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft","2003|207|212","qa|admin,co|master|NULL"
"2","john","doe","htw","2000","dev"

Julien · Answer 2 · 2016-09-15T11:55:50.147

If you don't care about the order of the items separated by |, this will work:

lst = ["id","fname","lname","education","gradyear","attributes",
"1","john","smith","mit|harvard|ft|ft|ft","2003|207|212|212|212","qa|admin,co|master|NULL|NULL",
"2","john","doe","htw","2000","dev"]

def no_duplicate(string):
    return "|".join(set(string.split("|")))

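# list() makes this work on Python 3 as well, where map() returns an iterator
result = list(map(no_duplicate, lst))

print(result)

Result (the order within each |-separated field may vary, because sets are unordered):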

['id', 'fname', 'lname', 'education', 'gradyear', 'attributes', '1', 'john', 'smith', 'ft|harvard|mit', '2003|207|212', 'NULL|admin,co|master|qa', '2', 'john', 'doe', 'htw', '2000', 'dev']

edited Sep 15 '16 at 11:55

answered Sep 15 '16 at 11:31


If you do care about the order, you can use http://stackoverflow.com/a/480227/12663 instead of set() inside no_duplicate – danio Sep 15 '16 at 11:41
Thanks for your answer – dev Sep 15 '16 at 11:45
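
As a sketch of what danio's comment suggests, `no_duplicate` can track already-seen items in a set while building the result in a list, which keeps the first occurrence of each item in its original position. The function name `no_duplicate_ordered` and the shortened input list are assumptions for illustration, and this is one common order-preserving idiom, not necessarily the exact code from the linked answer.

```python
def no_duplicate_ordered(string, sep="|"):
    """Remove repeated sep-separated items, keeping first-seen order."""
    seen = set()   # fast membership test for items already kept
    kept = []      # items in the order they first appeared
    for item in string.split(sep):
        if item not in seen:
            seen.add(item)
            kept.append(item)
    return sep.join(kept)


# A few fields taken from the question's sample data.
lst = [
    "mit|harvard|ft|ft|ft",
    "2003|207|212|212|212",
    "qa|admin,co|master|NULL|NULL",
    "htw",
]

result = [no_duplicate_ordered(s) for s in lst]
print(result)
# ['mit|harvard|ft', '2003|207|212', 'qa|admin,co|master|NULL', 'htw']
```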
