I'm new to Python and trying to do the following. I have a CSV file like the one below (input.csv):

a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand

I'd like to remove duplicates within each row to get the output below (output.csv):

a,v,s,f
china,usa, and uk,france
india,australia,usa,uk
japan,south africa,,new zealand

Notice that although 'usa' is repeated in two different rows, it is kept intact, unlike 'china' and 'japan', which are repeated within the same row.

I tried using OrderedDict from collections in the following way:

from collections import OrderedDict
out = open ("output.csv","w")
items = open("input.csv").readlines()
print >> out, list(OrderedDict.fromkeys(items))

but it moved all the data into a single row.

abn
  • readlines() reads all of the lines from the file when called with no argument. You should probably use readline() instead to read one line at a time, and do that for each line until you are done. – James H Nov 11 '14 at 04:32
  • @JamesH Thank you for the response. By `do that for each line until you are done`, do you mean that I should rerun the script manually for each line? – abn Nov 11 '14 at 04:34
  • Look at ray.dino's answer. That's a great solution that actually works fine with readlines() but does logically the same thing that I was getting at. – James H Nov 11 '14 at 04:35

2 Answers


This can be asked more specifically as "How to remove duplicate items from lists", for which there's an existing solution: Removing duplicates in lists
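As a minimal sketch of that list-level idea (the helper name unique_in_order is made up for illustration), OrderedDict.fromkeys can drop repeated items while keeping the first occurrence of each in its original order. It also shows why applying it to the list of lines, as in the question, changes nothing unless two whole lines are identical:

```python
from collections import OrderedDict


def unique_in_order(items):
    """Return the items with later repeats dropped, first occurrences kept in order."""
    # Dictionary keys are unique, and OrderedDict remembers insertion order.
    return list(OrderedDict.fromkeys(items))


# Deduplicating the fields of one row removes the second 'china'.
fields = "china,usa,china,uk,france".split(",")
print(unique_in_order(fields))

# Deduplicating whole lines leaves them untouched, because the lines differ.
lines = ["a,v,s,f\n", "china,usa,china,uk,france\n"]
print(unique_in_order(lines))
```

Note that dropping items this way shortens the row, so the remaining values shift into earlier columns.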

So, assuming that your CSV file (items.csv) looks like this:

a,v,s,f
china,usa,china,uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand

I intentionally changed "china and uk" in line 2 to "china,uk"; see the note below.

Then the script to remove duplicates could be:

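with open('items.csv', 'r') as csv_file:
    for line in csv_file:
        # Strip the trailing newline, split into fields, and drop duplicates.
        # Note: a set does not preserve the original field order.
        print(list(set(line.rstrip('\n').split(','))))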

Note: If the 2nd line really does contain "china and uk", you'd have to do something other than processing the file as a plain CSV.
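If the column positions matter, one possible variation is to blank out a repeated cell instead of dropping it. The sketch below uses the standard csv module on an in-memory copy of the question's data; the helper name blank_duplicate_cells is hypothetical, and it only catches cells that repeat exactly, so "china and uk" is left untouched:

```python
import csv
import io


def blank_duplicate_cells(row):
    """Replace cells that exactly repeat an earlier cell in the row with ''."""
    seen = set()
    result = []
    for cell in row:
        if cell in seen:
            result.append("")  # keep the column, drop the repeated value
        else:
            seen.add(cell)
            result.append(cell)
    return result


# In-memory stand-in for input.csv (same content as in the question).
sample = io.StringIO(
    "a,v,s,f\n"
    "china,usa,china and uk,france\n"
    "india,australia,usa,uk\n"
    "japan,south africa,japan,new zealand\n"
)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(sample):
    writer.writerow(blank_duplicate_cells(row))

print(out.getvalue())
```

To work on real files, the two StringIO objects could be swapped for open("input.csv", newline="") and open("output.csv", "w", newline="").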

ray.dino
  • Good clean "Python-ey" solution – James H Nov 11 '14 at 04:34
  • You might face problems in cases such as "china,usa,china and uk,france". Also, when you use set, you lose the positions, which means that you are moving items into different columns and leaving the last one empty. – user3378649 Nov 11 '14 at 04:45

Iterating over rows and deleting items without regard to their original position can damage the dataset. Every item has an associated index (column/row), and deleting an item can shift the following items into other positions.

Try using pandas in such scenarios. By selecting the items in the same row, you can apply a function that reconstructs the row while respecting their positions. We use the in operator to handle cases such as "china and uk", and we replace the duplicated values with an empty string.

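def trans(x):
    d = [y for y in x]  # work on a copy of the row
    i = 0
    while i < len(d):
        j = i + 1
        item = d[i]
        # Remove this item from every later cell that contains it.
        while j < len(d):
            if item in d[j]:
                d[j] = d[j].replace(item, '')
            j += 1
        i += 1
    return d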
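One caveat with plain substring matching is that a short value such as "uk" would also be stripped out of a longer word such as "ukraine". As a hedged sketch of one way around that (not part of the approach above; the helper name blank_repeats is made up for illustration), the same loop can match whole words only by using a regular expression with word boundaries:

```python
import re


def blank_repeats(row):
    """Remove earlier cell values from later cells, matching whole words only."""
    cells = list(row)  # work on a copy of the row
    for i, item in enumerate(cells):
        if not item:
            continue  # an emptied cell has nothing left to match
        # \b stops 'uk' from matching inside 'ukraine'.
        pattern = re.compile(r"\b" + re.escape(item) + r"\b")
        for j in range(i + 1, len(cells)):
            cells[j] = pattern.sub("", cells[j])
    return cells


print(blank_repeats(["china", "usa", "china and uk", "france"]))
print(blank_repeats(["uk", "ukraine", "uk"]))
```

Like the function above, this keeps every column in place and leaves whatever text remains in the cell, such as " and uk".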

Your code would look like:

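import pandas as pd
from io import StringIO


data = u"""a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand"""
df = pd.read_csv(StringIO(data))


def trans(x):
    d = [y for y in x]  # work on a copy of the row
    i = 0
    while i < len(d):
        j = i + 1
        item = d[i]
        # Remove this item from every later cell that contains it.
        while j < len(d):
            if item in d[j]:
                d[j] = d[j].replace(item, '')
            j += 1
        i += 1
    return d

# result_type='broadcast' (pandas 0.23+) keeps the original columns and index;
# without it, newer pandas versions return a Series of lists.
print(df.apply(trans, axis=1, result_type='broadcast'))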


       a             v        s            f
0  china           usa   and uk       france
1  india     australia      usa           uk
2  japan  south africa           new zealand

To read your own csv file, you just need to pass its file name to read_csv instead of the in-memory string. More details can be found here

df = pd.read_csv("filename.csv")