
I have a big file whose entries I read in Python like this:

import csv

fh_in = open('/xzy/abc', 'r')
parsed_in = csv.reader(fh_in, delimiter=',')
for element in parsed_in:
    print(element)

RESULT:

['ABC', 'chr9', '3468582', 'NAME1', 'UGA', 'GGU']

['DEF', 'chr9', '14855289', 'NAME19', 'UCG', 'GUC']

['TTC', 'chr9', '793946', 'NAME178', 'CAG', 'GUC']

['ABC', 'chr9', '3468582', 'NAME272', 'UGT', 'GCU']

I need to keep only the unique entries, removing any entry whose values in col1, col2 and col3 match an earlier entry. In this example the last line duplicates line 1 on col1, col2 and col3.

I have tried two methods, but both failed:

Method 1:

outlist=[]

for element in parsed_in:     
  if element[0:3] not in outlist[0:3]:
    outlist.append(element)

Method 2:

outlist=[]
parsed_list=list(parsed_in)
for element in range(0,len(parsed_list)):
  if parsed_list[element] not in parsed_list[element+1:]:
    outlist.append(parsed_list[element])

Both of these return all the entries rather than only the entries that are unique on the first three columns.

Please suggest a way to do this.

AK

Bade
  • Possible duplicate of [How do you remove duplicates from a list in Python?](http://stackoverflow.com/questions/479897/how-do-you-remove-duplicates-from-a-list-in-python) – kennytm Mar 01 '12 at 20:52
  • Not a duplicate, as the list must be unique based on only part of each row and not the whole row. – MitMaro Mar 01 '12 at 20:55

2 Answers


You probably want an O(1) lookup to avoid a full scan of the rows collected so far every time you add one, and as Caol Acain said, a set is a good way to do it.

What you want to do is something like:

outlist=[]
added_keys = set()

for row in parsed_in:
    # We use tuples because they are hashable
    lookup = tuple(row[:3])    
    if lookup not in added_keys:
        outlist.append(row)
        added_keys.add(lookup)
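
To try the snippet above end to end, here is a minimal self-contained sketch. It assumes the four sample rows from the question, fed through `io.StringIO` in place of the real file; the helper name `unique_rows` and its `key_columns` parameter are made up for illustration and are not part of the original answer.

```python
import csv
import io

# Sample data copied from the question; io.StringIO stands in for the real file.
SAMPLE = """ABC,chr9,3468582,NAME1,UGA,GGU
DEF,chr9,14855289,NAME19,UCG,GUC
TTC,chr9,793946,NAME178,CAG,GUC
ABC,chr9,3468582,NAME272,UGT,GCU
"""


def unique_rows(rows, key_columns=3):
    """Keep the first row seen for each distinct value of the leading columns."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[:key_columns])  # tuples are hashable, lists are not
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


rows = unique_rows(csv.reader(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

With this input the NAME272 row is dropped, because its first three columns match the NAME1 row, and the remaining rows keep their file order.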

You could alternatively use a dictionary mapping the key to the row, with the caveat that a plain dict does not guarantee the ordering of the input on Python versions before 3.7. Keeping a list plus a set of keys preserves the file order on any version.
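
The dictionary alternative mentioned above could look roughly like this. It is only a sketch: it assumes Python 3.7 or later, where `dict` keeps insertion order, and it uses the sample rows from the question in place of `parsed_in`. The name `first_by_key` is invented for the example.

```python
# Rows copied from the question, shaped like the output of csv.reader.
rows = [
    ['ABC', 'chr9', '3468582', 'NAME1', 'UGA', 'GGU'],
    ['DEF', 'chr9', '14855289', 'NAME19', 'UCG', 'GUC'],
    ['TTC', 'chr9', '793946', 'NAME178', 'CAG', 'GUC'],
    ['ABC', 'chr9', '3468582', 'NAME272', 'UGT', 'GCU'],
]

first_by_key = {}
for row in rows:
    # setdefault stores the row only when the key is not present yet,
    # so the first occurrence of each key wins.
    first_by_key.setdefault(tuple(row[:3]), row)

# Assumption: Python 3.7+, so values() comes back in insertion order.
outlist = list(first_by_key.values())
```

On older interpreters the same code still removes the duplicates, but the order of `outlist` is not guaranteed, which is the caveat the answer describes.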

Crast

Convert your lists to sets!

http://docs.python.org/tutorial/datastructures.html#sets

  • I thought of this first as well, but if you read the problem more closely you will see that sets alone won't work. The items in the list need to be unique only on the first three elements of each sublist. – MitMaro Mar 01 '12 at 21:06