
I have a matrix of around 3000 species classifications, e.g.

Arthropoda/Hexapoda/Insecta/Coleoptera/Cerambycidae/Anaglyptus

Each line is a sequence of taxonomic classifications. I need to filter the 3000 lines so that each one is unique, so the file can be fed to a program that creates phylogenetic (evolutionary) trees.

I have tried to use a set, but I get an error because lists are not hashable. It is important to keep each line together, however, as the values in each column of a line are nested.

What's the best way to ensure I only have unique values in the last column while keeping the integrity of each row?

Many thanks

  • Can you map all the lists to tuples recursively, then use `set` on the outermost? – Adam Smith Dec 11 '14 at 21:36
  • @AdamSmith But won't that lose the ordering? – Bhargav Rao Dec 11 '14 at 21:37
  • Doesn't taxonomy lend itself really nicely to nested dictionaries? That would be the solution I would look towards – Cory Kramer Dec 11 '14 at 21:37
  • @BhargavRao: No. A tuple is basically an immutable list. And tuples are hashable. – Bill Lynch Dec 11 '14 at 21:37
  • @BhargavRao My understanding was that the outermost list is unordered, but each inner list is ordered – Adam Smith Dec 11 '14 at 21:38
  • Can you add some detail please? Perhaps the first few lines of the input file, and examples of how they should look for the program that will consume the output – kdopen Dec 11 '14 at 21:44
  • You show a string but the error is about lists... where did the list come from? Why not just use the hashable string? – tdelaney Dec 11 '14 at 21:49
  • _"the values in each column for each line are nested"_, _"ensure I only have unique values in the last column"_ - I don't understand what these requirements mean. They don't seem to have anything to do with the example provided. – tdelaney Dec 11 '14 at 21:56

2 Answers


As mentioned in the comments, tuples are hashable, even though lists aren't. So let's convert your rows to tuples!

```python
# Create the dataset
L = []
L.append(["Arthropoda", "Hexapoda", "Insecta", "Coleoptera", "Cerambycidae", "Anaglyptus"])
L.append(["Arthropoda", "Hexapoda", "Insecta", "Coleoptera", "Cerambycidae", "Aromia"])

# Instead of a list of lists, let's have a list of tuples!
L = [tuple(x) for x in L]

# Using a set, we can easily remove duplicates
L = set(L)
```
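
One caveat: a set does not keep the rows in their original order. If the order matters, or if only the last column has to be unique, something along these lines might work. This is a minimal sketch rather than a definitive solution; the names `rows`, `unique_rows` and `unique_by_last` and the sample data are made up for illustration, and it assumes Python 3.7+, where dicts preserve insertion order.

```python
# Hypothetical sample data: one classification per row, with a repeated row
rows = [
    ["Arthropoda", "Hexapoda", "Insecta", "Coleoptera", "Cerambycidae", "Anaglyptus"],
    ["Arthropoda", "Hexapoda", "Insecta", "Coleoptera", "Cerambycidae", "Aromia"],
    ["Arthropoda", "Hexapoda", "Insecta", "Coleoptera", "Cerambycidae", "Anaglyptus"],
]

# Dict keys are unique and keep insertion order (Python 3.7+),
# so this drops exact duplicate rows without reordering them
unique_rows = list(dict.fromkeys(tuple(row) for row in rows))

# Keep only the first row seen for each value in the last column
first_by_last = {}
for row in unique_rows:
    first_by_last.setdefault(row[-1], row)
unique_by_last = list(first_by_last.values())
```

Each row stays intact as a tuple, so the nesting of the columns within a row is not disturbed.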
Bill Lynch

The Python masters may be offended, but this answer is worth a try:

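```python
unique_lines = []
with open('file.txt', 'r') as fp:
    for line in fp:
        # Keep only the first occurrence of each line, preserving order
        if line not in unique_lines:
            unique_lines.append(line)

# Write the de-duplicated lines to a new file
with open('file2.txt', 'w') as fp:
    fp.writelines(unique_lines)
```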
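
The `not in` test on a list scans the whole list each time, which is fine for 3000 lines but slows down on larger files. It also treats a final line without a trailing newline as different from the same line with one. A possible variation, sketched here with a hypothetical helper `dedupe_lines` and made-up sample lines, tracks seen lines in a set and strips whitespace before comparing:

```python
def dedupe_lines(lines):
    """Return unique, non-empty lines in first-seen order (hypothetical helper)."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()  # ignore the trailing newline and surrounding whitespace
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Made-up sample: the third line repeats the first but lacks the newline
sample = [
    "Arthropoda/Hexapoda/Insecta/Coleoptera/Cerambycidae/Anaglyptus\n",
    "Arthropoda/Hexapoda/Insecta/Coleoptera/Cerambycidae/Aromia\n",
    "Arthropoda/Hexapoda/Insecta/Coleoptera/Cerambycidae/Anaglyptus",
]
result = dedupe_lines(sample)
```

To use it on files, pass the open file object in place of `sample` and write the result back with `"\n".join(result)`.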
Bhargav Rao