I asked a similar question earlier, but my input files were awkward to work with, so I'm asking again (hopefully these files will be easier to work with!). I'm trying to use Python because it's what I'm learning right now (or maybe this is possible directly in the terminal?).

I clustered one data set of 9701 bacteria names using two different programs. The output of these programs (after some manipulation) is two text files, one for each program, that look something like this:

0 Pyrobaculum aerophilum Thermoproteaceae
1 Mycobacterium aichiense Mycobacteriaceae
1 Mycobacterium alvei Mycobacteriaceae
1 Mycobacterium aromaticivorans Mycobacteriaceae
1 Mycobacterium aubagnense Mycobacteriaceae
1 Mycobacterium boenickei Mycobacteriaceae
1 Mycobacterium brisbanense Mycobacteriaceae

The number is the cluster the bacterium has been placed in, followed by the actual name of the bacterium (so above there is one bacterium in cluster '0' and six in cluster '1').

My question: I want to compare the two files and see whether, and how, the programs sorted the bacteria differently. Ideally, I would generate a new file listing these differences. The catch is that the two programs go through the data differently, so while clusters generated by the two programs may contain the same bacteria, the actual cluster number might differ (for example, ten Brucella bacteria are in cluster '10' in one file, while the same ten Brucella bacteria are in cluster '2321' in the other). For my purposes, if the same bacteria stay together but the cluster number changes between the two files, that is NOT important. BUT if one program put the ten Brucella together in cluster '10' while the other put only nine of them in cluster '2321', I'd want to know!

So, is it possible to compare these two text files so that the actual cluster number isn't looked at, but whether the contents remain the same?

Note: it's easy to change my two cluster files into this format if it's easier to work with:

Brucella pinnipedialis Brucellaceae 0
Brucella suis Brucellaceae 0
Brucella ceti Brucellaceae 0

Or perhaps in some other way?

Jen
  • This is your third apparently identical question. What's wrong with the code you used to try to solve this problem yourself (you did try first, right?), or the answers on your two other questions? – Marcin Aug 13 '13 at 19:12
  • @Marcin That's what I meant by acknowledging that I'm asking this again: in "How to compare Clusters", the files I was working with were very awkward and, in the end, not the ones I should have been working with, so I wanted to ask again with the 'right' files :P :) – Jen Aug 13 '13 at 19:16
  • Once again: What's wrong with the code you used to try to solve this problem yourself (you did try first, right?), or the answers on your two other questions? – Marcin Aug 13 '13 at 19:17
  • Unless the part you're having trouble with is simply parsing the input, the old answers should still apply. – user2357112 Aug 13 '13 at 19:20
  • Will each Genus be unique? What I mean is, say you have cluster 0, which has genus Mycobacterium. Will there be any other cluster with that genus? – Aaron Aug 13 '13 at 19:25
  • @yourfavoriteprotein Yes! For example, all the bacteria in the Mycobacterium genera are in one HUGE cluster in one file, but the other program split Mycobacterium into a few smaller ones. – Jen Aug 13 '13 at 19:27
  • Hmm, OK, I will think about this. BTW if you are a professor, I might be interested in collaboration. I'm a grad student and I had a triple bachelor's in biology/zoology/computer science – Aaron Aug 13 '13 at 19:29
  • @user2357112 That's good to know, thanks! I find it hard to look at that answer and see how it applies to other files (I'm still very, very new to Python and coding :P But I want to learn it!) – Jen Aug 13 '13 at 19:30
  • @yourfavoriteprotein Thanks! Wow at the triple bachelors! I'm actually a student too; I've got the biology down, but not the computer science (yet!) – Jen Aug 13 '13 at 19:32
  • cool :) well let me know if I can help out with Python. I'm working on a quick solution for this question right now. email me at yourfavoriteprotein@gmail.com if you want. – Aaron Aug 13 '13 at 19:37

2 Answers


Assuming each bacterium is in only one cluster, you can rename each cluster after the first (alphabetical) bacterium it contains. Identical clusters will have the same name, so you can compare directly.
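
A minimal sketch of this renaming idea, assuming each line has the form "cluster_id Genus species Family" and that each bacterium appears in exactly one cluster per file. The function names (load_clusters, canonical, diff_clusterings) are made up for illustration and are not from either clustering program:

```python
from collections import defaultdict

def load_clusters(lines):
    """Map cluster id -> set of bacterium names ('Genus species Family')."""
    clusters = defaultdict(set)
    for line in lines:
        parts = line.split(None, 1)  # cluster id, then the rest of the line
        if len(parts) == 2:          # ignore blank or malformed lines
            clusters[parts[0]].add(parts[1].strip())
    return clusters

def canonical(clusters):
    """Rename each cluster after its alphabetically first member."""
    return {min(members): frozenset(members) for members in clusters.values()}

def diff_clusterings(lines_a, lines_b):
    """Return the renamed clusters in each input with no identical counterpart."""
    a = canonical(load_clusters(lines_a))
    b = canonical(load_clusters(lines_b))
    only_a = sorted(name for name in a if a[name] != b.get(name))
    only_b = sorted(name for name in b if b[name] != a.get(name))
    return only_a, only_b
```

Because an open file is iterable line by line, the two cluster files can be passed in directly, for example diff_clusterings(open("program1.txt"), open("program2.txt")), where the file names are placeholders. Both lists come back empty when the two programs grouped the bacteria identically, whatever the cluster numbers are.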

Asterlune
  • depending on how large the files are, something like this might be the way to go if you can't read them all into memory. – Aaron Aug 13 '13 at 23:05

Ok, if it were me, I would try something like this:

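def collector(fileIn):
    # Map each genus to a list of (species, family) tuples.
    d = {}
    with open(fileIn, "r") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            clu, gen, spec, fam = line.split()
            d.setdefault(gen, []).append((spec, fam))
    return d

def compare_files(f1, f2):
    d1 = collector(f1)
    d2 = collector(f2)
    for genus in d1:
        try:
            # Only the number of entries per genus is compared here.
            if len(d1[genus]) != len(d2[genus]):
                print(genus, "is different")
        except KeyError:
            # Raised by d2[genus] when file 2 has no such genus.
            print(genus, "not found in file 2")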

You could print out the tuples in d1 or d2 for each genus that doesn't match to see which are missing. It might also be helpful to compare the keys to see whether either file is missing a genus (I just assume neither is).
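
One way to do both of those things, as a rough sketch: it assumes d1 and d2 are dictionaries shaped like the ones collector builds (genus mapped to a list of (species, family) tuples), and report_differences is a name invented here for illustration.

```python
def report_differences(d1, d2):
    """Return {genus: (entries only in d1, entries only in d2)} for mismatches."""
    report = {}
    # The union of the keys also covers a genus missing from either file.
    for genus in sorted(set(d1) | set(d2)):
        entries_1 = set(d1.get(genus, []))
        entries_2 = set(d2.get(genus, []))
        only_1 = entries_1 - entries_2
        only_2 = entries_2 - entries_1
        if only_1 or only_2:
            report[genus] = (sorted(only_1), sorted(only_2))
    return report
```

Note that this only reports entries that appear under a genus in one file but not the other; on its own it does not say whether two species share a cluster.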

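If the files are enormous, you could replace the try/except with a check that the genus is in d2 before comparing, although a try block costs very little when no exception is raised.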

Hope that helps. Also note that I didn't save the cluster number anywhere. If that's important then maybe you could append (spec, fam, clu) to the dictionary instead.
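
As a sketch of that variation, assuming the same four-field line format: collector_with_clusters and clusters_per_genus are hypothetical names, and the first function takes any iterable of lines (such as an open file) rather than a file name.

```python
def collector_with_clusters(lines):
    """Map genus -> list of (species, family, cluster) tuples."""
    d = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        clu, gen, spec, fam = fields
        d.setdefault(gen, []).append((spec, fam, clu))
    return d

def clusters_per_genus(d):
    """Map genus -> sorted list of the distinct cluster ids it appears in."""
    return {gen: sorted({clu for _, _, clu in entries})
            for gen, entries in d.items()}
```

Running clusters_per_genus on each file and comparing how many clusters a genus falls into would show, for instance, where one program kept a genus in a single cluster and the other split it up.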


Aaron