I asked a similar question earlier, but my input files were awkward to work with, so I'm asking again with files that should be easier to handle. I'm trying to use Python because that's what I'm learning right now (though maybe this is possible directly in the terminal?).
I clustered one data set of 9701 bacteria names using two different programs. After some manipulation, the output is two text files, one per program, that look something like this:
0 Pyrobaculum aerophilum Thermoproteaceae
1 Mycobacterium aichiense Mycobacteriaceae
1 Mycobacterium alvei Mycobacteriaceae
1 Mycobacterium aromaticivorans Mycobacteriaceae
1 Mycobacterium aubagnense Mycobacteriaceae
1 Mycobacterium boenickei Mycobacteriaceae
1 Mycobacterium brisbanense Mycobacteriaceae
The number is the cluster the bacterium has been placed in, and it is followed by the actual name of the bacterium (so, above, there is one bacterium in cluster '0' and six in cluster '1').
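To make the format concrete, here is a minimal sketch of how I imagine one of these files could be read into Python. It assumes the cluster number is always the first whitespace-separated token and that everything after it is the name; `read_clusters` and `sample` are just names I made up for illustration.

```python
from collections import defaultdict

def read_clusters(lines):
    """Group names by cluster ID; each line is '<cluster> <name ...>'."""
    clusters = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the first token is the cluster number, the rest is the name.
        cluster_id, name = line.split(maxsplit=1)
        clusters[cluster_id].add(name)
    return dict(clusters)

# A few lines copied from the sample above.
sample = [
    "0 Pyrobaculum aerophilum Thermoproteaceae",
    "1 Mycobacterium aichiense Mycobacteriaceae",
    "1 Mycobacterium alvei Mycobacteriaceae",
]
clusters = read_clusters(sample)
```

With a real file, something like `read_clusters(open("clusters_a.txt"))` should work the same way, since a file object yields its lines one at a time (the file name here is hypothetical).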
My Question: I want to compare the outputs of the two files and see if and how they sorted the bacteria differently. Ideally, I would generate a new file with these differences. The catch is that the two programs go through the data differently, so while clusters generated by the two programs may contain the same bacteria, the actual "cluster number" might be different (for example, there are ten Brucella bacteria in cluster '10' in one file, while the same ten Brucella bacteria are in cluster '2321' in the other). For my purposes, if the same bacteria are together but the cluster number changes between the two cluster text files, that is NOT important. BUT if one program had put the ten Brucella together in cluster '10' and the other had put only 9 of them in cluster '2321', I'd want to know!
So, is it possible to compare these two text files so that the actual cluster number is ignored and only the contents of each cluster are compared?
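To show what I mean by "ignoring the cluster number", here is a rough sketch of the comparison I have in mind, not a tested solution. It treats each clustering as a set of frozensets of names, so the labels drop out entirely, and then reports the groups that appear in only one of the two clusterings. The function names and the tiny input lists are made up for illustration, and the sketch assumes names are spelled identically in both files.

```python
from collections import defaultdict

def partition(lines):
    """Return the clustering as a set of frozensets, discarding cluster numbers."""
    groups = defaultdict(set)
    for line in lines:
        if line.strip():
            # Assumption: first token is the cluster number, the rest is the name.
            cluster_id, name = line.strip().split(maxsplit=1)
            groups[cluster_id].add(name)
    return {frozenset(members) for members in groups.values()}

def compare(lines_a, lines_b):
    """Return (groups only in A, groups only in B)."""
    part_a, part_b = partition(lines_a), partition(lines_b)
    return part_a - part_b, part_b - part_a

# Hypothetical miniature inputs: program B splits the two Brucella apart.
program_a = [
    "10 Brucella suis Brucellaceae",
    "10 Brucella ceti Brucellaceae",
    "11 Pyrobaculum aerophilum Thermoproteaceae",
]
program_b = [
    "2321 Brucella suis Brucellaceae",
    "7 Brucella ceti Brucellaceae",
    "5 Pyrobaculum aerophilum Thermoproteaceae",
]
only_a, only_b = compare(program_a, program_b)

for members in only_a:
    print("only in A:", sorted(members))
for members in only_b:
    print("only in B:", sorted(members))
```

Groups with identical membership in both files never show up in either result, whatever their cluster numbers are, which is the behaviour I'm after. Writing the two results to a new file instead of printing them would give the "differences" file.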
Note: it's easy to change my two cluster files into this format if it's easier to work with:
Brucella pinnipedialis Brucellaceae 0
Brucella suis Brucellaceae 0
Brucella ceti Brucellaceae 0
Or perhaps in some other way?
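For what it's worth, the name-first layout looks just as easy to split in Python as the number-first one. A small sketch, assuming the cluster number is always the last whitespace-separated token (`split_name_first` is a made-up helper name):

```python
def split_name_first(line):
    """Split '<name ...> <cluster>' into (cluster_id, name)."""
    # rsplit with maxsplit=1 splits once from the right, so the
    # multi-word name stays intact and the trailing number is separated.
    name, cluster_id = line.strip().rsplit(maxsplit=1)
    return cluster_id, name

pair = split_name_first("Brucella suis Brucellaceae 0")
```

So the choice between the two layouts probably doesn't matter much for a Python solution.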