
I'm struggling to synthesise several answers I've been reading that come close to what I'm trying to do, and I can't formulate some google-fu that'll get me any existing answer! I know it's really simple but I'm at a loss now.

Very similar to this question, I want to find several strings, which I've read into a list, that occur in another file. However, I only want the lines where each string is matched exactly once, so `any` and `all` don't fit the bill as far as I can tell.

What I've got so far is close, line.count is giving me back numbers of occurrences for each line, but it's wrong in 2 ways:

Firstly, line.count is under by 1 somehow for a given line?

I know I'm doing something wrong with how I'm iterating/searching each key and/or using the == 1 test, but I can't figure it out.

The list of strings I'm looking for is:

['AG49', 'AG51', 'AGBD', 'AGHT', 'AGJN', 'AGKC', 'AGNP', 'AGTI', 'LG01', 'LG33', 'LG45']

And some example lines of the file to search are below. Lines have anything from 2 to many tens of entries (OG_1000 is actually the longest line, with the most members):

OG_1000: AG49|00461 AG49|03016 AG49|03395 AG49|01465 AG49|01485 AG49|02179 AG49|02513 AG49|03071 AG49|03396 AG49|02649 AG51|00302 AG51|00779 AG51|01746 AG51|02077 AG51|02502 AG51|01654 AG51|01963 AG51|01965 AGBD|01544 AGBD|02407 AGBD|02722 AGBD|03152 AGBD|02292 AGBD|03607 AGBD|03608 AGBD|03609 AGHT|00130 AGHT|00873 AGHT|00911 AGHT|01291 AGHT|02476 AGHT|02881 AGHT|02477 AGHT|02973 AGHT|02974 AGHT|02975 AGJN|00381 AGJN|00633 AGJN|01876 AGJN|02007 AGJN|02058 AGJN|02059 AGJN|02060 AGJN|02398 AGJN|02399 AGJN|02433 AGJN|02418 AGKC|00658 AGKC|00659 AGKC|00660 AGKC|01985 AGKC|02826 AGKC|02881 AGKC|01323 AGKC|01327 AGKC|01324 AGKC|02267 AGKC|02827 AGKC|02880 AGKC|04269 AGKC|02428 AGNP|00290 AGNP|02833 AGNP|03160 AGNP|03601 AGNP|03987 AGNP|03988 AGNP|03989 AGNP|04108 AGTI|00388 AGTI|01459 AGTI|03163 AGTI|03688 AGTI|00570 AGTI|04026 AGTI|03715 AGTI|03716 AGTI|03717 LG01|00908 LG01|00909 LG01|00910 LG01|01116 LG01|03323 LG01|03588 LG01|03589 LG01|03590 LG01|03591 LG01|01118 LG01|01908 LG01|03182 LG01|03189 LG01|01906 LG33|01192 LG33|01786 LG33|01787 LG33|01973 LG33|03700 LG33|04518 LG33|04759 LG33|01756 LG33|01760 LG33|01971 LG33|02055 LG33|02056 LG33|02057 LG45|00001 LG45|01508 LG45|01643 LG45|00233 LG45|00786 LG45|01599 LG45|01600 LG45|01601 LG45|04210 LG45|04212 LG45|04213 LG45|04637 LG45|03265 LG45|04211 LG45|03255 LG45|03261 AG51|00629 AGKC|04214 AG49|02651 AGBD|01546 AGKC|02430 AGNP|02835 AGTI|01461 LG45|00784 LG33|04104 LG45|00192 LG45|00193 LG33|00381 LG33|01750
OG_1082: AG49|00880 AG49|02960 AG51|02815 AG51|04137 AGNP|00113 AGNP|03735 AGTI|00006 AGTI|02047 AGBD|01827 AGHT|00357 AGJN|03158 AGKC|02788 LG01|01472 LG33|02682 LG45|01009
OG_7229: LG33|04676 LG45|01800

An example valid line would be:

OG_1264: AG49|00061 AG51|03472 AGBD|01583 AGHT|03015 AGJN|02348 AGKC|00003 AGNP|02702 AGTI|02067 LG01|00073 LG33|02222 LG45|04062

Where each string occurs only once.

My code at the moment (minus some option parsing etc):

# Get a list of strings to iterate over
def getKeys(nameFile):
    names = []
    with open(nameFile, "r") as namehandle:
        for line in namehandle:
            # Strip the trailing newline so each key matches as a plain substring
            names.append(line.rstrip('\n'))
    return names

# Main code:
keys = getKeys(nameFile)

matchedLines = []

with open(args.infile, "r") as clusterFile:
    for line in clusterFile:
        for key in keys:
            if line.count(key) == 1:
                matchedLines.append(line)
Joe Healey
  • I think your error is appending to `matchedLines` for each key in the list where the count is 1, rather than waiting until you know they are all matches once. – Paul Rooney Nov 08 '16 at 12:15
  • Yeah I think so too. I'm not sure how to test for the strings 'all at once' and then get lines that match all of them, rather than matching each in turn and then 'reassembling' the correct lines. – Joe Healey Nov 08 '16 at 12:17
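The behaviour described in the comments can be reproduced on three of the sample lines from the question. This is only a minimal sketch, not the original script: `count_appends` is a hypothetical helper that mimics the inner loop, and the lines are pasted in as string literals instead of being read from a file.

```python
keys = ['AG49', 'AG51', 'AGBD', 'AGHT', 'AGJN', 'AGKC',
        'AGNP', 'AGTI', 'LG01', 'LG33', 'LG45']

# Three sample lines copied from the question (assumed to end in a newline)
lines = [
    "OG_1082: AG49|00880 AG49|02960 AG51|02815 AG51|04137 AGNP|00113 "
    "AGNP|03735 AGTI|00006 AGTI|02047 AGBD|01827 AGHT|00357 AGJN|03158 "
    "AGKC|02788 LG01|01472 LG33|02682 LG45|01009\n",
    "OG_7229: LG33|04676 LG45|01800\n",
    "OG_1264: AG49|00061 AG51|03472 AGBD|01583 AGHT|03015 AGJN|02348 "
    "AGKC|00003 AGNP|02702 AGTI|02067 LG01|00073 LG33|02222 LG45|04062\n",
]


def count_appends(line, keys):
    """Hypothetical helper: how often the question's loop would append `line`."""
    return sum(1 for key in keys if line.count(key) == 1)


# Map each cluster ID to the number of times its line would be appended
appends = {line.split(":")[0]: count_appends(line, keys) for line in lines}
print(appends)
```

Under these assumptions, every line is appended once per key that occurs exactly once in it, so even a line that contains only two of the keys ends up in `matchedLines`, and the one fully valid line (OG_1264) is appended eleven times.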

1 Answer


In your code, matchedLines will contain the same line many times (once for every key that occurs exactly once in it), and it still doesn't give you the lines that match all of the keys exactly once. For that purpose, you can still use all:

with open(args.infile, "r") as clusterFile:
    # Keep a line only if every key occurs in it exactly once
    matchedLines = [line for line in clusterFile
                    if all(line.count(key) == 1 for key in keys)]
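To see the effect of this filter without a file on disk, here is a small sketch run against three of the question's sample lines held in a list. The names `cluster_lines` and `matched_ids` are assumptions made for illustration. The sketch passes a generator expression to `all` rather than a list, which gives the same result but stops checking a line at the first key that fails.

```python
keys = ['AG49', 'AG51', 'AGBD', 'AGHT', 'AGJN', 'AGKC',
        'AGNP', 'AGTI', 'LG01', 'LG33', 'LG45']

# Stand-in for the opened cluster file: three sample lines from the question
cluster_lines = [
    "OG_1082: AG49|00880 AG49|02960 AG51|02815 AG51|04137 AGNP|00113 "
    "AGNP|03735 AGTI|00006 AGTI|02047 AGBD|01827 AGHT|00357 AGJN|03158 "
    "AGKC|02788 LG01|01472 LG33|02682 LG45|01009\n",
    "OG_7229: LG33|04676 LG45|01800\n",
    "OG_1264: AG49|00061 AG51|03472 AGBD|01583 AGHT|03015 AGJN|02348 "
    "AGKC|00003 AGNP|02702 AGTI|02067 LG01|00073 LG33|02222 LG45|04062\n",
]

# A line is kept once, and only if every key occurs in it exactly once
matched = [line for line in cluster_lines
           if all(line.count(key) == 1 for key in keys)]

# Reduce each kept line to its cluster ID for easier inspection
matched_ids = [line.split(":")[0] for line in matched]
print(matched_ids)  # ['OG_1264']
```

Note that `str.count` counts substring occurrences, so this relies on no key appearing inside another key or inside the numeric part of an entry. That holds for the sample data shown, but it is worth checking for other key sets.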
Julien Spronck