I have two files. The first is a list of items, one per line. The second is a TSV file with many items per line, so some of its lines contain items that are also listed in the first file. I need to generate a list of the lines from the second file that contain items listed in the first file.

grep -f is being finicky for me, so I decided to write my own Python script. In it, big_list.tsv is the second file and tiny_list.txt is the first. This is what I came up with:

def main():
    desired_subset = []
    small_list = open('tiny_list.txt','r')
    big_list = open('big_list.tsv','r')
    for i in small_list.readlines():
        i = i.rstrip('\n')
        for big_line in big_list:
            if i in big_line:
                if i not in desired_subset:
                    desired_subset.append(big_line)
    print(desired_subset)
    print(len(desired_subset))

 
main()

 

The problem is that the loop only seems to process the first line of tiny_list.txt. Any suggestions?

  • Does this answer your question? [Re-read an open file Python](https://stackoverflow.com/questions/17021863/re-read-an-open-file-python) – Nicholas Hunter Apr 29 '21 at 20:37

1 Answer

When you iterate over a file object (here, big_list), you consume it, so on the second iteration over small_list nothing is left in big_list. Try reading big_list into a list with .readlines() before the main for loop and iterating over that list instead:

def main():
    desired_subset = []
    small_list = open('tiny_list.txt','r')
    big_list = open('big_list.tsv','r').readlines() # note here
    for i in small_list.readlines():
        i = i.rstrip('\n')
        for big_line in big_list:
            if i in big_line:
                if big_line not in desired_subset:  # compare the line, not the item, to skip duplicates
                    desired_subset.append(big_line)
    print(desired_subset)
    print(len(desired_subset))
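
To see the exhaustion in isolation, here is a minimal sketch. It writes a throwaway two-line file (the contents are made up for illustration), iterates over it twice, and then rewinds with seek(0), which is the other way to re-read an open file:

```python
import os
import tempfile

# Write a tiny throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("a\t1\nb\t2\n")
    path = tmp.name

with open(path, "r") as big_list:
    first_pass = [line for line in big_list]   # reads both lines
    second_pass = [line for line in big_list]  # file is exhausted, nothing left
    big_list.seek(0)                           # rewind to the start of the file
    third_pass = [line for line in big_list]   # reads both lines again

os.remove(path)

print(len(first_pass), len(second_pass), len(third_pass))
```

The second pass is empty for the same reason the inner loop in the question finds nothing after the first item.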

Also, you never close your files, which is bad practice. I'd suggest using a context manager (opening the files with a with statement):

def main():
    desired_subset = []
    # The backslash continues the with statement onto the next line
    with open('tiny_list.txt','r') as small_list, \
         open('big_list.tsv','r') as big_list:

        small_file_lines = small_list.readlines()
        big_file_lines = big_list.readlines()

    for i in small_file_lines:
        i = i.rstrip('\n')
        for big_line in big_file_lines:
            if i in big_line:
                # compare the line itself, not the item, to skip duplicates
                if big_line not in desired_subset:
                    desired_subset.append(big_line)

    print(desired_subset)
    print(len(desired_subset))
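
If the big file is large, a single pass over it may be preferable to the nested loops. The following is only a sketch under assumptions that are not stated in the question: that an item should match a whole tab-separated field rather than any substring of the line, and that blank lines in the small file should be ignored. The helper name matching_lines and the sample data are hypothetical.

```python
def matching_lines(item_lines, tsv_lines):
    """Return the TSV lines with at least one field listed in item_lines."""
    # A set gives fast membership checks; blank items are skipped because
    # an empty string would otherwise match every line.
    wanted = {line.strip() for line in item_lines if line.strip()}
    matches = []
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: an item must equal a whole field, not a substring.
        if any(field in wanted for field in fields):
            matches.append(line)
    return matches


# Made-up sample data standing in for tiny_list.txt and big_list.tsv
tiny = ["apple\n", "kiwi\n", "\n"]
big = ["apple\tpear\n", "plum\tfig\n", "pineapple\tkiwi\n"]

result = matching_lines(tiny, big)
print(result)
print(len(result))
```

Each line is visited once, so duplicates cannot occur and no separate check is needed. If substring matching is what you actually want, keep the nested-loop version above.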
Yevhen Kuzmovych