4

I am implementing a statistical program and have created a performance bottleneck. I was hoping the community could point me in the direction of a possible optimization.

I create a set for each row in a file and find its intersection with the set built from every other row in the same file. I then use the size of that intersection to filter certain sets out of the output. The problem is that I have a nested for loop (O(n^2)), and the files coming into the program are typically just over 20,000 lines long. I have timed the algorithm: for under 500 lines it runs in about 20 minutes, but for the big files it takes about 8 hours to finish.

I have 16 GB of RAM at my disposal and a reasonably fast 4-core Intel i7 processor. I noticed no significant difference in memory use when I copied list1 and used a second list for comparison instead of opening the file again (maybe this is because I have an SSD?). I thought the 'with open' mechanism reads/writes directly from the disk, which is slower, but noticed no difference when using two lists. In fact, the program rarely uses more than 1 GB of RAM during operation.

I am hoping that other people have used a particular data type, or understand multiprocessing in Python better than I do, and might be able to help me speed things up. I appreciate any help, and I hope my code isn't too poorly written.

import ast, sys, os, shutil

# arg1 and arg2 come from the command-line driver (not shown)
list1 = []
end = 0
filterValue = 3

# creates output file with filterValue appended to name
with open(arg2 + arg1 + "/filteredSets" + str(filterValue), "w") as outfile:
    with open(arg2 + arg1 + "/file", "r") as infile:
        # create a list of sets, one per row in the file
        for row in infile:
            list1.append(set(ast.literal_eval(row)))

        # go back to the top of the file and compare every row against the list
        pos = 0
        infile.seek(0)
        for row in infile:
            # if the file only has one row, no comparisons need to be made
            if not(len(list1) == 1):
                # get the set for the current row and...
                set1 = set(ast.literal_eval(row))
                # ...find its intersection with every other set in the file
                for i in range(0, len(list1)):
                    # don't compare the set with itself
                    if not(pos == i):
                        set2 = list1[i]
                        set3 = set1.intersection(set2)
                        # if the two sets have fewer than filterValue items in common
                        if(len(set3) < filterValue):
                            # and this was the last set to compare against
                            if(i == len(list1) - 1):
                                # write the row to outfile
                                outfile.write(row)
                        # three or more items in common: drop the row
                        else:
                            break
            else:
                # single-row file: nothing to compare against
                outfile.write(row)
            # increase position in infile
            pos += 1

Sample input would be a file with this format:

[userID1, userID2, userID3]
[userID5, userID3, userID9]
[userID10, userID2, userID3, userID1]
[userID8, userID20, userID11, userID1]

The output file if this were the input file would be:

[userID5, userID3, userID9]
[userID8, userID20, userID11, userID1]

...because the two sets removed contained three or more of the same user IDs.
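
To make the criterion concrete, here is the intersection test on the sample rows (just a sketch, assuming the IDs are plain strings; in the real file they can be any literals that ast.literal_eval accepts):

row1 = {"userID1", "userID2", "userID3"}
row2 = {"userID5", "userID3", "userID9"}
row3 = {"userID10", "userID2", "userID3", "userID1"}

print(len(row1 & row3))  # 3 -> at or above filterValue, so rows 1 and 3 are filtered out
print(len(row1 & row2))  # 1 -> below filterValue, so this pair on its own filters nothing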

Visual example (picture not reproduced here)

Kenneth Orton
  • If your RAM is significantly larger than your file, the OS *may* cache the file in RAM. It is impossible to control this in any reasonable way. It happens automatically without Python asking. – Kevin Feb 13 '16 at 06:51
  • Probably won't make a difference but, why are you opening /file twice? – madprops Feb 13 '16 at 06:53
  • You may not need to read the file twice. You should take a look at itertools.permutations to get the comparison sets from the first list (https://docs.python.org/2/library/itertools.html#itertools.permutations) – Obsidian Feb 13 '16 at 06:57
  • You can also submit this question to http://codereview.stackexchange.com/ – Selcuk Feb 13 '16 at 07:03
  • Okay, I seeked to beginning of file instead of reopening but I guess the question still remains, how to process large files by comparing each line to itself without creating a huge bottleneck. – Kenneth Orton Feb 13 '16 at 09:16
  • Could you please explain what your algorithm does? Not quite obvious. Btw, it looks like this branch `if(len(set3) < filterValue):` could have `else: continue`, because once skipped, `end` cannot reach zero for that value of `set1`. – Serge Seredenko Feb 13 '16 at 09:32
  • Is it possible to reduce the `O(n^2)` complexity, e.g. by sorting the sets first? I don't know what your data looks like, so I'm afraid I cannot give more detailed advice at this point. But usually, it's the algorithm that you need to ameliorate. – WhatsUp Feb 13 '16 at 11:34
  • @SergeSeredenko you're right except it should break the inner loop not continue because if the intersection of one set in the file is greater than three then it will never satisfy the condition to write to the output file. That only reduces the speed of the algorithm by a tiny percentage though because most set intersections for the data are less than 3. But it is the proper way and I updated the code. Thanks! – Kenneth Orton Feb 13 '16 at 13:45
  • Yeah, break, sorry. I don't understand the example you added. Can you please explain how the rows are bad or how they are good, because in your example all rows have max 1-element intersection. Also, the code in your post works? – Serge Seredenko Feb 13 '16 at 14:05
  • @SergeSeredenko The rows aren't bad or good they are being filtered. So, take one row in file and compare it to all other rows in file as sets. The intersection of each comparison will be set1 intersect set2. If the sets have 3 or more elements in common, then do not write them to output. So in my example input, you can see that sets(rows) 1 and 3 will not be written because they have 3 of the same(intersecting) user ids. Sets are a fundamental idea of discrete mathematics. I guess it helps to know how sets work, otherwise it's pretty, pretty, clear what's going on. – Kenneth Orton Feb 13 '16 at 15:29
  • @root-11 I added a picture because it seems clear to me what is going on. I literally commented the hell out of the above code. I don't think my intention was to get an answer about the code, more about how to compare each line in a file with every other line the same file without using a nested loop. I think I found my answer and that is to use multiprocessing in Python and chunk the incoming file. Similar to mapreduce but not quite. – Kenneth Orton Feb 13 '16 at 15:47
  • @KennethOrton The picture helped :-) – root-11 Feb 13 '16 at 16:01
  • @KennethOrton, is the data in the rows ordered? Also why are you using literal_eval? That seems to be adding a lot of unnecessary overhead. – Padraic Cunningham Feb 13 '16 at 16:18
  • Ouch, that nesting! Firstly, if memory is not an issue, why don't you read the file into memory first, parsing each line into the sets you require, then do your line-by-line intersections, and finally open an output file and write out your result. My gut tells me the bottleneck is the disk IO. – tutuDajuju Feb 13 '16 at 16:23
  • Thank you everyone for the responses I got an incredibly good answer here: http://codereview.stackexchange.com/questions/119862/python-compare-every-line-in-file-with-itself. Unfortunately, I decided not even to use this program as the filter just removes user data based on an arbitrary filtration system. These things happen I guess when picking up the pieces from where others left off. Thanks again! – Kenneth Orton Feb 13 '16 at 16:46

2 Answers

0

This answer is not about how to split the code into functions, name variables, etc. It's about a faster algorithm in terms of complexity.


I'd use a dictionary. I won't write exact code; you can do it yourself.

Sets = dict()
for rowID, row in enumerate(Rows):
    for userID in row:
        if Sets.get(userID) is None:
            Sets[userID] = set()
        Sets[userID].add(rowID)
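
As a side note, the same index can be built a little more compactly with collections.defaultdict (the resulting dictionary is identical):

from collections import defaultdict

Sets = defaultdict(set)          # userID -> set of row numbers containing it
for rowID, row in enumerate(Rows):
    for userID in row:
        Sets[userID].add(rowID)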

So now we have a dictionary that can be used to quickly obtain the row numbers of rows containing a given userID.

BadRows = set()
for rowID, row in enumerate(Rows):
    Intersections = dict()
    for userID in row:
        for rowID_cmp in Sets[userID]:
            if rowID_cmp != rowID:
                Intersections[rowID_cmp] = Intersections.get(rowID_cmp, 0) + 1
    # Now Intersections contains info about how many "times"
    # row numbered rowID_cmp intersects the current row
    filteredOut = False
    for rowID_cmp in Intersections:
        if Intersections[rowID_cmp] >= filterValue:
            BadRows.add(rowID_cmp)
            filteredOut = True
    if filteredOut:
        BadRows.add(rowID)
Having saved the row numbers of all filtered-out rows in BadRows, we now iterate one last time:

for rowID, row in enumerate(Rows):
    if rowID not in BadRows:
        # output the row, e.g. write it to the output file
        pass

This works in 3 scans and in O(n log n) time. Maybe you'd have to rework iterating over the Rows array, because in your case it's a file, but that doesn't really change much.

I'm not sure about the Python syntax and details, but you get the idea behind my code.
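
For reference, putting the three scans together for the question's file format could look roughly like this (an untested sketch; it assumes the rows parse with ast.literal_eval as in the question, and filter_rows and the path arguments are placeholder names):

import ast
from collections import defaultdict

def filter_rows(in_path, out_path, filter_value=3):
    # Scan 1: parse every row into a set of user IDs.
    with open(in_path) as infile:
        lines = infile.readlines()
    rows = [set(ast.literal_eval(line)) for line in lines]

    # Scan 2: inverted index, userID -> row numbers containing that ID.
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for user_id in row:
            index[user_id].add(row_id)

    # Scan 3: for each row, count shared IDs with every row it overlaps at all.
    bad_rows = set()
    for row_id, row in enumerate(rows):
        shared_counts = defaultdict(int)
        for user_id in row:
            for other_id in index[user_id]:
                if other_id != row_id:
                    shared_counts[other_id] += 1
        for other_id, shared in shared_counts.items():
            if shared >= filter_value:
                bad_rows.update((row_id, other_id))

    # Write out only the rows that were never flagged.
    with open(out_path, "w") as outfile:
        for row_id, line in enumerate(lines):
            if row_id not in bad_rows:
                outfile.write(line)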

Serge Seredenko
-1

First of all, please pack your code into functions which each do one thing well.

def get_data(*args):
    # get the data.
    ...

def find_intersections_sets(list1, list2):
    # do the intersections part.
    ...

def loop_over_some_result(result):
    # insert assertions so that you don't end up looping forever:
    assert result is not None
    ...

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    intersects = find_intersections_sets(L1, L2)
    ...

if __name__ == "__main__":
    myfunc()

then you can easily profile the code using:

if __name__ == "__main__":
    import cProfile
    cProfile.run('myfunc()')

which gives you invaluable insight into your code's behaviour and allows you to track down logical bugs. For more on cProfile, see How can you profile a Python script?

An option to track down a logical flaw (we're all human, right?) is to use a timeout function in a decorator like this (Python 2) or this (Python 3):

The sketch above can then be changed to:

def get_data(*args):
    # get the data.
    ...

@timeout(10)  # seconds <---- the clever bit!
def find_intersections_sets(list1, list2):
    # do the intersections part.
    ...

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    intersects = find_intersections_sets(L1, L2)
    ...

...where the timeout operation will raise an error if it takes too long.
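
The linked recipes are not reproduced here, but a minimal sketch of such a decorator (the details are my own assumption: signal-based and Unix-only) could look like this:

import signal
from functools import wraps

def timeout(seconds):
    # Raise an error if the decorated call runs for longer than `seconds` (Unix only).
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise RuntimeError("%s timed out after %d seconds" % (func.__name__, seconds))
            old_handler = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)                              # cancel the pending alarm
                signal.signal(signal.SIGALRM, old_handler)   # restore the previous handler
        return wrapper
    return decorator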

Here is my best guess:

import ast

def get_data(filename):
    with open(filename, 'r') as fi:
        data = fi.readlines()
    return data

def get_ast_set(line):
    return set(ast.literal_eval(line))

def less_than_x_in_common(set1, set2, limit=3):
    return len(set1.intersection(set2)) < limit

def check_infile(datafile, savefile, filtervalue=3):
    lines = get_data(datafile)
    sets = [get_ast_set(line) for line in lines]
    outlines = []
    for i, set1 in enumerate(sets):
        # keep a row only if it has fewer than filtervalue items
        # in common with every other row in the file
        if all(less_than_x_in_common(set1, set2, limit=filtervalue)
               for j, set2 in enumerate(sets) if i != j):
            outlines.append(lines[i])
    with open(savefile, 'w') as fo:
        fo.writelines(outlines)

if __name__ == "__main__":
    # arg1, arg2 and filterValue come from the question's command-line driver
    datafile = str(arg2 + arg1 + "/file")
    savefile = str(arg2 + arg1 + "/filteredSets" + str(filterValue))
    check_infile(datafile, savefile)
root-11
  • Seems like the answers I am getting are not what I was expecting. This is just a section of the code...the code works...I do have a profiler, as this is only one step in the analysis of the data and there is a command-line driver. I guess I was expecting to hear more about a possible data type I could use to traverse and compare each line with all other lines in the file more efficiently because, as it stands, the nested for loop is O(n^2). Adding 10 seconds to each comparative step between sets in a 20,000-line file is like swimming in honey. – Kenneth Orton Feb 13 '16 at 13:15
  • That's why you have to profile your code. Python devs put a lot of effort into making it fast. You rarely have to worry about data types. In 99/100 cases it's the coder who missed some logic and is either doing things twice or duplicating information instead of incrementing, e.g. `L1 = L2[:] + [n]` instead of `L2.append(n)`. – root-11 Feb 13 '16 at 14:21
  • Sorry, missed the if clause in `for row in list1`. added now. – root-11 Feb 13 '16 at 16:11
  • 1
    @KennethOrton If this is not what you expect, then just write that I should delete this answer. No hard feelings. – root-11 Feb 13 '16 at 16:16
  • @KennethOrton "Why are you hounding me for these simple omissions that are obviously not applicable to providing help". Nobody is hounding anyone. This is stack exchange where people try to help others without necessarily having a lot of insight to what their exact thinking is. Short iterations of feedback is what normally solves the problem. Don't take feedback personal. Any response on SE is just a polite proposal sent from a well-intending person. – root-11 Feb 13 '16 at 16:22