I have an odd problem. According to posts like this one, I would expect IDLE to be slower than running my code from the command line. However, I see the complete opposite.

The program compares two files, pairs matching lines together, and writes all matches to a new file. I think it's similar to a join in SQL, but I need to ignore lines without matches. Here is an overview of what the program does:

  • The program reads one large file (~1kb) and stores the key-value pair from each line in a dictionary.
  • Then it reads the other large file (~1kb). For each line, it tests whether that line's key is present in the dictionary. If so, it writes the pair to a new file.

It seems like the program gets stuck when it tries to access the very large dictionary. It takes about 2-3 minutes to run in IDLE, but after 1 hr the program is still not done on the command line. When I check the write_file, it is being written to, just at a very slow rate.

Here is some simplified data from the first file. The fields are separated by a tab; the number is the key and the info is the value:

20\tinfo_first_file_20\n

18\tinfo_first_file_18\n

Here is an example of the second file:

20\tinfo_second_file_20\n

30\tinfo_second_file_20\n

Here is an example of the file being written:

20\tinfo_first_file_20\t20\tinfo_second_file_20\n
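For reference, the expected pairing on the sample lines above can be sketched as follows. This is only an illustration: it assumes the lines are held in in-memory lists rather than read from files, and the helper name pair_lines is hypothetical, not part of the actual program.

```python
def pair_lines(first_lines, second_lines):
    """Join lines on their tab-separated key (hypothetical helper)."""
    # Build the lookup table from the first input: key -> value
    first = {}
    for line in first_lines:
        key, value = line.rstrip('\n').split('\t')
        first[key] = value

    # Keep only lines of the second input whose key was seen in the first
    paired = []
    for line in second_lines:
        key, value = line.rstrip('\n').split('\t')
        if key in first:
            paired.append('\t'.join([key, first[key], key, value]) + '\n')
    return paired


# Sample lines copied from the examples above
first_file = ['20\tinfo_first_file_20\n', '18\tinfo_first_file_18\n']
second_file = ['20\tinfo_second_file_20\n', '30\tinfo_second_file_20\n']
result = pair_lines(first_file, second_file)
```

Here result holds only the line for key 20; keys 18 and 30 have no partner in the other input and are dropped.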

Function

def pairer(file_1, file_2, write_file):
    wf = open(write_file, 'w')
    f1 = open(file_1, 'r')

    line = f1.readline()
    d = {}
    while line != "":
        key, value = line.strip('\n').split('\t')
        d[key] = value
        line = f1.readline()

    f2 = open(file_2, 'r')
    line_2 = f2.readline()

    while line_2 != "":
        key, value = line_2.strip('\n').split('\t')
        if key in d.keys():
            to_write = key +'\t' + d[key] + '\t' + key +'\t'+ value + '\n'
            wf.write(to_write)
        line_2 = f2.readline()

How I run the code in IDLE

if __name__=="__main__":

    pairer('file/location/of/file_1', 'file/location/of/file_2', 'file/location/of/write_file')

How I run the code in terminal

if __name__=="__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('file_1', action="store")
    parser.add_argument('file_2', action="store")
    parser.add_argument('write_file', action="store")
    results = parser.parse_args()
    pairer(results.file_1, results.file_2, results.write_file)

All the code is a simplification of the actual code. I hope I included enough to allow someone to point me in the correct direction but not too much so I keep it to the point. I am new to programming so this might be an obvious problem but I haven't been able to find anything.

Is there a maximum size for a dictionary when using the terminal? Is it stored differently, so that it ends up maxing out my memory?

I'm on Mac OS X 10.8. Tkinter is updated. I'm using Python 2.7. Thanks in advance.

EDIT:

Before the program reaches this point it does have about 30 mins of other analysis to do, but it only fails here. Not sure if that is related. The other part just separates two huge files (~30kb each) into 22 smaller files. No dictionaries are involved there and the speed is about the same, so I can deal with the data on a smaller level.

EDIT 2:

Does the memory clear differently when using terminal?

Another thing I noticed: when I look at the Activity Monitor app, the code seems to use much more of the CPU when I run it in IDLE. It looks like it is using more than one processor, but this doesn't make sense since my code isn't written to run in parallel. Also, the computer makes more noise when I run it in IDLE. Not very quantitative, but an observation.

Samantha
  • 2
    See http://stackoverflow.com/questions/11241523/why-does-python-code-run-faster-in-a-function – Fredrik Pihl Jun 12 '13 at 19:33
  • 1
    @FredrikPihl +1 - Didn't know that, but in this instance, both versions do the bulk of the work in the `pairer()` function, so it doesn't seem as if that would account for the discrepancy. – Aya Jun 12 '13 at 19:39
  • @Aya - you are correct. Didn't really look at the code :-) Lookup of member variables is dead slow, by the way, but that doesn't explain it either... – Fredrik Pihl Jun 12 '13 at 19:51
  • 1
    Can you clarify about the size of the files and the analysis you need to do? The times you mention seem very long for files of a few 10s of kb. If you're doing some complex processing the time difference might be due to a difference there, i.e. it may not be possible to answer the question from the info given. – Stefan Jun 22 '13 at 09:50

1 Answer

A 1kb file sounds pretty small, but since the question is a little vague, here are a couple of tips that might solve the problem:

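The argparse arguments in the question are declared without leading dashes, so they are positional. That means on the command line you'd need something like:

python prog.py file_1 file_2 write_file

I'm not sure if this is what you wanted. If you don't need anything else from argparse, you could keep it simple and read sys.argv directly: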

import sys

if __name__ == '__main__':
    # sys.argv[0] is the script name; the three paths follow it
    file_1 = sys.argv[1]
    file_2 = sys.argv[2]
    write_file = sys.argv[3]
    pairer(file_1, file_2, write_file)

with which you would call it like:

python prog.py file_1 file_2 write_file
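To check how argparse will read a given command line without leaving the interpreter, you can pass an explicit list to parse_args. The sketch below assumes a parser declared like the one in the question; the build_parser wrapper and the file names are made up for illustration.

```python
import argparse


def build_parser():
    # Names without leading dashes are positional arguments,
    # filled in the order they appear on the command line.
    parser = argparse.ArgumentParser()
    parser.add_argument('file_1')
    parser.add_argument('file_2')
    parser.add_argument('write_file')
    return parser


# The list stands in for: python prog.py a.txt b.txt out.txt
args = build_parser().parse_args(['a.txt', 'b.txt', 'out.txt'])
```

After this, args.file_1, args.file_2 and args.write_file hold the three paths in order, the same values sys.argv[1:4] would give.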

Also, this is mostly a style issue, but I'd modify pairer a little: use a for loop over each file, and don't create a keys() list.

def pairer(file_1, file_2, write_file):

    d = {}
    # using 'with' closes the file for you, even if an error occurs
    with open(file_1, 'r') as f1:
        for line in f1:
            # strip() with no arguments removes leading and trailing whitespace
            key, value = line.strip().split('\t')
            d[key] = value

    with open(file_2, 'r') as f2, open(write_file, 'w') as wf:
        for line_2 in f2:
            key, value = line_2.strip().split('\t')
            # don't do 'if key in d.keys()': in Python 2, .keys() constructs a list of keys
            # the in operator on the dict checks the key directly, which is O(1) instead of O(n)
            # this should give you a pretty big speed boost
            if key in d:
                # this is probably a trivial speed difference, but you could try it this way:
                to_write = '\t'.join([key, d[key], key, value + '\n'])
                wf.write(to_write)
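If you want to see the difference between the two membership tests for yourself, here is a rough timing sketch. It is not a benchmark of your actual data: the dictionary size and the missing key are made up, and list(d) stands in for what d.keys() returns in Python 2 (in Python 3, d.keys() is a view and does not have this cost).

```python
import timeit

# Made-up dictionary with string keys, similar in shape to the one in pairer
d = dict((str(i), i) for i in range(20000))
keys_list = list(d)      # a full list of keys, like d.keys() in Python 2
missing = 'not-a-key'    # worst case for the list: every element is checked

# Membership on the dict is a hash lookup; on the list it is a linear scan
dict_time = timeit.timeit(lambda: missing in d, number=200)
list_time = timeit.timeit(lambda: missing in keys_list, number=200)

print('dict lookup: %.6f s' % dict_time)
print('list scan:   %.6f s' % list_time)
```

Note that this builds the list only once, so it understates the cost of writing key in d.keys() inside a loop in Python 2, where the list is rebuilt on every iteration.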
Chris Pak