I have an odd problem. According to posts like this one, I would expect IDLE to be slower than running my code from the command line. However, I see the complete opposite.
The program compares two files, pairs matching lines together, and writes all matches to a new file. I think it's similar to a join in SQL, except that I need to ignore lines without matches. Here is an overview of what the program does:
- The program reads one large file (~1kb) and stores the key-value pair from each line in a dictionary
- Then it starts reading the other large file (~1kb). For each line, it tests whether that line's key is present in the dictionary. If so, it writes the paired values to a new file.
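The two steps above can be sketched with small in-memory samples. The lists and their contents here are hypothetical stand-ins for the real files, not my actual data:

```python
# Hypothetical stand-ins for the two input files (each element is one line).
lines_1 = ["20\tinfo_first_file_20\n", "18\tinfo_first_file_18\n"]
lines_2 = ["20\tinfo_second_file_20\n", "30\tinfo_second_file_30\n"]

# Step 1: store the key/value pair from each line of the first file.
d = {}
for line in lines_1:
    key, value = line.strip('\n').split('\t')
    d[key] = value

# Step 2: for each line of the second file, keep it only if its key
# was seen in the first file, pairing the two values together.
matches = []
for line in lines_2:
    key, value = line.strip('\n').split('\t')
    if key in d:
        matches.append(key + '\t' + d[key] + '\t' + key + '\t' + value + '\n')

print(matches)  # only key 20 appears in both sample "files"
```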
It seems like the program gets stuck when it tries to access the very large dictionary. It takes about 2-3 minutes to run in IDLE, but after 1 hr the program is still not done on the command line. When I check the write_file, it is being written to, just at a very slow rate.
Here is some simplified data from the first file, where the fields are separated by a tab; the number is the key and the text is the value:
20\tinfo_first_file_20\n
18\tinfo_first_file_18\n
Here is an example of the second file:
20\tinfo_second_file_20\n
30\tinfo_second_file_30\n
Here is an example of the file being written:
20\tinfo_first_file_20\t20\tinfo_second_file_20\n
Function
def pairer(file_1, file_2, write_file):
    wf = open(write_file, 'w')
    f1 = open(file_1, 'r')
    line = f1.readline()
    d = {}
    while line != "":
        key, value = line.strip('\n').split('\t')
        d[key] = value
        line = f1.readline()
    f2 = open(file_2, 'r')
    line_2 = f2.readline()
    while line_2 != "":
        key, value = line_2.strip('\n').split('\t')
        if key in d.keys():
            to_write = key + '\t' + d[key] + '\t' + key + '\t' + value + '\n'
            wf.write(to_write)
        line_2 = f2.readline()
How I run the code in IDLE
if __name__ == "__main__":
    pairer('file/location/of/file_1', 'file/location/of/file_2', 'file/location/of/write_file')
How I run the code in terminal
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('file_1', action="store")
    parser.add_argument('file_2', action="store")
    parser.add_argument('write_file', action="store")
    results = parser.parse_args()
    pairer(results.file_1, results.file_2, results.write_file)
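For reference, the argparse wiring above can be checked by feeding parse_args an explicit argument list instead of the real command-line arguments; the file names here are made up for illustration:

```python
import argparse

# Same parser as in the script; the argument list below is a
# hypothetical example rather than real sys.argv contents.
parser = argparse.ArgumentParser()
parser.add_argument('file_1', action="store")
parser.add_argument('file_2', action="store")
parser.add_argument('write_file', action="store")

results = parser.parse_args(['in_a.txt', 'in_b.txt', 'out.txt'])
print(results.file_1)      # in_a.txt
print(results.write_file)  # out.txt
```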
All the code is a simplification of the actual code. I hope I included enough to allow someone to point me in the right direction, but not so much that it stops being to the point. I am new to programming, so this might be an obvious problem, but I haven't been able to find anything.
Is there a maximum dictionary size when running from the terminal? Is the dictionary stored differently there, which ends up maxing out my memory?
I'm on Mac OS X 10.8, Tkinter is up to date, and I'm using Python 2.7. Thanks in advance.
EDIT:
Before the program reaches this point, it has about 30 mins of other analysis to do, but it only fails here. Not sure if that is related. The other part just separates two huge files (~30kb each) into 22 smaller files, so I can deal with the data on a smaller level. No dictionaries are involved there, and the speed is about the same in both environments.
EDIT 2:
Does memory get cleared differently when using the terminal?
Another thing I noticed: when I look at the Activity Monitor app, Python seems to use much more of the CPU when I run the code in IDLE. It looks like it is using more than one processor, but that doesn't make sense, since my code isn't written to run in parallel. The computer also makes more noise when I run it in IDLE. Not very quantitative, but an observation.