Apologies for the poor title; I was having trouble figuring out how to word it.
I've written a Python program that reads a file of words into a list, then iterates over the list and performs a check on each item.
I'm now trying to speed this up, as the list is quite large.
I'm trying to use Python's multiprocessing module to achieve this. I have included an example of the code below, with some loops expanded by hand to make it clearer what is going on. Essentially, I split the list into 10 parts and send each part to a separate process. The program works and returns the expected result, but the checking part takes around 22 seconds to run.
import time
import pickle
import multiprocessing as mp
import check
def check_v3(read_list, query, return_list):
    # Accept the query either as a string or as a list of characters.
    if isinstance(query, str):
        query = list(query)
    query_len = len(query)
    # Dispatch to the check function written for this query length.
    if query_len == 1:
        results = check.check_v3_1(read_list, query)
    elif query_len == 2:
        results = check.check_v3_2(read_list, query)
    elif query_len == 3:
        results = check.check_v3_3(read_list, query)
    elif query_len == 4:
        results = check.check_v3_4(read_list, query)
    elif query_len == 5:
        results = check.check_v3_5(read_list, query)
    elif query_len == 6:
        results = check.check_v3_6(read_list, query)
    elif query_len == 7:
        results = check.check_v3_7(read_list, query)
    else:
        # Without this branch, results would be undefined below.
        raise ValueError("query must contain between 1 and 7 characters")
    # Append this chunk's results to the list shared between processes.
    return_list.append(results)


def read_pickle(file_name):
    with open(file_name, "rb") as fin:
        read_list = pickle.load(fin)
    return read_list
if __name__ == "__main__":
    read_list = read_pickle("pickled_list")
    split_list_1 = read_list[:(round(len(read_list)/10))]
    split_list_2 = read_list[(round(len(read_list)/10)*1):(round(len(read_list)/10)*2)]
    split_list_3 = read_list[(round(len(read_list)/10)*2):(round(len(read_list)/10)*3)]
    split_list_4 = read_list[(round(len(read_list)/10)*3):(round(len(read_list)/10)*4)]
    split_list_5 = read_list[(round(len(read_list)/10)*4):(round(len(read_list)/10)*5)]
    split_list_6 = read_list[(round(len(read_list)/10)*5):(round(len(read_list)/10)*6)]
    split_list_7 = read_list[(round(len(read_list)/10)*6):(round(len(read_list)/10)*7)]
    split_list_8 = read_list[(round(len(read_list)/10)*7):(round(len(read_list)/10)*8)]
    split_list_9 = read_list[(round(len(read_list)/10)*8):(round(len(read_list)/10)*9)]
    split_list_10 = read_list[(round(len(read_list)/10)*9):]
    query = "check"
    manager = mp.Manager()
    return_list = manager.list()
    p1 = mp.Process(target=check_v3, args=(split_list_1, query, return_list))
    p2 = mp.Process(target=check_v3, args=(split_list_2, query, return_list))
    p3 = mp.Process(target=check_v3, args=(split_list_3, query, return_list))
    p4 = mp.Process(target=check_v3, args=(split_list_4, query, return_list))
    p5 = mp.Process(target=check_v3, args=(split_list_5, query, return_list))
    p6 = mp.Process(target=check_v3, args=(split_list_6, query, return_list))
    p7 = mp.Process(target=check_v3, args=(split_list_7, query, return_list))
    p8 = mp.Process(target=check_v3, args=(split_list_8, query, return_list))
    p9 = mp.Process(target=check_v3, args=(split_list_9, query, return_list))
    p10 = mp.Process(target=check_v3, args=(split_list_10, query, return_list))
    start_time = time.time()
    p1.start()
    p2.start()
    p3.start()
    p4.start()
    p5.start()
    p6.start()
    p7.start()
    p8.start()
    p9.start()
    p10.start()
    p1.join()
    p2.join()
    p3.join()
    p4.join()
    p5.join()
    p6.join()
    p7.join()
    p8.join()
    p9.join()
    p10.join()
    print("--- %s seconds ---" % (time.time() - start_time))
    print(return_list)
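For reference, the hand-expanded slicing and process setup above can also be written with loops. The following is a minimal sketch rather than the exact program: `split_into_chunks`, `check_chunk`, and `run_parallel` are hypothetical names, and `check_chunk` is a simple substring filter standing in for `check_v3`, since the `check` module is not shown here.

```python
import multiprocessing as mp


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous slices; the last slice takes the remainder."""
    size = round(len(items) / n_chunks)
    chunks = [items[i * size:(i + 1) * size] for i in range(n_chunks - 1)]
    chunks.append(items[(n_chunks - 1) * size:])
    return chunks


def check_chunk(chunk, query, return_list):
    # Hypothetical stand-in for check_v3: keep the words that contain the query.
    return_list.append([word for word in chunk if query in word])


def run_parallel(words, query, n_procs=10):
    """Start one process per chunk and collect the per-chunk results."""
    manager = mp.Manager()
    return_list = manager.list()
    procs = [
        mp.Process(target=check_chunk, args=(chunk, query, return_list))
        for chunk in split_into_chunks(words, n_procs)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Copy out of the manager proxy; the order of the chunks is not guaranteed.
    return list(return_list)


if __name__ == "__main__":
    words = ["check", "cheek", "recheck", "chuck"] * 5
    print(run_parallel(words, "check", n_procs=4))
```

The chunk boundaries follow the same `round(len(read_list)/10)` arithmetic as the expanded version, with the final chunk taking whatever remains.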
I thought this was taking longer than expected, so I tried something else to see whether it would still take as long (see the code below). I essentially copied and pasted the Python code 4 times, but hard-coded in each program that it would only run the checks on 1/4 of the same list that was given to the original program (each program gets a different quarter). Each program then writes out a pickled version of its result list, and finally another script combines the 4 separate pickled lists. When I run this bash script, the checking part of each program takes under 2 seconds to run.
#!/bin/bash
python3 check_1_4.py &
python3 check_2_4.py &
python3 check_3_4.py &
python3 check_4_4.py &
wait
python3 -i read_4split.py
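The combining script (read_4split.py) is not shown. A minimal sketch of what such a step might look like, assuming each of the four programs wrote its results as a pickled list; the function name `merge_pickled_lists` and any file names passed to it are hypothetical:

```python
import pickle


def merge_pickled_lists(paths):
    """Load each pickled list in the given order and concatenate them."""
    merged = []
    for path in paths:
        with open(path, "rb") as fin:
            merged.extend(pickle.load(fin))
    return merged
```

Calling it with the four output paths, in order, would give one list equivalent to running the check over the whole input, provided each program handled a distinct quarter.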
I'm not sure why there is such a big difference between the single Python script using multiprocessing and the bash script that simply launches several Python scripts. I'm sure there is something obvious I'm missing here, but I just can't seem to find what it is.
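One thing that may make the two timings easier to compare is to measure the check inside each worker process separately from the overall wall-clock time around start() and join(), since the latter also covers process startup and handing the chunks to the children. A rough sketch under those assumptions; `timed_call`, `count_matches`, and `worker` are hypothetical names, and `count_matches` stands in for the real check functions:

```python
import multiprocessing as mp
import time


def timed_call(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def count_matches(chunk, query):
    # Hypothetical stand-in for the real check functions.
    return sum(1 for word in chunk if query in word)


def worker(chunk, query, return_list):
    # Time only the check itself, from inside the child process.
    result, elapsed = timed_call(count_matches, chunk, query)
    return_list.append((result, elapsed))


if __name__ == "__main__":
    words = ["check", "cheek", "recheck"] * 100_000
    chunks = [words[i::4] for i in range(4)]
    manager = mp.Manager()
    return_list = manager.list()
    procs = [
        mp.Process(target=worker, args=(chunk, "check", return_list))
        for chunk in chunks
    ]
    wall_start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    wall = time.perf_counter() - wall_start
    for result, elapsed in return_list:
        print("check only: %.3f s (%d matches)" % (elapsed, result))
    print("start-to-join wall time: %.3f s" % wall)
```

If the per-worker numbers are small while the start-to-join time is large, the difference lies outside the check itself; if both are large, the check is slower when run this way.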