With respect to these amazing answers and this blog post, I still have a small question. Namely, the blog post concludes from its benchmarks that, because of the GIL, running with threads is slower than running without them:
               simple   threading   multiprocessing
threads = 2     4.124       5.539             2.034
threads = 3     6.391      13.772             3.376
threads = 4     9.194      17.641             4.720
So threading is even slower than simple execution. This is understood from the behaviour of GIL discussed above and should not surprise us now.
I benchmarked my own function (scraping data and writing it to a file) in the same manner as in the post, and got the following results:
simple: 15 min, threading: 10 min, multiprocessing: 5 min.
So why can threading be faster than the simple method without any threading?
EDIT: A short description of the functions
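import multiprocessing
from threading import Thread  # only needed for the threading variant

processes = []
for thread in range(4):
    # ranging[thread] holds the pages assigned to this worker
    process = multiprocessing.Process(name=str(thread), target=perform_extraction, args=(ranging[thread],))
    #process = Thread(name=str(thread), target=perform_extraction, args=(ranging[thread],))
    process.start()
    processes.append(process)
for process in processes:
    process.join()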

def perform_extraction(ranges):
    thread_name = multiprocessing.current_process().name
    #thread_name = currentThread().getName()
    for page in ranges:
        data = extract_data(page)
        write_data(data, thread_name+'.txt')
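
For anyone who wants to reproduce the comparison without the scraper, here is a minimal sketch of a timing harness. `fake_extraction`, `run_simple`, `run_workers` and `timed` are hypothetical names, not part of the code above, and the `time.sleep` call is only an assumed stand-in for a page request that spends its time waiting on the network. It may not reflect what `extract_data` really does.

```python
import multiprocessing
import time
from threading import Thread


def fake_extraction(pages, delay=0.05):
    """Hypothetical stand-in for perform_extraction.

    Sleeps once per page to imitate waiting on a network response
    (an assumption about the workload, not a measurement of it).
    """
    for _ in pages:
        time.sleep(delay)


def run_simple(chunks, work=fake_extraction):
    # Process every chunk one after another in the main thread.
    for chunk in chunks:
        work(chunk)


def run_workers(chunks, worker_cls, work=fake_extraction):
    # Start one worker per chunk, then wait for all of them.
    # worker_cls is either Thread or multiprocessing.Process.
    workers = [worker_cls(target=work, args=(chunk,)) for chunk in chunks]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


def timed(func, *args):
    # Wall-clock duration of a single call, in seconds.
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start
```

With `chunks = [range(5)] * 4`, the three variants can be timed as `timed(run_simple, chunks)`, `timed(run_workers, chunks, Thread)` and `timed(run_workers, chunks, multiprocessing.Process)`. The last call should sit under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn new interpreter processes. Replacing the sleep with a pure-Python computation would give a CPU-bound variant closer to the blog post's benchmark.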