This thread explains what CPU-bound and I/O-bound problems are.
Given that Python has the GIL, someone recommended:
• Use threads for I/O bound problems
• Use processes, networking, or events (discussed in the next section) for CPU-bound problems
Honestly, I cannot fully and intuitively understand what these problems really are.
Here is the situation I'm faced with:
    def crawl_and_save_them_in_db(item):
        # Pseudo code
        # crawled_items = crawl item from the web  (usually about 200 items per crawl)
        # while crawled_items:
        #     save them in PostgreSQL DB
        #     crawled_items = crawl item from the web

    crawl_item_list = [item1, item2, item3 ....]
    for item in crawl_item_list:
        crawl_and_save_them_in_db(item)
This is the task that I want to run in parallel, using either processes or threads.
(Each process or thread would run its own crawl_and_save_them_in_db and handle one item.)
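To make the intended structure concrete, here is a minimal sketch of the thread-based variant. It assumes two hypothetical stand-ins, `fetch_batch` and `save_batch`, in place of the real crawling and PostgreSQL code (neither is shown here); they only sleep to imitate waiting on the network and the database. The `run_all` helper and the `max_workers` value are also illustrative choices, not a recommendation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_batch(item, page):
    """Hypothetical stand-in for the crawler: returns a batch, or [] when done."""
    time.sleep(0.01)  # imitates waiting on the network
    if page >= 2:
        return []
    return [f"{item}-{page}-{i}" for i in range(3)]


def save_batch(rows):
    """Hypothetical stand-in for the PostgreSQL insert: returns the row count."""
    time.sleep(0.01)  # imitates waiting on the database
    return len(rows)


def crawl_and_save_them_in_db(item):
    """Crawl batches for one item until none are left, saving each batch."""
    saved = 0
    page = 0
    crawled_items = fetch_batch(item, page)
    while crawled_items:
        saved += save_batch(crawled_items)
        page += 1
        crawled_items = fetch_batch(item, page)
    return saved


def run_all(crawl_item_list, max_workers=4):
    """Run one crawl_and_save_them_in_db call per item on a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map keeps the results in the same order as the input items
        return list(executor.map(crawl_and_save_them_in_db, crawl_item_list))


if __name__ == "__main__":
    print(run_all(["item1", "item2", "item3"]))
```

Replacing `ThreadPoolExecutor` with `ProcessPoolExecutor` from the same module would give the process-based variant with the same `map` call, so the two options can be timed against each other on the real workload.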
In this case, which should I choose: multiprocessing (something like Pool) or multithreading?
My thinking is that the main job of this task is storing data in the DB, which is an I/O-bound task (I hope it is), so I should use multithreading. Am I right?
Any advice would be appreciated.