I have a big text file that needs to be processed. I first read all text into a list and then use ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel and get_relations().
I am on Mac and my observations show that it doesn't really speed up the processing (cpu with 8 cores, only 15% cpu is used). If there is a performance bottleneck in either the function is_channel or get_relations, then the multithreading won't help much. Is that the reason for no performance gain? Should I try to use multiprocessing to speed up instead of multithreading?
def process_file(file_name):
all_lines = []
with open(file_name, 'r', encoding='utf8') as f:
for index, line in enumerate(f):
line = line.strip()
all_lines.append(line)
# Classify text
all_results = []
with ThreadPoolExecutor(max_workers=10) as executor:
for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
all_results.append(result)
for index, entities_relations_list in enumerate(all_results):
# print out results
def process_text(text, channel):
global channel_text
global non_channel_text
is_right_channel = is_channel(text, channel)
entities = ()
relations = None
entities_relations_list = set()
entities_relations_list.add((entities, relations))
if is_right_channel:
channel_text += 1
entities_relations_list = get_relations(text, channel)
return (text, entities_relations_list, is_right_channel)
non_channel_text += 1
return (text, entities_relations_list, is_right_channel)