I have a question regarding Python multiprocessing.
I have a large CSV file, test.csv, with 2 million lines and two columns, firm_id and product_id. The last one, product_id, is an input to another function, say func1(product_id).

That is the basic setup. Since the file is very large and each product_id can be processed independently, I want to use Python's multiprocessing module, which I have never touched before. After googling for a while, I found some useful information (like this and this), but none of it enabled me to complete my task. I tried adapting the last one as shown below, but it did not work:
import itertools as IT
import multiprocessing as mp
import csv

import funcitons as fdfunc  # a self-defined module with function func1 in it

def worker(chunk):
    return len(chunk)

def main():
    num_procs = 2      # number of workers in the pool
    chunksize = 10**5  # number of lines in a chunk
    pool = mp.Pool(num_procs)
    largefile = 'test.csv'
    results = []
    with open(largefile, 'r') as f, \
         open('file_to_store_result.csv', 'a+') as res_file:
        reader = csv.reader(f)
        writer = csv.writer(res_file)
        for chunk in iter(lambda: list(IT.islice(reader, chunksize * num_procs)), []):
            chunk = iter(chunk)
            pieces = list(iter(lambda: list(IT.islice(chunk, chunksize)), []))
            # pieces['product_id'] is definitely wrong; it is just to show what I want to do
            result = pool.imap(fdfunc.func1, pieces['product_id'])
            for item in result:
                writer.writerow(item)
            results.append(result)

if __name__ == '__main__':
    main()
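To make my intent clearer, here is a minimal sketch of the logic I have in mind, assuming product_id is the second column of each row and that func1 returns a single value that can be written out as a one-column row (file_to_store_result.csv is just my output file; I am not sure this is the right approach):

import csv
import multiprocessing as mp

import funcitons as fdfunc  # self-defined module with func1 in it

def main():
    num_procs = 2
    largefile = 'test.csv'
    with mp.Pool(num_procs) as pool, \
         open(largefile, 'r') as f, \
         open('file_to_store_result.csv', 'a+', newline='') as res_file:
        reader = csv.reader(f)
        writer = csv.writer(res_file)
        next(reader, None)  # skip the header row, if there is one
        # lazily pull out just the product_id column (second column)
        product_ids = (row[1] for row in reader)
        # imap feeds product_ids to the workers in chunks and yields
        # results in order, so the whole file never sits in memory
        for item in pool.imap(fdfunc.func1, product_ids, chunksize=10**4):
            writer.writerow([item])

if __name__ == '__main__':
    main()

I am not sure whether imap's chunksize argument alone bounds memory the way my explicit chunking above tries to, or whether I still need to slice the reader manually.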
Does anyone know how to do what I was trying?