
I have a massive CSV file (15+ GB) that I am processing with fuzzy matching to pull out rows, but the script only uses one core when I check Resource Monitor, and it takes a really long time to run. This is an example of the current script:

import csv
import pandas as pd
from fuzzywuzzy import fuzz

with open('output.txt', 'w', newline='', encoding='utf8') as fw:
    writer = csv.writer(fw, delimiter=',', lineterminator='\n')
    for chunk in pd.read_csv("inputfile.csv", chunksize=10000, sep=','):
        for index, row in chunk.iterrows():
            if fuzz.token_set_ratio("search_terms", str(row['item_name'])) > 90 and row['brand'] != 'example_brand':
                print(row['item_name'], row['brand'])  # just for visual confirmation, since the script runs for hours
                line = (row['id'], row['brand'], row['item_name'])
                writer.writerow(line)

I want to set this up so that the chunks are distributed to multiple processes using multiprocessing.Pool, but I'm pretty new to Python and haven't had any luck following examples and getting it to work. The script below pegs all 4 CPU cores, but it seems to spawn a bunch of processes that are immediately terminated without doing anything, as far as I can tell. Does anybody know why it's behaving like this, and how to get it functioning correctly?

import csv
import multiprocessing as mp
import pandas as pd
from fuzzywuzzy import fuzz

def fuzzcheck(chunk):
    for index, row in chunk.iterrows():
        if fuzz.token_set_ratio("search_terms", str(row['item_name'])) > 90 and row['brand'] != "example_brand":
            print(row['item_name'], row['brand'])
            line = (row['id'], row['brand'], row['item_name'])
            writer.writerow(line)

with mp.Pool(4) as pool, open('output.txt', 'w', newline='', encoding='utf8') as fw:
    writer = csv.writer(fw, delimiter=',', lineterminator='\n')
    for chunk in pd.read_csv("inputfile.csv", chunksize=10000, sep=','):
        pool.apply(fuzzcheck, chunk)
Jakewb89
  • Take a look at this answer [here](https://stackoverflow.com/a/26598452/9059420) – Darkonaut Jul 26 '18 at 21:50
  • This helped me get the script sorted out, but still did not work. After much more digging I found out that Spyder just doesn't like to do multiprocessing unless you launch it in a new window when running. – Jakewb89 Jul 30 '18 at 16:14

1 Answer

The answer was contained here: No multiprocessing print outputs (Spyder)

It turns out Spyder just doesn't run multiprocessing properly unless the script is launched in an external console window.
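Beyond the Spyder quirk, the posted script has two other problems worth noting: `pool.apply` blocks until each call returns, so chunks are processed one at a time with no parallelism (`pool.imap` or `apply_async` is needed), and the worker processes can't see the parent's `writer` object, so they crash with a `NameError` when they find a match. A minimal sketch of a working layout is below: workers return their matched rows and the parent does all the writing. To keep the sketch self-contained, `fuzz.token_set_ratio` is stood in by a `difflib`-based scorer on the same 0–100 scale; swap the real scorer back in for actual use.

```python
import csv
import difflib
import multiprocessing as mp

import pandas as pd

SEARCH = "search_terms"

def score(a, b):
    # Stand-in for fuzz.token_set_ratio (0-100 scale), using only the
    # standard library so this sketch runs as-is. Replace with
    # fuzz.token_set_ratio(a, b) in the real script.
    return 100 * difflib.SequenceMatcher(None, a, b).ratio()

def fuzzcheck(chunk):
    # Runs in a worker process: collect matching rows and return them.
    # Workers must not touch the parent's file handle or writer.
    matches = []
    for _, row in chunk.iterrows():
        if score(SEARCH, str(row['item_name'])) > 90 and row['brand'] != 'example_brand':
            matches.append((row['id'], row['brand'], row['item_name']))
    return matches

def run(in_path, out_path, workers=4):
    with mp.Pool(workers) as pool, \
         open(out_path, 'w', newline='', encoding='utf8') as fw:
        writer = csv.writer(fw, delimiter=',', lineterminator='\n')
        chunks = pd.read_csv(in_path, chunksize=10000, sep=',')
        # imap hands chunks to workers as they become free and yields
        # results in order; pool.apply would block on every chunk and
        # give no parallelism at all.
        for matches in pool.imap(fuzzcheck, chunks):
            writer.writerows(matches)
```

Call it as `run("inputfile.csv", "output.txt")`. On Windows (and in Spyder), the call that creates the `Pool` must sit under an `if __name__ == '__main__':` guard, or the spawned children will re-import the module and fail.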

Jakewb89