python parallel compare 2 csv files

Question

I am trying to compare two CSV files, each containing 100000 rows and 10 columns. The code below works, but it uses only one CPU thread even though I have 8 cores. I want it to use all of them. I searched and found the idea of running in parallel, but when I tried to apply it to the for loop in this Python code, it did not work. How can I parallelize this code? Thank you in advance for your help!

import csv  
#read csv files
f1= file('host.csv','r')
f2= file('master.csv','r')
f3= file('results.csv','w') 

c1=csv.reader(f1) 
c2=csv.reader(f2)
next(c2, None)
c3=csv.writer(f3)
#for loop compare row in host csv file 
master_list = list(c2) 
for row in c1: 
    row=1
    found = False
    colA = str(row[0])  #protocol
    colB = str(row[11])  
    colC = str(row[12])  
    colD = str(row[13]) 
    colE = str(row[14])  
    #loop in each row of master csv file
    for master_row in master_list:
        results_row=row
        colBf2 = str(master_row[4])  
        colCf2 = str(master_row[5])  
        colDf2 = str(master_row[6])  
        colEf2 = str(master_row[7])  
        colFf2 = str(master_row[3])
        #check condition
        if colA == 'icmp':
           #sub condiontion
           if colB == colBf2 and colD == colDf2:
              results_row.append(colFf2)
              found = True
              break
           row = row + 1
        else:
           if colB == colBf2 and colD == colDf2 and colE == colEf2:
              results_row.append(colFf2)
              found = True
              break
           row =row+1
    if not found:
        results_row.append('Not Match')
    c3.writerow(results_row)
f1.close()
f2.close()
f3.close()

You're going to want to look into multiprocessing (if you choose to stay with Python). Check out this answer to a similar question for some advice: https://stackoverflow.com/a/12293094/4383396 — Grant Williams, May 02 '18 at 16:06
Does the order of the rows written via `results_row` matter? When parallelized, results may intermix. — tdelaney, May 02 '18 at 16:11
You have a bug. `row` is an integer, but you set `results_row=row` and later call `results_row.append(colFf2)`. It's an int, not a list. — tdelaney, May 02 '18 at 19:31

tdelaney · Answer 1 · 2018-05-03T04:17:33.017

The expensive task is the inner loop that rescans the master table for each host row. Because of CPython's global interpreter lock (you can search "python GIL"), only one thread executes Python code at a time, so multiple threads don't speed up a CPU-bound operation. You could spawn subprocesses, but then you have to weigh the cost of getting the data to the worker processes against the speed gain.
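For comparison, the subprocess route could look roughly like the sketch below. It is only an illustration: the function names (`init_worker`, `match_row`, `match_all`) are made up here, and the column positions are copied from the question's code. Each worker receives its own copy of the master rows once, which is exactly the data-transfer cost mentioned above.

```python
import multiprocessing as mp

# Column positions copied from the question's code (assumed correct there).
HOST_A, HOST_B, HOST_D, HOST_E = 0, 11, 13, 14
MASTER_B, MASTER_D, MASTER_E, MASTER_F = 4, 6, 7, 3

_master = None  # per-process copy of the master rows


def init_worker(master_rows):
    """Runs once in each worker process and stores its copy of the master."""
    global _master
    _master = master_rows


def match_row(host_row):
    """Linear scan of the master table for one host row; first match wins."""
    is_icmp = host_row[HOST_A] == 'icmp'
    for m in _master:
        if host_row[HOST_B] != m[MASTER_B] or host_row[HOST_D] != m[MASTER_D]:
            continue
        if is_icmp or host_row[HOST_E] == m[MASTER_E]:
            return host_row + [m[MASTER_F]]
    return host_row + ['Not Match']


def match_all(host_rows, master_rows, workers=None):
    """Spread host rows over worker processes; Pool.map keeps input order."""
    with mp.Pool(workers, initializer=init_worker,
                 initargs=(master_rows,)) as pool:
        return pool.map(match_row, host_rows, chunksize=1000)
```

Because `Pool.map` returns results in input order, the output rows do not intermix. On platforms that start workers by spawning a fresh interpreter, `match_all` should be called from under an `if __name__ == '__main__':` guard.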

Or optimize your code: instead of running in parallel, index the master table. You can exchange an expensive scan of 100000 records for a quick dictionary lookup.
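To make the trade concrete, here is a toy comparison of the two lookups. The data and the names `scan`, `index` and `lookup` are invented for illustration; the point is only that a dictionary built once gives the same answer as the first-match scan.

```python
# Toy master table of (key, value) pairs; the contents are illustrative only.
master = [
    (('10.0.0.1', '80'), 'web'),
    (('10.0.0.2', '22'), 'ssh'),
    (('10.0.0.1', '80'), 'duplicate'),  # later row with a repeated key
]


def scan(key):
    """O(n) per lookup: walk the list and stop at the first match."""
    for k, value in master:
        if k == key:
            return value
    return 'Not Match'


# Build the index once. setdefault keeps the first value seen for a key,
# which mirrors the `break` in a nested-loop search.
index = {}
for k, value in master:
    index.setdefault(k, value)


def lookup(key):
    """Average O(1) per lookup once the index exists."""
    return index.get(key, 'Not Match')
```

With 100000 host rows, the scan version does up to 100000 comparisons per row, while the indexed version does one hash lookup per row after a single pass over the master.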

I took the liberty of adding `with` clauses to your code to save a few lines, and skipped breaking out colA and so on (using named indexes instead) to keep the code small.

import csv

# columns of interest in host.csv
A, B, C, D, E = 0, 11, 12, 13, 14
# matching columns in master.csv, taken from the question's inner loop
MB, MD, ME, MF = 4, 6, 7, 3

# read and index column MF in master by (B,D) and (B,D,E), keeping only
# the first row seen for each key (like the `break` in the original loop)
col_index = {}
with open('master.csv') as master:
    next(master)  # skip the header line
    for row in csv.reader(master):
        key = row[MB], row[MD]
        if key not in col_index:
            col_index[key] = row[MF]
        key = row[MB], row[MD], row[ME]
        if key not in col_index:
            col_index[key] = row[MF]

# read host.csv and write each row plus its match to results.csv
with open('host.csv') as f1, open('results.csv','w') as f3:
    c1=csv.reader(f1)
    c3=csv.writer(f3)
    for row in c1:
        if row[A] == "icmp":
            indexer = (row[B], row[D])
        else:
            indexer = (row[B], row[D], row[E])
        row.append(col_index.get(indexer, 'Not Match'))
        c3.writerow(row)
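A quick way to sanity-check this approach without touching the real files is to run the same logic on in-memory data. The sketch below is an assumption-laden restatement, not the code above: the helper names `build_index` and `annotate` are invented, the sample rows are made up, and the master column positions (4, 6, 7 and 3) are the ones used in the question's inner loop.

```python
import csv
import io

# Host columns and master columns as used in the question's code.
HOST_A, HOST_B, HOST_D, HOST_E = 0, 11, 13, 14
MASTER_B, MASTER_D, MASTER_E, MASTER_F = 4, 6, 7, 3


def build_index(master_file):
    """Index master column F by (B, D) and (B, D, E); first row wins."""
    reader = csv.reader(master_file)
    next(reader, None)  # skip the header row
    index = {}
    for row in reader:
        value = row[MASTER_F]
        index.setdefault((row[MASTER_B], row[MASTER_D]), value)
        index.setdefault((row[MASTER_B], row[MASTER_D], row[MASTER_E]), value)
    return index


def annotate(host_file, index, out_file):
    """Write every host row with the matched value (or 'Not Match') appended."""
    writer = csv.writer(out_file)
    for row in csv.reader(host_file):
        if row[HOST_A] == 'icmp':
            key = (row[HOST_B], row[HOST_D])
        else:
            key = (row[HOST_B], row[HOST_D], row[HOST_E])
        writer.writerow(row + [index.get(key, 'Not Match')])


def host_line(proto, b, d, e):
    """Build a 15-column host line with only the columns of interest filled."""
    return ','.join([proto] + [''] * 10 + [b, 'x', d, e])


# Tiny in-memory stand-ins for master.csv and host.csv.
master_csv = io.StringIO(
    'h0,h1,h2,name,src,x,dst,port\n'
    'a,b,c,F1,10.0.0.1,x,10.0.0.9,80\n'
)
host_csv = io.StringIO('\n'.join([
    host_line('tcp', '10.0.0.1', '10.0.0.9', '80'),
    host_line('tcp', '10.0.0.1', '10.0.0.9', '443'),
    host_line('icmp', '10.0.0.1', '10.0.0.9', '0'),
]) + '\n')

out = io.StringIO()
annotate(host_csv, build_index(master_csv), out)
results = [line.split(',')[-1] for line in out.getvalue().splitlines()]
# results == ['F1', 'Not Match', 'F1']
```

The second host row fails because its last column differs from the master row, while the icmp row matches on the two-column key alone.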

How could I add the sub-conditions that compare a row in host to a row in master by column index, as in my code? `#sub condition if colB == colBf2 and colD == colDf2: results_row.append(colFf2) found = True break row = row + 1 else: if colB == colBf2 and colD == colDf2 and colE == colEf2: results_row.append(colFf2) found = True break row =row+1` — Yoekleng Kuy, May 02 '18 at 18:25
I saw three conditions: `if colB == colBf2 and colD == colDf2:`, `if colB == colBf2 and colD == colDf2 and colE == colEf2:` and the default "Not Match". Those are handled by indexing by (B,D) and (B,D,E) and the default to `get`. I think I messed up by allowing more than one master row with that condition whereas you break. I'll fix that. — tdelaney, May 02 '18 at 19:28
Can master have more than one row with the same columns B and D? If so, which one should be used? — tdelaney, May 02 '18 at 19:29
