
import concurrent.futures
import multiprocessing
import time

import pandas as pd

# Load the CSV in chunks and append each chunk to a single DataFrame.
df = pd.DataFrame()
for chunk in pd.read_csv('rows.csv', skipinitialspace=True, encoding='utf8',
                         engine='python', chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)

pools = []  # collects error messages from the comparison loop


class ParallelMultiProcess:
    @staticmethod
    def create_df():
        # Drop rows with missing values from the shared DataFrame.
        df.dropna(inplace=True)


class Compare:
    @staticmethod
    def read_files(i):
        # Compare row i of the "Product" column with every later row.
        for t in range(i + 1, 2000):
            try:
                print(str(df["Product"].iloc[i]) + " " + str(df["Product"].iloc[t]))
            except Exception:
                pools.append("Something went wrong")


class ParallelExtractor:
    def __init__(self):
        with concurrent.futures.ThreadPoolExecutor() as executor:
            # Pass the callable itself; calling it with () would run it in the main thread.
            executor.submit(ParallelMultiProcess.create_df)

    def runprocess(self):
        start_time = time.time()
        with multiprocessing.Pool(processes=20) as pool:  # computer freezes here
            pool.map(Compare.read_files, range(1, 2000))
        print(time.time() - start_time)
  

I'm trying to organize a large dataset and compare the rows in each column with each other. This process takes a long time, so I want to use multiprocessing to reduce the runtime, but when I run this code the program stops or the computer freezes. I tried doing it with threads, but that did not reduce the runtime, and my aim is to process this data faster. How do I process 1 million rows with multiprocessing? What should I do so that the computer does not freeze? What is the maximum number of processes I can use?

pzr
  • Your loading loop will have terrible performance. It has quadratic O(N^2) complexity. Easy fix: Don’t call concat() in every iteration of the loop. Instead, keep a list of df items and concat them all at once after the loop finishes (see the first sketch below this comment thread). – Stuart Berg Nov 19 '22 at 15:21
  • Also, a chunk size of 1000 is probably smaller than optimal. Try something 10x or 100x larger. – Stuart Berg Nov 19 '22 at 15:21
  • I suspect there is no need to resort to multiprocessing, but it’s hard to know unless you actually include the comparison code you intend to use. As far as I can tell, your code above doesn’t actually compare anything, it just prints. – Stuart Berg Nov 19 '22 at 15:22
  • BTW, there is no benefit to reading in chunks if you aren’t going to process each chunk separately. – Stuart Berg Nov 19 '22 at 15:26
  • PS ... on Windows you must put your code inside the main condition https://stackoverflow.com/questions/20222534/python-multiprocessing-on-windows-if-name-main (see the second sketch below this comment thread). – Ahmed AEK Nov 19 '22 at 15:31
  • I didn't include the comparison code to avoid confusing the question. There are 1.5 million rows in my csv file and I am asked to compare them, but I need to reduce the time this takes. I don't know how to do it with multiprocessing, but this is what they want from me @StuartBerg – pzr Nov 19 '22 at 17:18
  • 1.5 million rows is not all that much data, depending on the number of columns. If your comparison is simple, it should take mere seconds to run, without multiprocessing. If your comparison code is complex, then optimizing it may be simpler than multiprocessing. Hence, I recommend sharing the actual comparison code. Even if multiprocessing is the right answer, the exact multiprocessing solution may depend on what you need to compute. – Stuart Berg Nov 19 '22 at 19:06
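
For reference, a minimal sketch of the batched-concat loading suggested in the comments might look like the following; the file name, read_csv options, and the larger 100,000-row chunk size are assumptions carried over from the question and the comments, not a tested configuration:

import pandas as pd

# Collect the chunks in a list and concatenate once at the end, instead of
# calling pd.concat() inside the loop (which re-copies the whole frame on
# every iteration and makes loading quadratic).
chunks = []
for chunk in pd.read_csv('rows.csv', skipinitialspace=True, encoding='utf8',
                         engine='python', chunksize=100_000):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
df.dropna(inplace=True)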
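
And a minimal sketch of the `if __name__ == '__main__':` guard that the linked answer describes might look like this; compare_row is only a placeholder, since the actual comparison logic isn't shown in the question, and the file name and "Product" column are taken from the question's code:

import multiprocessing

import pandas as pd

# Loaded at module level so that worker processes, which re-import this module
# on Windows (spawn), also get their own copy of df.
df = pd.read_csv('rows.csv', skipinitialspace=True, encoding='utf8')

def compare_row(i):
    # Placeholder: the real comparison logic isn't shown in the question.
    return str(df["Product"].iloc[i])

if __name__ == '__main__':
    # On Windows, any code that starts processes must sit under this guard,
    # otherwise each child re-runs the Pool-creating code when it imports the module.
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        results = pool.map(compare_row, range(len(df)))
    print(len(results))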

0 Answers