
What am I trying to do?

A Python script that reads a large file (1-8 GB) in which every line has a record field (either "r" or "f") and an ID. If an "f" line and an "r" line have the same ID, combine them and save the result in a dictionary with the ID as the key. To speed up the process, I split the file into multiple smaller ones and run the function on the separate files in different processes, and at the end I combine the dictionaries created by each process.
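
As a toy sketch of the grouping I am after (the ID and the tokens here are made up, just to show the shape of the dictionary; the real field layout is in the code below):

from collections import defaultdict

combined = defaultdict(list)
# an "r" line and an "f" line with the same ID end up in one list under that ID
combined["id42"].extend("r_tok1 r_tok2".split(" "))  # payload of the "r" line
combined["id42"].extend("f_tok1 f_tok2".split(" "))  # payload of the "f" line
# combined["id42"] is now ['r_tok1', 'r_tok2', 'f_tok1', 'f_tok2']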

First I tried using a multiprocessing Pool, but it made the processing slower. I then read a suggestion to try multiprocessing Process and Queue instead, but when I put the dictionary (which is large) into the Queue, it looks like the Queue is not able to handle the size. The script hangs at this point.
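
For context, the Pool-based variant I mean looks roughly like this (a minimal sketch for illustration, not my exact code; the worker body would do the same parsing as mainFunction below):

import multiprocessing

def worker(thisFile):
    resultDict = {}
    # ... parse thisFile into resultDict, same as mainFunction below ...
    return resultDict

if __name__ == '__main__':
    files = ["f1.log.gz", "f2.log.gz"]
    pool = multiprocessing.Pool(processes=4)
    # each worker parses one file; its returned dict is pickled back to the parent
    partials = pool.map(worker, files)
    pool.close()
    pool.join()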

Any suggestions on how I can improve the processing time?

PS: I am sure this is CPU bound, because when I just did a `pass` while reading the lines it took 97 sec, and when I ran the entire script it took 600 sec.

import gzip
import multiprocessing
from collections import defaultdict

def rLine(line, resultDict):
    # "r" record: the ID is in field 2
    id = line[2]
    rOutput = line[4] + " " + line[5]  # this is a long string
    rList = rOutput.split(" ")
    resultDict[id].extend(rList)

def fLine(line, resultDict):
    # "f" record: the ID is in field 7
    id = line[7]
    fOutput = line[4] + " " + line[5]  # this is a long string
    fList = fOutput.split(" ")
    resultDict[id].extend(fList)

def mainFunction(thisFile, output):
    resultDict = defaultdict(list)
    with gzip.open(thisFile) as f:
        for lines in f:
            line = lines.split(" ")
            record = line[0]
            if record == "f":
                fLine(line, resultDict)
            else:
                rLine(line, resultDict)
    output.put(resultDict)

if __name__ == '__main__':
    finalDict = defaultdict(list)
    output = multiprocessing.Queue()
    processes = list()
    files = ["f1.log.gz", "f2.log.gz"]  # long list
    for f in files:
        processes.append(multiprocessing.Process(target=mainFunction, args=(f, output)))
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    for p in processes:
        finalDict.update(output.get())
    print finalDict
Praniti Gupta
  • when you ran the process without multiprocessing, was the CPU at 100%? – hansaplast Jan 12 '18 at 20:29
  • I think you need to pass `output` into the calls to `mainFunction`. – quamrana Jan 12 '18 at 20:31
  • how many ids do you get? Are there like a thousand different `id`s, or is it just around 10? – hansaplast Jan 12 '18 at 20:35
  • @quamrana, I had done that. Editing the code. I still face the same problem. The reason I am saying "Queue is not able to handle the size" is that if I do `output.put("some random text")`, the code works. – Praniti Gupta Jan 12 '18 at 20:36
  • @hansaplast, I didn't check if CPU was 100%. Will do that. To answer your 2nd question, there will be thousands of different IDs. – Praniti Gupta Jan 12 '18 at 20:37
  • best you check the CPU of all the cores. If it is CPU bound then you would see one core going to 100%. If you're on Linux then try htop – hansaplast Jan 12 '18 at 20:40
  • How does your code even work when you are not passing `resultDict` into both of `rLine` and `fLine`? – quamrana Jan 12 '18 at 20:41
  • also `finalDict.update(output.get())` will overwrite dict items instead of extending them (see the sketch after these comments). That's most probably not what you want. – hansaplast Jan 12 '18 at 20:47
  • @quamrana, sorry, I missed out passing resultDict in the code I pasted here. I have edited it. – Praniti Gupta Jan 12 '18 at 20:49
  • @hansaplast, I had the same reservation but it has worked till now. I am still working on fixing that part. My main concern is to reduce the processing time. 600 sec for a 1G file is way too long. My files are usually 5-10G. – Praniti Gupta Jan 12 '18 at 20:52
  • the code by itself does not look CPU bound. All you do is split up a line and then append to a dict. The gunzip part is maybe CPU bound. If you could add the CPU measurements (and also whether the CPU is constant or it only spikes at the gunzip and then goes down), that would help solve this question – hansaplast Jan 12 '18 at 21:07
  • Possible duplicate of [Python 3 Multiprocessing Pool is slow with large variables](https://stackoverflow.com/questions/45155377/python-3-multiprocessing-pool-is-slow-with-large-variables) – noxdafox Jan 12 '18 at 23:43
  • @hansaplast, The CPU utilization without multiprocessing is 100%. Can we safely say this is cpu bound? – Praniti Gupta Jan 17 '18 at 00:30
  • @PranitiGupta is it 100% all the time or just at the beginning? – hansaplast Jan 17 '18 at 05:21
  • also: how big is your file unzipped? You basically keep the file in memory (at least the id (line[7]), and line[4] and line[5], which you say are "long"). Maybe you run into RAM saturation? – hansaplast Jan 17 '18 at 05:24
  • @hansaplast, it is not 100% all the time. It keeps flipping between 50% and 100%: initially it is 50%, climbs to 100% and falls back. I even used psutil, and it takes 48% going through each line (`for lines in f`) and 30% in the parsing function `rLine()` – Praniti Gupta Jan 17 '18 at 18:55
  • it is hard to help without being able to replicate the problem. Can you share one of those gzip files? Or is it confidential data? – hansaplast Jan 17 '18 at 19:23
  • Unfortunately, I cannot share the gzip file. It is confidential. Thanks for your suggestion, though. – Praniti Gupta Jan 17 '18 at 19:32
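
To illustrate the point made in the comment above about `finalDict.update` (the keys and values here are made up): `update` replaces an existing key's list outright, whereas extending the lists keeps the data from every worker.

from collections import defaultdict

a = {"id42": ["r1", "r2"]}   # dict from one worker
b = {"id42": ["f1", "f2"]}   # dict from another worker, same ID

merged = dict(a)
merged.update(b)
print merged                 # {'id42': ['f1', 'f2']} - the list from a is lost

combined = defaultdict(list)
for partial in (a, b):
    for key, values in partial.items():
        combined[key].extend(values)
print dict(combined)         # {'id42': ['r1', 'r2', 'f1', 'f2']}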

0 Answers