
This is a script to calculate histograms, and I find that the csv library takes most of the time. How can I run it in parallel?

The input file samtools.depth.gz is 14 GB and contains about 3 billion lines.

from collections import Counter

SamplesList = ('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D')
cDepthCnt = {key:Counter() for key in SamplesList}
cDepthStat = {key:[0,0] for key in SamplesList} # sum(x) and sum(x^2)

def inStat(inDepthFile):
    import gzip
    import csv
    RecordCnt = 0
    MaxDepth = 0
    with gzip.open(inDepthFile, 'rt') as tsvfin:
        tsvin = csv.DictReader(tsvfin, delimiter='\t', fieldnames=('ChrID','Pos')+SamplesList )
        for row in tsvin:
            RecordCnt += 1
            for k in SamplesList:
                theValue = int(row[k])
                if theValue > MaxDepth:
                    MaxDepth = theValue
                cDepthCnt[k][theValue] += 1             # per-sample depth histogram
                cDepthStat[k][0] += theValue            # sum of depths
                cDepthStat[k][1] += theValue * theValue # sum of squared depths
    return RecordCnt,MaxDepth

RecordCnt,MaxDepth = inStat('samtools.depth.gz')
print('xxx')

Profiling with cProfile shows that most of the time is spent inside csv.py.
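For reference, a minimal sketch of how such a profile can be collected; the stats file name and the number of printed entries are arbitrary:

import cProfile
import pstats

# profile one call to inStat and dump the raw stats to a file
cProfile.run("inStat('samtools.depth.gz')", 'instat.prof')
# print the 10 functions with the largest cumulative time
pstats.Stats('instat.prof').sort_stats('cumulative').print_stats(10)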

There are ways to read a huge file in chunks and distribute them as lists, like https://stackoverflow.com/a/30294434/159695 :

bufsize = 65536
with open(path) as infile: 
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)

However, I assumed csv.DictReader only accepts file handles.

There is a way to split into temporary files at https://gist.github.com/jbylund/c37402573a896e5b5fc8 ; I wonder whether I can use a FIFO to do it on the fly.


I just found that csv.DictReader accepts any object which supports the iterator protocol and returns a string each time its __next__() method is called; file objects and list objects are both suitable.
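For example, a short sketch of feeding DictReader a list of lines instead of a file handle (the two lines here are made-up data):

import csv

SamplesList = ('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D')
# two made-up rows: ChrID, Pos, then one depth column per sample
lines = ['chr1\t100\t3\t4\t0\t7\n', 'chr1\t101\t2\t5\t1\t8\n']
tsvin = csv.DictReader(lines, delimiter='\t', fieldnames=('ChrID','Pos')+SamplesList)
for row in tsvin:
    print(row['ChrID'], row['Pos'], row['Sample_A'])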

I have modified inStat() to accept lines. Would you please help me to complete statPool()?

def statPool(inDepthFile):
    import gzip
    from multiprocessing import Pool
    RecordCnt = 0
    MaxDepth = 0
    cDepthCnt = {key:Counter() for key in SamplesList}
    cDepthStat = {key:[0,0] for key in SamplesList} # sum(x) and sum(x^2)
    with gzip.open(inDepthFile, 'rt') as tsvfin:
        while True:
            lines = tsvfin.readlines(ChunkSize)  # ChunkSize: size hint in bytes for one chunk
            if not lines:
                break
            with Pool(processes=4) as pool:
                res = pool.apply_async(inStat,[lines])
                # NOTE: get() blocks here, so only one chunk is processed at a time
                iRecordCnt,iMaxDepth,icDepthCnt,icDepthStat = res.get()
            RecordCnt += iRecordCnt
            if iMaxDepth > MaxDepth:
                MaxDepth = iMaxDepth
            for k in SamplesList:
                cDepthCnt[k].update(icDepthCnt[k])
                cDepthStat[k][0] += icDepthStat[k][0]
                cDepthStat[k][1] += icDepthStat[k][1]
    return RecordCnt,MaxDepth,cDepthCnt,cDepthStat

I think asyncio.Queue seems to be a good way to pipe chunks to multiple csv.DictReader workers.
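For what it is worth, here is one possible sketch of statPool() that keeps several chunks in flight instead of blocking on each one. It assumes SamplesList and the line-accepting inStat() from above are in scope, that inStat() returns (RecordCnt, MaxDepth, cDepthCnt, cDepthStat), and that the ChunkSize and nWorkers values are only guesses:

from collections import Counter
from multiprocessing import Pool
import gzip

ChunkSize = 1024 * 1024  # size hint in bytes passed to readlines(); tune as needed

def statPool(inDepthFile, nWorkers=4):
    RecordCnt = 0
    MaxDepth = 0
    cDepthCnt = {key:Counter() for key in SamplesList}
    cDepthStat = {key:[0,0] for key in SamplesList} # sum(x) and sum(x^2)
    with gzip.open(inDepthFile, 'rt') as tsvfin, Pool(processes=nWorkers) as pool:
        pending = []
        while True:
            lines = tsvfin.readlines(ChunkSize)
            if lines:
                # submit the chunk without waiting for its result
                pending.append(pool.apply_async(inStat, [lines]))
            # merge a finished chunk once enough are queued, or drain everything at EOF
            while pending and (len(pending) >= nWorkers or not lines):
                iRecordCnt, iMaxDepth, icDepthCnt, icDepthStat = pending.pop(0).get()
                RecordCnt += iRecordCnt
                if iMaxDepth > MaxDepth:
                    MaxDepth = iMaxDepth
                for k in SamplesList:
                    cDepthCnt[k].update(icDepthCnt[k])
                    cDepthStat[k][0] += icDepthStat[k][0]
                    cDepthStat[k][1] += icDepthStat[k][1]
            if not lines:
                break
    return RecordCnt, MaxDepth, cDepthCnt, cDepthStat

With the spawn start method (Windows, newer macOS), the call to statPool() would also need to sit under an if __name__ == '__main__': guard.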

Galaxy

1 Answer


Looking up names in global scope takes longer than looking them up in local scope.

You do a lot of lookups, so I suggest changing your code to:

cDepthCnt = {key:Counter() for key in SamplesList}
cDepthStat = {key:[0,0] for key in SamplesList} # sum(x) and sum(x^2)

def inStat(inDepthFile, depthCount, depthStat):
    # work on the local names depthCount and depthStat instead of the globals
    ...

RecordCnt,MaxDepth = inStat('samtools.depth.gz', cDepthCnt, cDepthStat)
print('xxx')

to speed that part up somewhat.
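The effect is easy to check with a small timeit micro-benchmark (illustrative only; the absolute numbers depend on the machine):

import timeit

x = 1
# the timed statement resolves x via a global lookup, because x only exists at module level
t_global = timeit.timeit("x + x + x + x", globals=globals(), number=10000000)
# here the setup rebinds x inside the timing function, so every lookup is a fast local one
t_local = timeit.timeit("x + x + x + x", setup="x = 1", number=10000000)
print("global:", t_global, "local:", t_local)

The same idea is why passing cDepthCnt and cDepthStat into inStat() as parameters helps: inside the function they become local names.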

Running parallelized code that accesses the same keys over and over will introduce locks on those values to avoid mishaps, and locking/unlocking takes time as well. You would have to measure whether it is actually faster.

All you do is sum up values. You could partition your data, let each of the 4 parts fill its own pair of dictionaries, and afterwards add the 4 partial dicts into your global ones to avoid locks, as in the sketch below.
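Something along these lines; the countPart / mergeParts names and the made-up input lines are only for illustration:

from collections import Counter
from multiprocessing import Pool

SamplesList = ('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D')

def countPart(lines):
    # each worker fills its own private dictionaries -- no shared state, no locks
    cnt = {key: Counter() for key in SamplesList}
    stat = {key: [0, 0] for key in SamplesList}  # sum(x) and sum(x^2)
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        for k, v in zip(SamplesList, fields[2:]):
            v = int(v)
            cnt[k][v] += 1
            stat[k][0] += v
            stat[k][1] += v * v
    return cnt, stat

def mergeParts(parts):
    # add the partial dicts into one global pair of dicts
    cDepthCnt = {key: Counter() for key in SamplesList}
    cDepthStat = {key: [0, 0] for key in SamplesList}
    for cnt, stat in parts:
        for k in SamplesList:
            cDepthCnt[k].update(cnt[k])  # Counter.update() adds the histograms together
            cDepthStat[k][0] += stat[k][0]
            cDepthStat[k][1] += stat[k][1]
    return cDepthCnt, cDepthStat

if __name__ == '__main__':
    lines = ['chr1\t%d\t3\t4\t0\t7\n' % i for i in range(100)]  # made-up data
    partitions = [lines[i::4] for i in range(4)]                # 4 parts, one per worker
    with Pool(processes=4) as pool:
        cDepthCnt, cDepthStat = mergeParts(pool.map(countPart, partitions))
    print(cDepthCnt['Sample_A'].most_common(3), cDepthStat['Sample_A'])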

Patrick Artner
  • Yes, it is 10% faster for looking up now. 22 ns instead of 25 ns. – Galaxy Feb 12 '19 at 08:46
  • Hi, I just used `multiprocessing.Pool` to write a version that works in single-job-queue mode in the question part. Would you help me to complete it? – Galaxy Feb 12 '19 at 09:55
  • @Galaxy: 22ns is the same amount of time as 25ns, if there is no difference in the order of magnitude -> there is no difference at all – Azat Ibrakov Feb 12 '19 at 12:02
  • @Azat - so a "non-magnitude" change in performance is not worth anything? This is 10% less time used, which might translate into 10% of a datacenter's calculation power becoming feasible to put to other uses by such a "simple" optimization – Patrick Artner Feb 12 '19 at 12:19
  • @PatrickArtner: if it is really 10% -- then it's great, if it's 3 nanoseconds, then it can be early to say if there were any savings at all, it depends on accuracy of measurements, probably it should be tested on a greater dataset to be sure – Azat Ibrakov Feb 12 '19 at 12:24
  • For the first 100000 lines, csv.py took 62 ns and my counting took 25 ns. – Galaxy Feb 13 '19 at 02:32
  • @Galaxy Unless you are dealing with files in the 100,000,000+ line range and above, this won't even make a dent in the execution time. How big is your data? Are there constraints on how fast this should work (must do 1e10 lines in < 1s else the nuclear reactor explodes)? Did you try pandas or numpy? Why read text files if you work with numbers? File size and reading would be much faster if the numbers were stored in binary, etc. – Patrick Artner Feb 13 '19 at 06:18
  • The size of input file `depth.tsv.gz` is 14G, contains about 3 billion lines. – Galaxy Feb 22 '19 at 07:08