I am trying to read a large text file (> 20 GB) with Python. The file contains the positions of atoms for 400 frames, and each frame is independent as far as the computations in this code are concerned. In theory I could split the job into 400 tasks without any need for communication. Each frame has 1,000,000 lines, so the file has 1,000,000 * 400 lines of text. My initial approach is to use multiprocessing with a pool of workers:
import sys
import time
import mmap
import multiprocessing as mp

def main():
    """Locate frame boundaries and collect one chunk of bytes per frame."""
    filename = sys.argv[1]
    nump = int(sys.argv[2])
    f = open(filename, 'rb')
    # Memory-map the file so frames can be sliced without reading 20+ GB into RAM.
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    cursor = 0
    framelocs = []
    start = time.time()
    print(mp.cpu_count())
    chunks = []
    while True:
        initial = s.find(b'ITEM: TIMESTEP', cursor)
        if initial == -1:
            break
        cursor = initial + 14
        final = s.find(b'ITEM: TIMESTEP', cursor)
        framelocs.append([initial, final])
        # readchunk(s[initial:final])
        if final == -1:
            # Last frame: slice to the end of the file rather than s[initial:-1].
            chunks.append(s[initial:])
            break
        chunks.append(s[initial:final])
Here I am basically scanning the file to find where each frame begins and ends, opening it with Python's mmap module to avoid reading everything into memory.
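For reference, each frame in the .lammpstrj dump starts with the usual LAMMPS header lines, roughly like this (the numbers below are just placeholders, and the exact ITEM: ATOMS columns depend on the dump command):

ITEM: TIMESTEP
1000
ITEM: NUMBER OF ATOMS
1000000
ITEM: BOX BOUNDS pp pp pp
0.0 50.0
0.0 50.0
0.0 50.0
ITEM: ATOMS id type x y z
1 1 1.2345 6.7890 2.4680
...

so searching for b'ITEM: TIMESTEP' gives me the start of every frame.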
def readchunk(chunk):
    start = time.time()
    part = chunk.split(b'\n')
    # The line right after 'ITEM: TIMESTEP' holds the timestep value.
    timestep = int(part[1])
    print(timestep)
Now I would like to send the chunks of the file to a pool of workers to process. The read part will eventually be more complex, but those lines will be implemented later.
    print('Seeking file took %8.6f' % (time.time() - start))

    pool = mp.Pool(nump)
    start = time.time()
    results = pool.map(readchunk, chunks[0:16])
    print('Reading file took %8.6f' % (time.time() - start))

if __name__ == '__main__':
    main()
If I run this sending 8 chunks to 8 cores, it takes about 0.8 s to read. However, if I run it sending 16 chunks to 16 cores, it takes about 1.7 s. It seems like the parallelization gives no speedup. I am running this on Oak Ridge's Summit supercomputer, in case that is relevant, with this command:
jsrun -n1 -c16 -a1 python -u ~/Developer/DipoleAnalyzer/AtomMan/readlargefile.py DW_SET6_NVT.lammpstrj 16
This is supposed to create 1 MPI task and assign 16 cores to 16 threads. Am I missing something here? Is there a better approach?
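One variant I was also wondering about, since pool.map has to pickle every sliced chunk and ship it to a worker process: pass only the byte offsets and let each worker mmap the file itself. A rough, untested sketch of what I mean (read_offsets and the tasks list are just illustrative names, not part of my current code):

import mmap
import multiprocessing as mp

def read_offsets(args):
    # Each worker opens its own read-only mmap view and slices out one frame,
    # so only the filename and two integers travel through pool.map.
    filename, initial, final = args
    with open(filename, 'rb') as f:
        s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        chunk = s[initial:final] if final != -1 else s[initial:]
        part = chunk.split(b'\n')
        return int(part[1])  # the timestep, same as readchunk()

# after the while loop that fills framelocs:
# tasks = [(filename, ini, fin) for ini, fin in framelocs]
# pool = mp.Pool(nump)
# timesteps = pool.map(read_offsets, tasks)

Would something along those lines be the right direction, or is the slowdown coming from somewhere else?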