I'm fairly new to Python, so I apologize in advance for the (numerous) errors you will see in my code.
I have a fairly simple Python script that is causing me some trouble: the code reads an output file coming from a Gate simulation, whose content is mostly numerical values in ASCII format. The data in the file is organized in rows of 22 numbers each; the total number of rows depends on the simulation running time and can be quite large (the largest file I have at the moment is around 11 GB). Here's a typical line of the file as reference:
0 0 -1 0 0 -1 -1 4.99521423437116252053158e-01 7.910e-02 1.347e+01 -1.600e+01 -1.600e+01 -1.347e+01 22 1 0 -1 -1 -1 Compton NULL NULL
What my Python code does is read the numerical values from each row of this file and extract the ones I'm interested in; some of the extracted values are then analyzed in a for loop, and finally a group of these values is written to an output file in ASCII format.
The code works fine; however, as I started working with longer simulation times (and therefore larger input files), the running time of the code has become too long, making it impractical. My goal is to reduce the running time as much as possible.
Here's the part of the code that I'm currently using and that is causing the slowdown:
#position of centres
PMTloc =np.array([[-270.00, -156.00], [-270.00, -104.00], [-270.00, -52.00], [-270.00, 0.00], [-270.00, 52.00], [-270.00, 104.00], [-270.00, 156.00], [-225.00, -182.00], [-225.00, -130.00], [-225.00, -78.00], [-225.00, -26.00], [-225.00, 26.00], [-225.00, 78.00], [-225.00, 130.00], [-225.00, 182.00], [-180.00, -208.00], [-180.00, -156.00], [-180.00, -104.00], [-180.00, -52.00], [-180.00, 0.00], [-180.00, 52.00], [-180.00, 104.00], [-180.00, 156.00], [-180.00, 208.00], [-135.00, -234.00], [-135.00, -182.00], [-135.00, -130.00], [-135.00, -78.00], [-135.00, -26.00], [-135.00, 26.00], [-135.00, 78.00], [-135.00, 130.00], [-135.00, 182.00], [-135.00, 234.00], [-90.00, -208.00], [-90.00, -156.00], [-90.00, -104.00], [-90.00, -52.00], [-90.00, 0.00], [-90.00, 52.00], [-90.00, 104.00], [-90.00, 156.00], [-90.00, 208.00], [-45.00, -234.00], [-45.00, -182.00], [-45.00, -130.00], [-45.00, -78.00], [-45.00, -26.00], [-45.00, 26.00], [-45.00, 78.00], [-45.00, 130.00], [-45.00, 182.00], [-45.00, 234.00], [0.00, -208.00], [0.00, -156.00], [0.00, -104.00], [0.00, -52.00], [0.00, 0.00], [0.00, 52.00], [0.00, 104.00], [0.00, 156.00], [0.00, 208.00], [45.00, -234.00], [45.00, -182.00], [45.00, -130.00], [45.00, -78.00], [45.00, -26.00], [45.00, 26.00], [45.00, 78.00], [45.00, 130.00], [45.00, 182.00], [45.00, 234.00], [90.00, -208.00], [90.00, -156.00], [90.00, -104.00], [90.00, -52.00], [90.00, 0.00], [90.00, 52.00], [90.00, 104.00], [90.00, 156.00], [90.00, 208.00], [135.00, -234.00], [135.00, -182.00], [135.00, -130.00], [135.00, -78.00], [135.00, -26.00], [135.00, 26.00], [135.00, 78.00], [135.00, 130.00], [135.00, 182.00], [135.00, 234.00], [180.00, -208.00], [180.00, -156.00], [180.00, -104.00], [180.00, -52.00], [180.00, 0.00], [180.00, 52.00], [180.00, 104.00], [180.00, 156.00], [180.00, 208.00], [225.00, -182.00], [225.00, -130.00], [225.00, -78.00], [225.00, -26.00], [225.00, 26.00], [225.00, 78.00], [225.00, 130.00], [225.00, 182.00]])
#checking number of arguments passed
n = len(sys.argv)
if n < 3:
    print("Error: number of arguments passed is incorrect. Try again...")
    sys.exit(1)
print("argv[0]: {0}".format(argv[0]))
print("argv[1]: {0}".format(argv[1]))
input_file = argv[1]
output_file = argv[2]
hits = []
with open(input_file, 'r') as DataIn:
    for line in DataIn:
        hits += [line.split()]
totalPMT = 108
runID = [x[0] for x in hits]
runid = [int(i) for i in runID]
eventID = [x[1] for x in hits]
eventid = [int(i) for i in eventID]
time = [x[7] for x in hits]
time_fl = [float(i) for i in time]
posX = [x[10] for x in hits]
posx = [float(i) for i in posX]
posY = [x[11] for x in hits]
posy = [float(i) for i in posY]
posZ = [x[12] for x in hits]
posz = [float(i) for i in posZ]
partID = [x[13] for x in hits]
partid = [int(i) for i in partID]
PMTs = [0]*108 #initial output signal
nscinti = 0
nphoton = 0
startscinti = 0
zscinti = 0
Etres = 100 #energy threshold
numb_22 = 0
numb_0 = 0
eventIDnow = [0]*len(eventid)
eventidnow = [int(i) for i in eventIDnow]
runIDnow = [0]*len(runid)
runidnow = [int(i) for i in runIDnow]
for i in range(len(eventid)):
    if ((eventid[i] > eventidnow[i]) | (runid[i] > runidnow[i])):
        if ((nscinti > 0) & (partid[i] == 22) & (nphoton > Etres)): #check if last recorded event should be written
            with open(output_file, 'a+') as DataOut:
                DataOut.write("{0} {1} {2}\n".format(startscinti, zscinti, ' '.join(map(str, PMTs))))
        for l in range(len(eventidnow)): #updating values of eventidnow and runidnow
            eventidnow[l] = eventid[i]
            runidnow[l] = runid[i]
        startscinti = time_fl[i]
        zscinti = posz[i]
        nphoton = 0
        nscinti += 1
        for j in range(len(PMTs)):
            PMTs[j] = 0
    #checking where the event gets recorded
    for k in range(len(PMTs)):
        if ((k <= totalPMT) & ((posx[i]-PMTloc[k][0])*(posx[i]-PMTloc[k][0]) + (posy[i]-PMTloc[k][1])*(posy[i]-PMTloc[k][1]) <= 23*23)):
            PMTs[k] += 1
            nphoton += 1
            break
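As a side note on the inner PMT search: since the centres are already a NumPy array, the per-row loop over all 108 centres could be expressed as a single vectorized distance test. This is only a sketch of the idea (with a shortened, made-up PMTloc just for the example), not a tested drop-in replacement for my loop:

```python
import numpy as np

# Shortened list of centres just for this example; the real script uses all 108.
PMTloc = np.array([[-270.0, -156.0], [-270.0, -104.0], [-270.0, -52.0]])

def find_pmt(px, py, centres, radius=23.0):
    """Return the index of the first PMT whose centre lies within `radius`
    of (px, py), or None if the hit lands outside every PMT."""
    d2 = (centres[:, 0] - px) ** 2 + (centres[:, 1] - py) ** 2
    inside = np.flatnonzero(d2 <= radius * radius)
    return int(inside[0]) if inside.size else None
```

With such a helper, the inner `for k in range(len(PMTs))` loop would collapse to a single `find_pmt(posx[i], posy[i], PMTloc)` call per row; whether it is actually faster would need measuring, since the original loop already breaks early on a match.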
I'm fully aware that this code is far from perfect and optimized, so I'm open to any suggestion on how to improve it.
From the measurements I have done, it takes around 35 minutes for the script to complete on a 224 MB input file; with an input file of around 10 GB, the running time goes above 100 hours.
I would like to add that I have access to a cluster where I can run this code with up to 12-16 cores; ideally I would like to take advantage of these cores to improve the performance of my script. By doing some research I found out about multiprocessing and parallelization, but, due to my inexperience, I'm not sure whether those methods can be applied to my case, nor whether they would actually reduce the running time; in any case, I wasn't able to implement them correctly. Any help on how I could use multiprocessing or parallelization would be appreciated.
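From my reading, the usual pattern would be to read the file once, split the list of rows into consecutive chunks, and hand each chunk to a worker with `multiprocessing.Pool`. Below is a minimal sketch of that structure (the `process` function here is just a placeholder for the real per-chunk analysis); one caveat I'm aware of is that my analysis carries state across rows (event boundaries), so the chunk boundaries would need extra care to avoid splitting an event between workers:

```python
from multiprocessing import Pool

def chunks(seq, n):
    """Split seq into n consecutive chunks of near-equal length."""
    k, m = divmod(len(seq), n)
    return [seq[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def process(chunk):
    # Placeholder for the real analysis: it would return the output
    # lines produced from this chunk of rows.
    return [row[0] for row in chunk]

if __name__ == "__main__":
    hits = [["0", "x"], ["1", "x"], ["2", "x"], ["3", "x"], ["4", "x"]]
    with Pool(2) as pool:
        partial = pool.map(process, chunks(hits, 2))
    # pool.map preserves chunk order, so flattening gives results in file order
    results = [line for part in partial for line in part]
```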
In any case, I'm open to any possible suggestion to improve my code; it doesn't have to take advantage of the cluster if there are better ways to achieve my goal.
Thanks everyone for your time!
UPDATE 19/08
Thanks to the help of @DarrylG I was able to improve the running time of my script to around 1/8 of the initial time, and I'm happy with this result. Here is the version that I'm using right now:
import sys
from sys import argv
import os
import itertools
import time
import numpy as np
from timeit import default_timer as timer
from datetime import timedelta
def int_or_float(s):
    ' Convert string to int or float '
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return s
def process(hits):
    totalPMT = 108
    runid, eventid, time_fl, posx, posy, posz, partid = zip(*[(x[0], x[1], x[7], x[10], x[11], x[12], x[13]) for x in hits])
    runid = [int(i) for i in runid]
    eventid = [int(i) for i in eventid]
    time_fl = [float(i) for i in time_fl]
    posx = [float(i) for i in posx]
    posy = [float(i) for i in posy]
    posz = [float(i) for i in posz]
    partid = [int(i) for i in partid]
    PMTs = [0]*108 #initial output signal
    nscinti, nphoton, startscinti, zscinti = [0]*4
    Etres = 100 #energy threshold
    # Previous value of eventID and runID
    eventIDnow = 0
    runIDnow = 0
    results = []
    for i in range(len(eventid)):
        if (eventid[i] > eventIDnow) or (runid[i] > runIDnow):
            if (nscinti > 0) and (partid[i] == 22) and (nphoton > Etres):
                print("Writing output file")
                results.append("{0} {1} {2}\n".format(startscinti, zscinti, ' '.join(map(str, PMTs))))
            eventIDnow = eventid[i]
            runIDnow = runid[i]
            startscinti = time_fl[i]
            zscinti = posz[i]
            nphoton = 0
            nscinti += 1
            PMTs = [0]*len(PMTs)
        #checking where the event gets recorded
        k = next((k for k in range(len(PMTs))
                  if (k <= totalPMT) and
                     ((posx[i]-PMTloc[k][0])*(posx[i]-PMTloc[k][0]) +
                      (posy[i]-PMTloc[k][1])*(posy[i]-PMTloc[k][1]) <= 23*23)),
                 None)
        if k is not None:  # `if k:` would wrongly skip PMT index 0
            PMTs[k] += 1
            nphoton += 1
    return results
#position of centres
PMTloc =np.array([[-270.00, -156.00], [-270.00, -104.00], [-270.00, -52.00], [-270.00, 0.00], [-270.00, 52.00], [-270.00, 104.00], [-270.00, 156.00], [-225.00, -182.00], [-225.00, -130.00], [-225.00, -78.00], [-225.00, -26.00], [-225.00, 26.00], [-225.00, 78.00], [-225.00, 130.00], [-225.00, 182.00], [-180.00, -208.00], [-180.00, -156.00], [-180.00, -104.00], [-180.00, -52.00], [-180.00, 0.00], [-180.00, 52.00], [-180.00, 104.00], [-180.00, 156.00], [-180.00, 208.00], [-135.00, -234.00], [-135.00, -182.00], [-135.00, -130.00], [-135.00, -78.00], [-135.00, -26.00], [-135.00, 26.00], [-135.00, 78.00], [-135.00, 130.00], [-135.00, 182.00], [-135.00, 234.00], [-90.00, -208.00], [-90.00, -156.00], [-90.00, -104.00], [-90.00, -52.00], [-90.00, 0.00], [-90.00, 52.00], [-90.00, 104.00], [-90.00, 156.00], [-90.00, 208.00], [-45.00, -234.00], [-45.00, -182.00], [-45.00, -130.00], [-45.00, -78.00], [-45.00, -26.00], [-45.00, 26.00], [-45.00, 78.00], [-45.00, 130.00], [-45.00, 182.00], [-45.00, 234.00], [0.00, -208.00], [0.00, -156.00], [0.00, -104.00], [0.00, -52.00], [0.00, 0.00], [0.00, 52.00], [0.00, 104.00], [0.00, 156.00], [0.00, 208.00], [45.00, -234.00], [45.00, -182.00], [45.00, -130.00], [45.00, -78.00], [45.00, -26.00], [45.00, 26.00], [45.00, 78.00], [45.00, 130.00], [45.00, 182.00], [45.00, 234.00], [90.00, -208.00], [90.00, -156.00], [90.00, -104.00], [90.00, -52.00], [90.00, 0.00], [90.00, 52.00], [90.00, 104.00], [90.00, 156.00], [90.00, 208.00], [135.00, -234.00], [135.00, -182.00], [135.00, -130.00], [135.00, -78.00], [135.00, -26.00], [135.00, 26.00], [135.00, 78.00], [135.00, 130.00], [135.00, 182.00], [135.00, 234.00], [180.00, -208.00], [180.00, -156.00], [180.00, -104.00], [180.00, -52.00], [180.00, 0.00], [180.00, 52.00], [180.00, 104.00], [180.00, 156.00], [180.00, 208.00], [225.00, -182.00], [225.00, -130.00], [225.00, -78.00], [225.00, -26.00], [225.00, 26.00], [225.00, 78.00], [225.00, 130.00], [225.00, 182.00]])
#checking number of arguments passed
if __name__ == "__main__":
    n = len(sys.argv)
    if n < 3:
        print("Error: number of arguments passed is incorrect. Try again...")
        sys.exit(1)
    print("argv[0]: {0}".format(argv[0]))
    print("argv[1]: {0}".format(argv[1]))
    input_file = argv[1]
    output_file = argv[2]
    #t0 = time.time()
    start = timer()
    hits = []
    with open(input_file, 'r') as data_in:
        for line in data_in:
            hits += [line.split()]
    results = process(hits)
    with open(output_file, 'a') as data_out:
        data_out.writelines(results)
    end = timer()
    print("Elapsed Time: ", timedelta(seconds=end-start))
I do have a question about this script, though; to be precise, it concerns this part:
hits = []
with open(input_file, 'r') as data_in:
    for line in data_in:
        hits += [line.split()]
This way of splitting the lines of the input file should be very inefficient compared to the following one, which uses the function int_or_float(s):
with open(input_file, 'r') as data_in:
    hits = [[int_or_float(s) for s in line.split()] for line in data_in]
However, after testing with multiple input files (with sizes ranging from a few MB to a couple of GB), I'm getting the best results, time-wise, with the first method (the supposedly inefficient one). As a reference, analyzing the same input file of around 2.3 GB, the "inefficient" method takes around 50 minutes while the "function" method takes around 55 minutes; in both cases I'm using the same Python code to analyze the input file, the only difference being which of the two reading methods is used.
Any idea why this could be happening?
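My current guess (not something I have profiled properly): int_or_float() adds a Python function call, and often a raised-and-caught ValueError, for every one of the 22 tokens on every line, while the "inefficient" version defers conversion and the analysis later converts only the 7 columns it actually uses. A third option I'd like to time is converting just the needed columns while reading; a sketch, using the column indices from the script above:

```python
def parse_line(line):
    """Convert only the columns the analysis uses:
    runID, eventID, time, posX, posY, posZ, partID."""
    t = line.split()
    return (int(t[0]), int(t[1]), float(t[7]),
            float(t[10]), float(t[11]), float(t[12]), int(t[13]))

sample = ("0 0 -1 0 0 -1 -1 4.99521423437116252053158e-01 7.910e-02 1.347e+01 "
          "-1.600e+01 -1.600e+01 -1.347e+01 22 1 0 -1 -1 -1 Compton NULL NULL")
row = parse_line(sample)
```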
Thank you for your help!
UPDATE 20/08
Thanks to @DarrylG I got a new version of the script, which I had the chance to test this morning. I compared my latest version (the one posted in the 19/08 update above) against DarrylG's latest version, which can be found in the accepted answer (noted as "Faster"). I measured the time both versions need to generate a complete output file, using input files of different sizes:
Input file size 224 MB
- Computational time of my version: 4 minutes 10 seconds.
- Computational time of DarrylG's version: 4 minutes 22 seconds.
Input file size 2.3 GB
- Computational time of my version: 48 minutes 3 seconds.
- Computational time of DarrylG's version: 42 minutes 2 seconds.
I also checked the output files that were generated to see if there were any differences, and they're identical.
Although this is far from an in-depth test, I believe it's safe to say that DarrylG's latest version is the fastest one, especially when working with large input files.
Thanks everyone for your help, in case someone has more questions about the performance of the script feel free to contact me.