
I have a script that loops through ~335k filenames, opens the FITS tables behind the filenames, performs a few operations on the tables and writes the results to a file. In the beginning the loop runs relatively fast, but over time it consumes more and more RAM (and CPU resources, I guess) and the script also gets slower. I would like to know how I can improve the performance / make the code quicker. E.g., is there a better way to write to the output file (open the output file once and do everything inside a with-open block, vs. opening the file anew each time I want to write to it)? Is there a better way to loop? Can I dump memory that I don't need anymore?
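
To make the file-writing part of my question concrete, here is a minimal sketch of the two options I have in mind (`process` is just a made-up stand-in for my table operations, and the file list is shortened):

import numpy as np

def process(fname):
    # made-up stand-in for reading the FITS table and resampling it
    return np.zeros(56)

list_of_fnames = ['spec1.fits', 'spec2.fits']  # made-up list; mine has ~335k entries

# Option A: open the output file once and keep writing to the same handle
with open("output.txt", "ab") as f:
    for fname in list_of_fnames:
        np.savetxt(f, [process(fname)], delimiter=',', fmt='%1.4f')

# Option B: reopen the output file in append mode for every single write
for fname in list_of_fnames:
    with open("output.txt", "ab") as f:
        np.savetxt(f, [process(fname)], delimiter=',', fmt='%1.4f')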

My script looks like this:

# spectral is a package for manipulation of spectral data
from spectral import *
import numpy as np
from astropy.table import Table  # Table.read(..., format='fits') comes from astropy

# I use this dictionary to cache functions that I don't want to generate anew
# each time I need them. Generating them anew would be more time-consuming, I figured out.
lam_resample_dic = {}

# list_of_fnames (the ~335k file names) and df_jpas (the target filter table)
# are defined earlier in my script.
with open("/home/bla/Downloads/output.txt", "ab") as f:

    for fname, ind in zip(list_of_fnames, range(len(list_of_fnames))):
        data_s = Table.read('/home/nestor/Downloads/all_eBoss_QSO/'+fname, format='fits')
        # lam_str_identifier is just the dict key I need for finding the
        # corresponding BandResampler function from below
        lam_str_identifier = ''.join([str(x) for x in data_s['LOGLAM'].data.astype(str)])

        if lam_str_identifier not in lam_resample_dic:
            # BandResampler is the function I avoid creating anew every time;
            # I do it only when lam_str_identifier indicates a unique new set of data
            resample = BandResampler(centers1=10**data_s['LOGLAM'], centers2=df_jpas["Filter.wavelength"].values, fwhm2=df_jpas["Filter.width"].values)
            lam_resample_dic[lam_str_identifier] = resample
            photo_spec = np.around(resample(data_s['FLUX']), 4)
        else:
            photo_spec = np.around(lam_resample_dic[lam_str_identifier](data_s['FLUX']), 4)

        np.savetxt(f, [photo_spec], delimiter=',', fmt='%1.4f')

        # this is just to keep track of the progress of the loop
        if ind % 1000 == 0:
            print('num of files processed so far:', ind)

Thanks for any suggestions!

NeStack
  • Won't make it faster but: `fname, ind in zip(list_of_fnames, range(len(list_of_fnames)))` should be `for ind, fname in enumerate(list_of_fnames):`. https://docs.python.org/3/library/functions.html#enumerate – wwii Dec 01 '21 at 20:55
  • When you run it on a few of the files do you get correct results? – wwii Dec 01 '21 at 20:59
  • How big is the result of `np.savetxt`? Do all the files produce a similarly sized result? You could try keeping each text result in a list then write the list contents to the file once. Or save individual modified files then go back and [concatenate them](https://stackoverflow.com/questions/13613336/how-do-i-concatenate-text-files-in-python). – wwii Dec 01 '21 at 21:05
  • @wwii Yes, thanks, when running on only a few (thousand) files the results are correct. I also see that the script gets slower with time. My first idea was indeed to store the results in a large array and, after all the files are done, save that array to a file. But this was also getting slower at a similar rate. So my thinking was that writing the results straight to a file might free up RAM that would otherwise be consumed by storing the results in the large array. Is my thinking wrong? Any other suggestions in this regard? – NeStack Dec 01 '21 at 21:08
  • Probably not enough information in your question. What datatype is `data_s`? Can you provide a link to the docs of the `spectral` package? Does `Table.read(...)` close the file when it returns? `lam_resample_dic` is the only thing that for sure *grows* with time (one way to bound it is sketched after these comments). I really don't know what happens to memory when you open a file for appending and then do a lot of writes to it; you could do an experiment to see if that may be the culprit - just open a file for appending and write fake data to it in a loop. – wwii Dec 01 '21 at 21:30
  • It is hard to help you without indication of what exactly is the bottleneck - please use `line_profiler` and add results to the question. – dankal444 Dec 01 '21 at 21:56
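
For reference, a minimal sketch of one way to bound `lam_resample_dic`, following the observation above that it is the only thing that keeps growing. The cache limit `MAX_CACHE` and the `get_resampler`/`make_resampler` names are made up for illustration:

import collections

MAX_CACHE = 1000  # made-up limit; tune to the available RAM

lam_resample_dic = collections.OrderedDict()

def get_resampler(key, make_resampler):
    # Return a cached resampler for `key`, evicting the least recently
    # used entry once the cache holds more than MAX_CACHE items.
    if key in lam_resample_dic:
        lam_resample_dic.move_to_end(key)     # mark as recently used
        return lam_resample_dic[key]
    resampler = make_resampler()
    lam_resample_dic[key] = resampler
    if len(lam_resample_dic) > MAX_CACHE:
        lam_resample_dic.popitem(last=False)  # drop the least recently used entry
    return resampler

Inside the loop this would be called as `resample = get_resampler(lam_str_identifier, lambda: BandResampler(...))`, replacing the `if lam_str_identifier not in lam_resample_dic` branch.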

0 Answers