
My program needs to read ~400,000 CSV files and it takes a very long time. The code I use is:

        for file in self.files:
            size = 2048
            csvData = pd.read_csv(file, sep='\t', names=['acol', 'bcol'], header=None,
                                  skiprows=range(0, int(size/2)), skipfooter=(int(size/2) - 10))

            # average the first 10 rows of 'bcol'
            s = 0
            for index in range(0, 10):
                s = s + float(csvData['bcol'][index])
            s = s / 10
            averages.append(s)

            # extract the numeric timestamp from the file name
            time = file.rpartition('\\')[2]
            time = int(re.search(r'\d+', time).group())
            times.append(time)

Is there a chance to increase the speed?

prody
  • You could use multithreading / subprocesses to speed things up. Have a look at https://stackoverflow.com/questions/44950893/processing-huge-csv-file-using-python-and-multithreading for a similar problem. – AnsFourtyTwo Sep 23 '19 at 08:49
  • https://stackoverflow.com/questions/52289386/loading-multiple-csv-files-of-a-folder-into-one-dataframe maybe this also helps – PV8 Sep 23 '19 at 08:52

1 Answer


You can use threading. I took the following code from here and modified it for your use case:

    import re
    from threading import Thread

    import pandas as pd

    averages = []
    times = []

    def my_func(file):
        size = 2048
        csvData = pd.read_csv(file, sep='\t', names=['acol', 'bcol'], header=None,
                              skiprows=range(0, int(size/2)), skipfooter=(int(size/2) - 10))

        # Average the first 10 rows of 'bcol'.
        s = 0
        for index in range(0, 10):
            s = s + float(csvData['bcol'][index])
        s = s / 10
        averages.append(s)

        # Extract the numeric timestamp from the file name.
        time = file.rpartition('\\')[2]
        time = int(re.search(r'\d+', time).group())
        times.append(time)

    threads = []
    # In this case 'self.files' is a list of files to be read.
    for file in self.files:
        # We start one thread per file present.
        process = Thread(target=my_func, args=[file])
        process.start()
        threads.append(process)
    # We now pause execution on the main thread by 'joining' all of our started threads.
    # This ensures that each has finished processing its file.
    # Note: the appends run concurrently, so averages[i] and times[i] are not
    # guaranteed to belong to the same file.
    for process in threads:
        process.join()
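
Starting one thread per file gets expensive with ~400,000 files, so a bounded pool is usually the more practical variant of the same idea. Below is a minimal sketch (my own, not the linked code) using concurrent.futures.ThreadPoolExecutor; it assumes the paths are available in a plain list called `files` and that a pool size of 8 is only a starting point to tune. Returning (time, average) pairs instead of appending to shared lists also keeps the two results aligned per file.

    import re
    from concurrent.futures import ThreadPoolExecutor

    import pandas as pd

    size = 2048

    def read_one(file):
        # Same per-file work as above: read the middle slice of the file,
        # average the first 10 'bcol' values, and pull the timestamp from the name.
        csvData = pd.read_csv(file, sep='\t', names=['acol', 'bcol'], header=None,
                              skiprows=range(0, int(size/2)), skipfooter=(int(size/2) - 10))
        average = csvData['bcol'].iloc[:10].astype(float).mean()
        time = int(re.search(r'\d+', file.rpartition('\\')[2]).group())
        return time, average

    # 'files' and max_workers=8 are placeholders; adjust both to your setup.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(read_one, files))

    times = [t for t, _ in results]
    averages = [a for _, a in results]

pool.map preserves the input order, so times[i] and averages[i] always come from the same file. If parsing rather than disk I/O turns out to be the bottleneck, ProcessPoolExecutor is a drop-in alternative.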
rawwar