I am using Python 3 and pandas (pd.read_csv) to read the files. The files have no headers, the separator is ' |, | ', they are not actually .csv files, and the operating system is CentOS.
There are 30,000 files in a folder with a total size of 10GB. Each file has about 50-100 rows and 1,500 columns. In a for loop I read each file with read_csv, do some operations on it, and store the result in a list, so at the end of the process I have a list of dataframes. Only 10 of the columns are relevant, so I filter them with the usecols argument, and since the cell values are strings I convert them to float with df.astype(float). I was wondering how to speed up this process.
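Roughly, the loop looks like this (the path and column indices are placeholders, and the per-file operations are omitted):

```python
import glob

import pandas as pd

folder = "/path/to/folder"                         # placeholder path
relevant = [0, 3, 7, 12, 20, 34, 57, 88, 90, 99]   # placeholder indices of the 10 relevant columns

frames = []
for path in glob.glob(f"{folder}/*"):
    # a separator longer than one character is parsed as a regex,
    # so the pipes are escaped; this also forces the slower Python engine
    df = pd.read_csv(path, sep=r" \|, \| ", header=None,
                     usecols=relevant, engine="python")
    df = df.astype(float)                          # cell values arrive as strings
    # ... per-file operations go here ...
    frames.append(df)
```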
Note that I have to do my operations on each file separately and only then concatenate them all together.
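Once the loop is done, everything is combined in a single step; a minimal sketch, assuming the frames list from the loop above (ignore_index is my assumption about the desired final index):

```python
# combine the per-file results only after all per-file operations are done;
# ignore_index=True gives the combined frame a fresh 0..n-1 index
result = pd.concat(frames, ignore_index=True)
```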
I tried to use modin, but it slowed things down several-fold. Using modin also caused the index in each dataframe to be repeated multiple times, which didn't happen with normal pandas.