
At my company the sales data for each month is stored in a folder as a CSV file. To speed up the reading process in Python I am converting the CSV files to pickle files. Right now I have the following code to read all the individual pickle files and concatenate them into a single DataFrame:

import glob
import pandas as pd

# Enter path of folder
path = "link to the folder"

# find all pickle files
all_files = glob.glob(path + "/*.pkl")
df = pd.concat(
    (pd.read_pickle(file).assign(filename=file) for file in all_files),
    ignore_index=True,
)

I have 38 individual pickle files and their total size is 95 MB. That doesn't seem like a lot to me, yet it still takes 56 seconds to load all the data into the DataFrame.

Is there anything that can speed up this process? Many thanks in advance!

Best, Kav

  • Have you measured where your program is taking most of its time using a profiler? If not, you should. – AKX Aug 31 '22 at 07:36
  • Secondly, try adding `copy=False` to the `concat` call. – AKX Aug 31 '22 at 07:37
  • Also – please consider a database instead of a folder full of CSV files. – AKX Aug 31 '22 at 07:41
  • Thanks for the help! We have a database and I am trying to work with IT to connect the python script to the TABULAR model. For now, I have to do it with the CSV (pickle) files. – Kavish Sewmangal Aug 31 '22 at 09:42
  • "We have a database" makes little sense. You can have a database too by taking those (pickled) CSVs and calling `.to_sql()` against a SQLite connection on them. – AKX Aug 31 '22 at 09:44
  • I haven't considered that actually. So in this case, I would store all pickle (or just CSV?) files to the SQL database and then connect my Python script to the SQL database. would that improve the speed? – Kavish Sewmangal Aug 31 '22 at 09:52
  • Most likely. Then you'll also have all of SQL's querying and reporting capabilities at your hands. – AKX Aug 31 '22 at 09:52
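Following up on AKX's suggestion in the comments, a minimal sketch of the SQLite route might look like the following. The database file `sales.db` and table name `sales` are made-up placeholders, and the one-time load step reuses the glob pattern from the question:

import glob
import sqlite3

import pandas as pd

path = "link to the folder"

# One-time step: load each pickle file and append its rows to a SQLite table.
# "sales.db" and "sales" are placeholder names, not from the question.
all_files = glob.glob(path + "/*.pkl")
with sqlite3.connect("sales.db") as con:
    for file in all_files:
        pd.read_pickle(file).assign(filename=file).to_sql(
            "sales", con, if_exists="append", index=False
        )

# Afterwards the script can query only the rows and columns it needs
# instead of reloading every file on every run.
with sqlite3.connect("sales.db") as con:
    df = pd.read_sql("SELECT * FROM sales", con)

Independently of that, passing `copy=False` to `pd.concat` (also suggested in the comments) may shave some time off the current approach, but profiling the script first, as AKX recommends, will show whether the time is actually spent in `read_pickle`, in `concat`, or elsewhere.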
