
I read a bunch of pickle files with the code below. I want to loop through them and get the length of each file, i.e. how many records it contains.

Two issues:

  1. Concat combines all my DataFrames into one, which takes a long time. Is there any way to just read the length?
  2. If concat is the way to go, how can I get the length of each file once they are all in one DataFrame? I guess the problem is identifying where each file starts and stops. I suspect I could add a column holding each filename and count on that.

What I've tried:

import pandas as pd
import glob, os


files = glob.glob(r'O:\Stack\Over\Flow\*.pkl')

df = pd.concat([pd.read_pickle(fp, compression='xz').assign(New=os.path.basename(fp)) for fp in files])

Any help would be appreciated.

Jonnyboi

2 Answers


Append each DataFrame to a list first and call pd.concat once at the end. Appending or concatenating inside a for loop copies the accumulated data on every iteration, which is quadratic.

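import pandas as pd
import glob, os

# Raw string so the backslashes in the Windows path are not treated as escapes
files = glob.glob(r'O:\Stack\Over\Flow\*.pkl')

dfs = []
lengths = []  # row counts, kept apart from dfs so pd.concat only sees DataFrames

for fp in files:
    df = pd.read_pickle(fp, compression='xz').assign(New=os.path.basename(fp))
    dfs.append(df)
    # or, as @G.Anderson points out, maybe only the length is needed
    lengths.append(len(df))

df = pd.concat(dfs)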
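
If the combined DataFrame is needed anyway, the per-file counts can probably be recovered from the filename column afterwards. A minimal sketch, assuming the tag column is called New as above; the two small in-memory frames and their names (a.pkl, b.pkl) are hypothetical stand-ins for the unpickled files:

```python
import pandas as pd

# Hypothetical stand-ins for frames that would come from pd.read_pickle
frames = {
    "a.pkl": pd.DataFrame({"x": [1, 2, 3]}),
    "b.pkl": pd.DataFrame({"x": [4, 5]}),
}

# Tag each frame with its source file name, as in the question
dfs = [df.assign(New=name) for name, df in frames.items()]
combined = pd.concat(dfs, ignore_index=True)

# Rows per source file, recovered from the tag column
rows_per_file = combined.groupby("New").size()
print(rows_per_file)
```

This still pays the cost of the concat, so it only makes sense when the combined frame is wanted for something else as well.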
Scott Boston
  • The question is a bit confusing and I could be wrong, but the way I read it, OP doesn't actually want the DFs at the end, just the length of each. So `dfs.append(df)` -> `dfs.append(len(df))` and eschew the `concat` altogether. – G. Anderson Dec 15 '22 at 18:21
  • @G.Anderson You could be right. I agree. – Scott Boston Dec 15 '22 at 18:24

If you only want the lengths of the individual DataFrames, the call to concat is unnecessary overhead. Your own code already builds a DataFrame from each file, so you can repurpose it to capture just the lengths.

import pandas as pd
import glob, os


files = glob.glob(r'O:\Stack\Over\Flow\*.pkl')

# a call to assign is also unnecessary, because adding a column doesn't change the length
lens = [len(pd.read_pickle(fp, compression='xz')) for fp in files]

Or, if you want to keep a dictionary mapping each filename to its length, this should work:

lens = {os.path.basename(fp): len(pd.read_pickle(fp, compression='xz')) for fp in files}
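
To see the dictionary approach end to end without the original files, here is a rough, self-contained sketch. The helper name pickle_lengths, the sample file names, and the temporary directory are all assumptions made for illustration; only read_pickle, to_pickle, and the xz compression setting come from the code above.

```python
import os
import tempfile

import pandas as pd


def pickle_lengths(paths, compression="xz"):
    """Return {file name: row count} for each pickled DataFrame in paths."""
    return {
        os.path.basename(fp): len(pd.read_pickle(fp, compression=compression))
        for fp in paths
    }


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files: name -> number of rows to write
    sample = {"first.pkl": 3, "second.pkl": 5}
    paths = []
    for name, n_rows in sample.items():
        fp = os.path.join(tmp, name)
        pd.DataFrame({"value": range(n_rows)}).to_pickle(fp, compression="xz")
        paths.append(fp)

    lens = pickle_lengths(paths)

# A Series makes the counts easy to sort, sum, or inspect
summary = pd.Series(lens, name="rows")
print(summary)
print("total rows:", summary.sum())
```

Each file is still fully unpickled to count its rows, but only one DataFrame is held in memory at a time and nothing is concatenated.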
G. Anderson