Pandas, read Length only of all pickle files in directory

Question

I read a bunch of pickle files with the code below. I want to loop through them and get the length of each file, i.e. how many records it contains.

Two issues:

Concat combines all my dfs into one, which takes a long time. Is there any way to just read the length?
If concat is the way to go, how can I get the length of each file once they all go into one dataframe? I guess the problem is identifying where each file starts and stops. I suspect I could add a column with each filename and count on that.

What I've tried:

import pandas as pd
import glob, os


files = glob.glob(r'O:\Stack\Over\Flow\*.pkl')

df = pd.concat([pd.read_pickle(fp, compression='xz').assign(New=os.path.basename(fp)) for fp in files])

Any help would be appreciated.

It's not entirely clear what you're asking for, do you _also_ want the concatenated dataframe at the end or _only_ a list of all of the individual df lengths? — G. Anderson, Dec 15 '22 at 18:21

Scott Boston · Answer 1 · 2022-12-15T18:25:43.333

2

Append the dataframes to a list first, then call pd.concat once at the end. Appending or concatenating inside a for loop copies the accumulated data on every iteration, which leads to quadratic copying.

import pandas as pd
import glob, os

files = glob.glob(r'O:\Stack\Over\Flow\*.pkl')

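dfs = []   # dataframes, one per file
lens = []  # row counts, one per file

for fp in files:
    df = pd.read_pickle(fp, compression='xz').assign(New=os.path.basename(fp))
    dfs.append(df)
    # or, as @G.Anderson points out, maybe only the length is needed
    lens.append(len(df))

# concatenate once, after the loop
df_all = pd.concat(dfs)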
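
If the concatenated dataframe is wanted as well, the per-file lengths can still be recovered from the New column that was added with assign. The following is only a minimal sketch of that idea: the temporary directory, the file names first.pkl and second.pkl, and the sample column a are made up for illustration and are not part of the question.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two small xz-compressed pickles in a temp directory.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"a": range(3)}).to_pickle(os.path.join(tmp_dir, "first.pkl"), compression="xz")
pd.DataFrame({"a": range(5)}).to_pickle(os.path.join(tmp_dir, "second.pkl"), compression="xz")

files = sorted(glob.glob(os.path.join(tmp_dir, "*.pkl")))

# Tag each frame with its source file name, then concatenate once.
dfs = [
    pd.read_pickle(fp, compression="xz").assign(New=os.path.basename(fp))
    for fp in files
]
combined = pd.concat(dfs, ignore_index=True)

# Number of rows per source file, recovered from the tag column.
rows_per_file = combined.groupby("New").size()
print(rows_per_file)
```

This still pays the full cost of loading and concatenating every file, so it only makes sense when the combined dataframe is needed anyway.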

edited Dec 15 '22 at 18:25

answered Dec 15 '22 at 18:16

Scott Boston


1

The question is a bit confusing and I could be wrong, but the way I read it OP doesn't actually want the DFs at the end, they just want the length of each. So `dfs.append(df)` -> `dfs.append(len(df))` and eschew the `concat` altogether – G. Anderson Dec 15 '22 at 18:21
@G.Anderson You could be right. I agree. – Scott Boston Dec 15 '22 at 18:24

G. Anderson · Answer 2 · 2022-12-15T20:44:42.480

2

If you only want the lengths of the individual dataframes, then the call to concat is entirely unnecessary overhead. Your own code already builds a dataframe from each file, so you can use those to capture just the lengths.

import pandas as pd
import glob, os


files = glob.glob(r'O:\Stack\Over\Flow\*.pkl')

# a call to assign is also unnecessary, because adding a column doesn't change the length
lens = [len(pd.read_pickle(fp, compression='xz')) for fp in files]

Or, if you want a dictionary mapping each filename to its length, this should work:

lens = {os.path.basename(fp):len(pd.read_pickle(fp, compression='xz')) for fp in files}
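
If a dataframe is preferred over a dictionary, one possible way to convert it is sketched below. The contents of lens here are hypothetical stand-ins for the result of the dictionary comprehension above, and the column names file and length are arbitrary choices.

```python
import pandas as pd

# Hypothetical result of the dictionary comprehension: filename -> row count.
lens = {"first.pkl": 3, "second.pkl": 5}

# One row per file: the keys go into one column, the lengths into another.
lens_df = pd.DataFrame(list(lens.items()), columns=["file", "length"])
print(lens_df)
```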

edited Dec 15 '22 at 20:44

answered Dec 15 '22 at 18:30

G. Anderson


Thanks Gary! This is what I am looking for! Although `lens` doesn't have the file name, so I am not sure which length belongs to which file. – Jonnyboi Dec 15 '22 at 19:23
See the edit, using a dictionary comprehension instead of a list comprehension – G. Anderson Dec 15 '22 at 20:45
thanks! how would I convert this to a dataframe? – Jonnyboi Dec 16 '22 at 02:08
[Convert python dictionary into a dataframe](https://stackoverflow.com/questions/18837262/convert-python-dict-into-a-dataframe) – G. Anderson Dec 16 '22 at 18:11
