
I have many files, which I split into groups of five. I would like to loop through each group (chunk). I don't want to change the index one by one, since there are over 500 groups. Is there a way to loop through them?

import glob
import numpy as np
import pandas as pd

path = r'/Users/Documents/Data'

files= sorted(glob.glob(path + '/**/*.dat', recursive=True))

chunks = [files[x:x+5] for x in range(0, len(files), 5)]  # group 5 files at a time

# chunks looks like:
# [['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat', 'file5.dat'],
#  ['file6.dat', 'file7.dat', 'file8.dat', 'file9.dat', 'file10.dat'], [...]]

This works, but I do not want to manually change the index 500 times.

df = []
for i in chunks[0]:
    indat = pd.read_fwf(i, skiprows=4, header=None, engine='python')
    df.append(indat)
indat = pd.concat(df, axis=0, ignore_index=False)

I tried to replace the hard-coded index with a loop.

df = []
for i, file in enumerate(chunks, 1):
    indat = pd.read_fwf(file, skiprows=4, header=None, engine='python')
    df.append(indat)

My attempt gave me the error below:


  File "/Users/Documents/test.py", line 30, in <module>
    indat = pd.read_fwf(file, skiprows=4, header=None, engine='python')

  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 782, in read_fwf
    return _read(filepath_or_buffer, kwds)

  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 431, in _read
    filepath_or_buffer, encoding, compression

  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/common.py", line 200, in get_filepath_or_buffer
    raise ValueError(msg)

ValueError: Invalid file path or buffer object type: <class 'list'>
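(For context on the traceback: each element of `chunks` is itself a list of five paths, and `pd.read_fwf` accepts only a single path or buffer, hence the `<class 'list'>` error. A minimal runnable sketch of the nested loop that was intended, using synthetic fixed-width files in a temp directory rather than the real `.dat` data:)

```python
import os
import tempfile

import pandas as pd

# Build two tiny fixed-width files so the loop runs end to end.
tmpdir = tempfile.mkdtemp()
paths = []
for n in range(2):
    p = os.path.join(tmpdir, f'file{n}.dat')
    with open(p, 'w') as fh:
        fh.write('h1\nh2\nh3\nh4\n')  # four header lines, skipped below
        fh.write('1 2\n3 4\n')        # fixed-width data rows
    paths.append(p)

chunks = [paths]  # one group here; the real data has ~500 groups

frames = []
for chunk in chunks:      # outer loop: each group of files
    for fname in chunk:   # inner loop: each single file in the group
        frames.append(pd.read_fwf(fname, skiprows=4, header=None))
df = pd.concat(frames, ignore_index=True)
```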
Tina

1 Answer


If you want all the data in one dataframe

  • There is no reason to chunk it into groups of 5
  • Use pathlib, which is part of the standard library and treats paths as objects, not strings
  • Create a list of dataframes with [pd.read_fwf(file) for file in files] and concat them.
  • axis=0 and ignore_index=False are not included, because they are the default values
from pathlib import Path
import pandas as pd

f_path = Path('c:/Users/.../Documents/Data')
files = sorted(f_path.glob('**/*.dat'))  # sorted already returns a list

df = pd.concat([pd.read_fwf(file, skiprows=4, header=None, engine='python') for file in files])

If you want a dataframe for each group

  • Create a dict of dataframes using a dict-comprehension
df_dict = {
    f'group_{i}': pd.concat([pd.read_fwf(file, skiprows=4, header=None, engine='python') for file in chunk])
    for i, chunk in enumerate(chunks)
}
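Each group is then addressed by its key. A small self-contained sketch, using hypothetical stand-in dataframes in place of the ones read from disk:

```python
import pandas as pd

# Stand-in for the dict built above: two tiny one-column groups.
df_dict = {f'group_{i}': pd.DataFrame({'a': [i, i + 1]}) for i in range(2)}

first = df_dict['group_0']                       # access one group by key
shapes = {k: v.shape for k, v in df_dict.items()}  # iterate over all groups
```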
Trenton McKinney
  • Thank you, the second option is what I want. When I `print(df_dict)` I get `group_0, group_1, group_10, group_2`; is there a way to sort them in numerical order? @TrentonMcKinney – Tina Apr 27 '20 at 23:35
  • If you update to Python 3.6 or higher, dicts automatically keep insertion order, or try a solution from here to order the dict: https://stackoverflow.com/questions/15711755/converting-dict-to-ordereddict – Trenton McKinney Apr 28 '20 at 00:45
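(One simple way to make `sorted()` agree with numeric order, not confirmed by the answer above but a common idiom, is to zero-pad the group number when building the keys, e.g. `f'group_{i:03d}'`:)

```python
# Zero-padded keys: 'group_002' sorts before 'group_010' lexicographically,
# so sorted() returns the groups in numeric order.
keys = [f'group_{i:03d}' for i in range(12)]
```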