
I have around 500 datasets in a folder that I wish to concatenate. They all have the same column names: 'Year', 'ZIP Code', 'Var1', 'Var2', 'Var3'.

I used the following code to loop through the files in the folder:

import os
import pandas as pd

directory = '/MyDirectory'
os.chdir(directory) 
files = os.listdir()

for f in files:
    if f.endswith('.csv'):
        combined_dataset = pd.concat([pd.read_csv(f)])

When I output the dataset, only the dataset for the year 2019 and zip code 000001 appears. I printed the whole list of files and the datasets I'm seeking to concatenate are all there. Any insight into why this might be the case? Thanks!

panino00

1 Answer

When you want to concatenate df1 with df2 you have to pass both of them to concat:

pd.concat([df1, df2])

Since your files all share the same columns, the default axis=0 stacks them row-wise (axis=1 would instead place them side by side).

I recommend creating a new, empty DataFrame and concatenating each loaded file onto it:

combined_dataset = pd.DataFrame()

for f in files:
    if f.endswith('.csv'):
        combined_dataset = pd.concat([combined_dataset, pd.read_csv(f)])
nunodsousa
  • This is certainly correct: concat on a list of one will not grow the DataFrame, it only overwrites it each iteration. Having said that, [NEVER grow a DataFrame!](https://stackoverflow.com/a/56746204/15497888) The time complexity is quadratic and can be extremely slow, especially in combination with `read_csv`, which is already a very slow read operation. – Henry Ecker Nov 05 '21 at 21:34
  • You can store all the new dataframes in a list and then concatenate them at the end, as in the sketch below. Easy peasy :D – nunodsousa Nov 05 '21 at 22:00
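
A minimal sketch of that list-then-concat approach, assuming the CSVs are found with a glob pattern against the '/MyDirectory' path from the question (the pattern and the `frames` variable name are illustrative):

import glob
import pandas as pd

# Read each CSV into its own DataFrame and collect them in a list.
frames = [pd.read_csv(path) for path in glob.glob('/MyDirectory/*.csv')]

# Concatenate once at the end; a single concat avoids the quadratic cost
# of growing the DataFrame inside the loop.
combined_dataset = pd.concat(frames, ignore_index=True)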