
I have around 500 datasets in a folder that I wish to concatenate. They all have the same column names: 'Year', 'ZIP Code', 'Var1', 'Var2', 'Var3'.

I used the following code to loop through the files in the folder:

import os
import pandas as pd

directory = '/MyDirectory'
os.chdir(directory) 
files = os.listdir()

for f in files:
    if f.endswith('.csv'):
        combined_dataset = pd.concat([pd.read_csv(f)])

When I output the dataset, only the dataset for the year 2019 and zip code 000001 appears. I printed the whole list of files and the datasets I'm seeking to concatenate are all there. Any insight into why this might be the case? Thanks!

panino00

1 Answer

When you want to concatenate df1 with df2 you have to pass both of them to concat:

pd.concat([df1, df2])

Since your files all share the same columns, the default axis=0 stacks them row-wise (axis=1 would instead place them side by side).

I recommend creating a new, empty DataFrame and concatenating each loaded file onto it:

combined_dataset = pd.DataFrame()

for f in files:
    if f.endswith('.csv'):
        combined_dataset = pd.concat([combined_dataset, pd.read_csv(f)])
nunodsousa
  • This is certainly correct: concat on a list of one will not grow the DataFrame, it only overwrites it each iteration. Having said that, [NEVER grow a DataFrame!](https://stackoverflow.com/a/56746204/15497888) The time complexity is quadratic and can be extremely slow, especially in combination with `read_csv`, which is already a very slow read operation. – Henry Ecker Nov 05 '21 at 21:34
  • You can store all the new dataframes in a list and then concatenate them at the end, as in the sketch below. Easy peasy :D – nunodsousa Nov 05 '21 at 22:00
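
A minimal sketch of that list-then-concat approach, assuming the CSVs are found with a glob pattern against the '/MyDirectory' path from the question (the pattern and the `frames` variable name are illustrative):

import glob
import pandas as pd

# Read each CSV into its own DataFrame and collect them in a list.
frames = [pd.read_csv(path) for path in glob.glob('/MyDirectory/*.csv')]

# Concatenate once at the end; a single concat avoids the quadratic cost
# of growing the DataFrame inside the loop.
combined_dataset = pd.concat(frames, ignore_index=True)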