I'm trying to read multiple CSV files that have the same structure (column names) and are located in several folders. My main purpose is to concatenate these files into one pandas DataFrame. Please find the folder layout attached below; each folder contains 5 CSV files. Is there a predefined function or something that can help?
- There is the pd.concat() method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat – Daniel Labbe Feb 21 '19 at 15:49
- Similar to: https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe – vercelli Feb 21 '19 at 16:03
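(For reference, pd.concat() just stacks a list of DataFrames; a toy example with two assumed frames:)
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})
combined = pd.concat([a, b], ignore_index=True)  # rows of a followed by rows of b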
5 Answers
You might use glob.glob() to find all the CSVs and then concat them all. Since the files sit in several folders, the recursive form of the pattern picks them all up:
import glob
import pandas as pd

# '**' descends into every subfolder when recursive=True
csv_paths = glob.glob('**/*.csv', recursive=True)
dfs = [pd.read_csv(path) for path in csv_paths]
df = pd.concat(dfs)
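An equivalent version with pathlib (a minimal sketch; Path.rglob() performs the same recursive search):
from pathlib import Path
import pandas as pd

# rglob('*.csv') walks every subfolder under the current directory
df = pd.concat(pd.read_csv(p) for p in Path('.').rglob('*.csv'))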

Frenzy Kiwi
Using os.walk() and pd.concat():
import os
import pandas as pd

outdir = 'YOUR_INITIAL_PATH'  # the root folder, as a string
df_final = pd.DataFrame(columns=['column1', 'column2', 'columnN'])  # creates an empty df with the desired structure
for root, dirs, filenames in os.walk(outdir):
    for f in filenames:
        if f.endswith('.csv'):
            df_temp = pd.read_csv(os.path.join(root, f))  # os.path.join works on any OS
            df_final = pd.concat([df_final, df_temp])
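A note on the pattern above: calling pd.concat() inside the loop copies df_final on every iteration, so for a large number of files it is usually faster to collect the pieces in a list and concatenate once at the end.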

Daniel Labbe
- @ikbelbenabdessamad, there is no condition limiting the number of files read other than the file extension. I would check a few things in your code: 1) the path is not a list, it's a string; this is not clear in your code snippet; 2) if you are using Windows, I would prefer to use escaped backslashes, as in path = 'C:\\Users\\benabdi1\\CsvFilesProjects\\RoomData\\2016'; – Daniel Labbe Feb 22 '19 at 10:19
- 3) the file extension check is only needed if there are other files in the target directories; if there aren't, it's probably better to remove the if, since you may encounter different case variants (.XLS, for instance). 4) did you replace the pd.read_csv() method with pd.read_excel()? – Daniel Labbe Feb 22 '19 at 10:19
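(As a side note on point 2, a raw string literal is another way to avoid the escaping altogether:)
path = r'C:\Users\benabdi1\CsvFilesProjects\RoomData\2016'  # raw string: backslashes need no escaping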
You can use os.walk() to iterate over the files in a directory tree (example). pd.read_csv() will read a single file into a DataFrame, and pd.concat(df_list) will concatenate all the DataFrames in df_list together. I don't believe there is a single method that combines all of the above for your convenience, but the pieces compose easily, as the sketch below shows.
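Putting those pieces together (a minimal sketch, assuming all the CSVs live somewhere under one root folder; 'YOUR_ROOT_FOLDER' is a placeholder):
import os
import pandas as pd

df_list = []
# walk the whole tree under the root folder
for root, dirs, files in os.walk('YOUR_ROOT_FOLDER'):
    for name in files:
        if name.endswith('.csv'):
            df_list.append(pd.read_csv(os.path.join(root, name)))
df = pd.concat(df_list)  # one DataFrame from all the files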

Leshiy
Frenzy Kiwi gave you the right answer. An alternative could be using dask. Let's say your folder structure is:
data
├── 2016
│   ├── file01.csv
│   ├── file02.csv
│   └── file03.csv
├── 2017
│   ├── file01.csv
│   ├── file02.csv
│   └── file03.csv
└── 2018
    ├── file01.csv
    ├── file02.csv
    └── file03.csv
Then you can just read all of them via:
import dask.dataframe as dd

# one glob picks up every CSV one level below data/
df = dd.read_csv("data/*/*.csv")
# convert to pandas via
df = df.compute()
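Note that dd.read_csv() is lazy: it only builds a task graph, and the files are actually read when .compute() is called, so any column selection or filtering applied before .compute() can reduce how much data is pulled into memory.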

rpanai
This is the best solution to this problem:
import os
import pandas as pd

def nested_files_to_df(path, ext):
    paths = []
    all_data = []
    #--- Putting all matching file names in one list ---#
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(tuple(ext)):
                paths.append(os.path.join(root, file))
    #--- Reading and merging all the existing Excel files into one dataframe ---#
    for f in paths:
        all_data.append(pd.read_excel(f))
    # DataFrame.append() was removed in pandas 2.0, so collect in a list and concat once
    return pd.concat(all_data, ignore_index=True)
Calling the function:
df = nested_files_to_df('Your main folder root', [".xls", ".XLS", ".xlsx"])

Ikbel