I cannot fully understand the problem you encounter unless you provide the structure of the csvs you use and the result you get.
Can you provide a small example of csv files and the unwanted result so we could understand what the problem is?
To provide a small example, you could use df.head(2)
on two sample csv files.
Update
Suggested Solution
Here is a working example, provided by the example given here
The sample i used is:
df1 = pd.DataFrame(np.asarray([[1, 1], [2, 2]]), columns=['A', 'B'])
df2 = pd.DataFrame(np.asarray([[3, 3], [4, 4]]), columns=['A', 'B'])
df3 = pd.DataFrame(np.asarray([[5, 5], [6, 6]]), columns=['A', 'B'])
df1.to_csv('1.csv')
df2.to_csv('2.csv')
df3.to_csv('3.csv')
The resulting csvs:
A B
0 1 1
1 2 2
A B
0 3 3
1 4 4
A B
0 5 5
1 6 6
The concatenation code:
import os
import glob
import pandas as pd
path = r'.'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in glob.glob(path+'\*.csv')]
df = pd.DataFrame()
for file_ in all_files:
file_df = pd.read_csv(file_, index_col=0, header=0)
file_df['file_name'] = file_
df = df.append(file_df)
print(df)
The concatenation result:
A B file_name
0 5 5 ./3.csv
1 6 6 ./3.csv
0 1 1 ./1.csv
1 2 2 ./1.csv
0 3 3 ./2.csv
1 4 4 ./2.csv
Notice that there are two differences from your code:
- The header parameter is set to 0 instead of None.
- The index_col parameter is set to 0 instead of 'None'.
Explanation
I guess the problems you encountered are related to these parameters so i will explain their usage.
Say you have the following csv file:
,A,B
0,1,1
1,2,2
Notice that the first row is the header, and the first row is the index.
When one uses pandas.read_csv(...)
function, than they will get the following result:
Unnamed: 0 A B
0 0 1 1
1 1 2 2
This means that pandas took the first column, which is the index column, and treated it as a regular data column.
To avoid this, one can to set the index_col parameter to 0.
This way pandas will know to parse this column as index.
So the result of pandas.read_csv(..., index_col=0)
will result in the following:
A B
0 1 1
1 2 2
Now if one reads a few csv files using this method, their concatenation will result in the wanted appearance which does not include their former index, as was shown in the example i provided.
Additionally, if i would've set the header parameter to be None, than the entire first row would be considered to be data, and we would've received the following unwanted result:
0 1 2
0 NaN A B
1 0.0 1 1
2 1.0 2 2
This would lead to an unwanted result in the concatenated dataframe as well.
Update 2
(The op has mentioned that the new column contains the path of the csv files instead of the basename of the files)
The reason for this problem is because in the line file_df['file_name'] = file_
.
you used file_ which iterates over all_files.
And indeed all_files contains the path to your csv files.
You kept the base names of your files in the names variable, so in order to get the only the base names of the csv files in the new column i suggest the following changes to the code presented above:
path = r'.'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in all_files]
df = pd.DataFrame()
for file_, name in zip(all_files, names):
file_df = pd.read_csv(file_, index_col=0)
file_df['file_name'] = name
df = df.append(file_df)
print(df)
The changes i made are:
- Corrected the parsing of the base names
names = [os.path.basename(x) for x in all_files]
- Used the base names in the new column
file_df['file_name'] = name