0

Hello everyone i am trying this piece of code but i didn't get my desired output,please help me for getting proper result .

import os
import glob   
import pandas as pd


path = r'filespath'

all_files = glob.glob(os.path.join(path, "*.csv"))

names = [os.path.basename(x) for x in glob.glob(path+'\*.csv')]

df = pd.DataFrame()

for file_ in all_files:
    file_df = pd.read_csv(file_,sep=';', parse_dates=[0], infer_datetime_format=True,header=None )
    file_df['file_name'] = file_
    df = df.append(file_df)
    print(df)

I want to add multiple CSV files in one csv file and add csv file name in first column , if anyone have idea please share .thanks

Gal Avineri
  • 524
  • 5
  • 13
  • looking forward for solution ,please if anyone can help –  Nov 16 '18 at 15:02
  • Hi waquar ali, welcome to SO. It will be great if you can produce a [mcve](https://stackoverflow.com/help/mcve) – rpanai Nov 16 '18 at 17:58

2 Answers2

2

I cannot fully understand the problem you encounter unless you provide the structure of the csvs you use and the result you get.

Can you provide a small example of csv files and the unwanted result so we could understand what the problem is?

To provide a small example, you could use df.head(2) on two sample csv files.

Update

Suggested Solution

Here is a working example, provided by the example given here

The sample i used is:

df1 = pd.DataFrame(np.asarray([[1, 1], [2, 2]]), columns=['A', 'B'])
df2 = pd.DataFrame(np.asarray([[3, 3], [4, 4]]), columns=['A', 'B'])
df3 = pd.DataFrame(np.asarray([[5, 5], [6, 6]]), columns=['A', 'B'])
df1.to_csv('1.csv')
df2.to_csv('2.csv')
df3.to_csv('3.csv')

The resulting csvs:

   A  B
0  1  1
1  2  2

   A  B
0  3  3
1  4  4

   A  B
0  5  5
1  6  6

The concatenation code:

import os
import glob
import pandas as pd

path = r'.'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in glob.glob(path+'\*.csv')]
df = pd.DataFrame()

for file_ in all_files:
    file_df = pd.read_csv(file_, index_col=0, header=0)
    file_df['file_name'] = file_
    df = df.append(file_df)
print(df)

The concatenation result:

   A  B file_name
0  5  5   ./3.csv
1  6  6   ./3.csv
0  1  1   ./1.csv
1  2  2   ./1.csv
0  3  3   ./2.csv
1  4  4   ./2.csv

Notice that there are two differences from your code:

  1. The header parameter is set to 0 instead of None.
  2. The index_col parameter is set to 0 instead of 'None'.

Explanation

I guess the problems you encountered are related to these parameters so i will explain their usage.

Say you have the following csv file:

,A,B
0,1,1
1,2,2

Notice that the first row is the header, and the first row is the index.

When one uses pandas.read_csv(...) function, than they will get the following result:

   Unnamed: 0  A  B
0           0  1  1
1           1  2  2

This means that pandas took the first column, which is the index column, and treated it as a regular data column. To avoid this, one can to set the index_col parameter to 0.
This way pandas will know to parse this column as index. So the result of pandas.read_csv(..., index_col=0) will result in the following:

   A  B
0  1  1
1  2  2

Now if one reads a few csv files using this method, their concatenation will result in the wanted appearance which does not include their former index, as was shown in the example i provided.

Additionally, if i would've set the header parameter to be None, than the entire first row would be considered to be data, and we would've received the following unwanted result:

     0  1  2
0  NaN  A  B
1  0.0  1  1
2  1.0  2  2

This would lead to an unwanted result in the concatenated dataframe as well.

Update 2

(The op has mentioned that the new column contains the path of the csv files instead of the basename of the files)

The reason for this problem is because in the line file_df['file_name'] = file_.
you used file_ which iterates over all_files.
And indeed all_files contains the path to your csv files.
You kept the base names of your files in the names variable, so in order to get the only the base names of the csv files in the new column i suggest the following changes to the code presented above:

path = r'.'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in all_files]
df = pd.DataFrame()

for file_, name in zip(all_files, names):
    file_df = pd.read_csv(file_, index_col=0)
    file_df['file_name'] = name
    df = df.append(file_df)
print(df)

The changes i made are:

  1. Corrected the parsing of the base names names = [os.path.basename(x) for x in all_files]
  2. Used the base names in the new column file_df['file_name'] = name
Gal Avineri
  • 524
  • 5
  • 13
  • for your CSV file what results did you get ,please can you send screenshot? –  Nov 16 '18 at 17:12
  • Hi @Gal given that you updated the answer you should state so. Using an **UPDATE** before the added text. – rpanai Nov 16 '18 at 18:17
  • Thank you for detail ,its very helpful but still have one more problem. in filename column i get very strange results ,i get in filename columns my direction path,but i require only csv file name in file name column –  Nov 16 '18 at 18:26
  • @waqarali I see :) i've added the solution to this problem in the answer :) – Gal Avineri Nov 16 '18 at 18:57
  • @Gal Avineri Thanks , you make my life easy :) your solution works –  Nov 16 '18 at 19:11
  • I'm glad i could've helped you! :D – Gal Avineri Nov 16 '18 at 19:23
1

I will suggest you to create a function for read and append the filename then you can loop through all_files

import pandas as pd
import numpy as np
import os
import glob

path = 'folder'
if not os.path.exists(path):
    os.makedirs(path)


def fun(fn):
    df = pd.read_csv(fn)
    # edit so file_name is the first col
    cols = df.columns.tolist()
    cols = ["file_name"] + cols
    # end edit
    df["file_name"] = fn
    return df[cols]

N = 100
for i in range(10):
    df = pd.DataFrame(np.arange(i*N, (i+1)*N))
    df.to_csv("{}/file_{:02}.csv".format(path, i), index=False)

all_files = sorted(glob.glob(os.path.join(path, "*.csv")))

dfs = [fun(fn) for fn in all_files]
df = pd.concat(dfs, ignore_index=True)
df.to_csv("single_file.csv", index=False)
rpanai
  • 12,515
  • 2
  • 42
  • 64