1

I have to make an application in which I have to import all the excel files in the given folder and add it to a dataframe. The dataframe should look as shown:

Expected Data Frame

As seen in the image one of the columns for the dataframe is the name of the file.

I have successfully added that column in the final dataframe and the code is as follows:

import pandas as pd
import os
import shutil
import re  
path = 'C:/Users/Administrator/Desktop/Zerodha/Day2'
lst = os.listdir(path)
files = [os.path.join(path,x) for x in lst]
print(lst)
dataframes_lst = []

for file in files:
 filename = file.split('/')[-1]
 dataframe = pd.read_csv(file, usecols=[0,4], names ["date",filename],index_col=["date"])
dataframes_lst.append(dataframe)

df = pd.concat(dataframes_lst, axis=1)

print(df)


df.to_csv('data.csv')

The dataframe which is obtained using this code is as displayed:

Obtained Data Frame

For reference I will attach the snippet of one of the excel files:

Excel snippet

Also as seen there are many nan values obtained. I tried to remove them by using pd.dropna(inplace = True) function and also by doing as suggested in this post:

  1. How to drop rows of Pandas DataFrame whose value in a certain column is NaN

But the resultant dataframe still contains the nan values.

Huzefa Sadikot
  • 561
  • 1
  • 7
  • 22

2 Answers2

2

Regarding

My doubt is that how do I loop through all the files in the directory and extract data of each file in the required format

You can loop through all the files in the directory and extract the data and filename as the header of the dataframe as such:

import pandas as pd
import os

path = './data'
lst = os.listdir('./data/')
files = [os.path.join(path,el) for el in lst]

output_of_files

and the structure of example.xlsx is:

excel_structure

dataframes_lst = []

for file in files:
    filename = file.split('/')[-1]
    dataframe = pd.read_excel(file, usecols=[3], names=[filename])
    dataframes_lst.append(dataframe)

df = pd.concat(dataframes_lst, axis=1)
print(df)

Here, the dataframes are concatenated along the axis=1 and the output of print(df)is

output_final_df

0

Try this:

import pandas as pd    
from pathlib import Path
read_path = Path('C:/Users/Administrator/Desktop/Zerodha/Day2')
    
df = pd.concat([pd.read_csv(path) for path in read_path.glob('*.csv')])

If you want to read from excel just use read_excel and change the pattern to '*.xlsx'

skarit
  • 11
  • 1
  • 3
  • I am getting the data but not as expected in the Expected Data Frame. Your code gives me the data in the format mentioned in the Excel snippet. I need it in the format as mentioned in the Expected Data Frame – Huzefa Sadikot Nov 18 '20 at 09:33