1

I'm trying to read out data from multiple csv files in a folder, selecting specific column, adding the file name as an additional column and finally writing this to an existing excel.

The print output does what I'd like to do, but I don't seem to get the part working, where the data are attached to the excel sheet.

import requests
import os
import pandas as pd
import time
import xlwt
import glob


files = glob.glob("*.csv") 

writer = pd.ExcelWriter('output.xls', engine='xlsxwriter')
writer.close()

for i in files:
    
    df=pd.read_csv(i, usecols = ['column1', 'column2'])
    df['Filename Column'] = i.split(".")[0]
    df.to_csv(i.split(".")[0]+".csv")
    print(df)
    df_combined = pd.concat([df])
    reader = pd.read_excel(r'output.xls')
    df_combined.to_excel(writer, index=False, startrow=len(reader)+1)

The Excel remains empty, when I do it like that.

nmy2015
  • 133
  • 1
  • 6

1 Answers1

0

This code is what I've come up with.

import glob
import pandas as pd
from typing import List


if __name__ == "__main__":
    final_df: pd.DataFrame = pd.read_excel("dataset.xlsx", engine="openpyxl")

    files: List = glob.glob("*.csv")
    for file in files:
        added_df: pd.DataFrame = pd.read_csv(file, usecols=["column1", "column3"])
        added_df["Filename"] = file.split(".")[0]
        final_df: pd.DataFrame = pd.concat([final_df, added_df])

    final_df.to_excel("dataset.xlsx", index=False)
GregoirePelegrin
  • 1,206
  • 2
  • 7
  • 23
  • Thanks Gregoire, it seems to somewhat work except the final_df.to_excel Part. Here's the errors I get in the log: File "C:\Users\username\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper return func(*args, **kwargs) File "C:\Users\username\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper return func(*args, **kwargs) File "C:\Users\username\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\generic.py", line 2363, in to_excel – nmy2015 Mar 07 '23 at 12:21
  • Do you have any more information (output, error traceback, ...) as to why it does not work? – GregoirePelegrin Mar 07 '23 at 12:22
  • formatter = ExcelFormatter( File "C:\Users\username\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\formats\excel.py", line 521, in __init__ if not isinstance(df, DataFrame): TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union – nmy2015 Mar 07 '23 at 12:24
  • 1
    I've just updated the code, this is currently working on my end. – GregoirePelegrin Mar 07 '23 at 13:27
  • Thanks, this does the trick now. Only thing that I'm not getting to work now anymore is the part, where I add the file name in an extra column. Tried it this way in the for loop: added_df['TID']:pd.DataFrame = i.split(".")[0] added_df.to_csv(i.split(".")[0]+".csv") – nmy2015 Mar 08 '23 at 07:07
  • And after doing the same again with a new clean xls file, despite not changing the code, I get the following error... raise BadZipFile("File is not a zip file") zipfile.BadZipFile: File is not a zip file – nmy2015 Mar 08 '23 at 08:32
  • Regarding the error, mind that I use `xlsx` files here, and not `xls` files. [This is not the same thing, and tends to produce this exact error](https://stackoverflow.com/a/65267285/16521194) – GregoirePelegrin Mar 08 '23 at 08:43
  • 1
    Ok, it seems, it works, when I just create a new xls file again. I used the old one for retrying and just deleted data out of it, which lead to this zip-file-error. – nmy2015 Mar 08 '23 at 08:54
  • 1
    I've also just edited my answer for your `"Filename"` column to be added. You had the first line right, while the second was useless, I don't know where you put them though, so it could be where was the mistake. Even though the second line was useless, I can't not point to you that `i.split(".")[0]+".csv"` is redundant. `i = "filename.csv"`, thus `i.split(".")[0] = "filename"`, hence `i.split(".")[0]+".csv" = "filename.csv" = i`. – GregoirePelegrin Mar 08 '23 at 08:55