0

I have a folder full of CSV's amounting to a total of 50 files that contain a lot of different data, I am trying to combine them into files that will contain around 500000 entries, as I'll be taking around 10000 rows from every file if possible and combine put them into the new file, like repeat this process for 50 files to make a single file

Right now, I have this code that I've written which I am still stuck with and can not figure out how to continue from here

import os
import pandas as pd
from tkinter import filedialog
from tkinter import *


class ExcelHandler:
    def __init__(self):
        self.path = ""
        self.files = []
        self.firstFile = True

    def selectDirectory(self):
        """This function opens a file dialog to select a directory. It then sets the path to the directory and gets all the files in the directory."""
        root = Tk()
        root.withdraw()
        self.path = filedialog.askdirectory(title="Select a directory")
        self.files = os.listdir(os.path.abspath(self.path))
        self.xlsx_files = [i for i in self.files if i.endswith(".xlsx")]
        self.csv_files = [i for i in self.files if i.endswith(".csv")]

    def createNewSamples(self, fromRows, toRows):
        """This function opens csv's one by one and takes 10000 entries from each and adds them to new dataframe. after it exports the combined sheet"""
        if self.path and self.files:
            try:
                print(
                    "Creating new file containing 10000 entries from all the csv's..."
                )
                initial, final = fromRows, toRows
                while True:
                    try:
                        df_total = pd.DataFrame(
                            columns=[
                                "Company Name",
                                "Email Address",
                                "Address",
                                "City",
                                "State",
                                "Zipcode",
                                "Phone",
                                "Fax",
                                "Sic Code",
                                "Sic Description",
                                "Website",
                            ]
                        )
                        for file in self.csv_files:
                            if self.firstFile:
                                df = pd.read_csv(
                                    file,
                                    skiprows=initial,
                                    # chunksize=final,
                                    nrows=10000,
                                    low_memory=False,
                                )
                                self.firstFile = False
                            else:
                                df = pd.read_csv(
                                    file,
                                    skiprows=initial,
                                    # chunksize=final,
                                    nrows=10000,
                                    low_memory=False,
                                )
                            # df = df[initial:final]
                            df_total = pd.concat([df_total, df], ignore_index=True)
                        self.firstFile = True
                        df_total.to_csv(f"newData/Mixed-Data{final}.csv", index=False)
                        print(f"Mixed-Data{final}.csv created!")
                        initial, final = fromRows + initial, toRows + final
                    except:
                        print("Can not continue for some reason.")
                        break
            except:
                print("Error creating the csv.")
                return
        else:
            print("Please select a directory first using selectDirectory().")

Is there any way I can take 10000 or as many as the files have remaining for the last few rows sake, from each file, combine them into a single file and export it, and repeat the process till I run out of rows in every file. I would also like to keep the Header column as the same for every output file as all of them have the same header so combined files should keep the same header too, hence there I still can't figure out the way to do this

To Add on, I am using pandas to handle xlsx and csv files, and tkinter to help the people using this library to select the folder they want form where all the csv's and xlsx's are selected and put in an array

ExcelHandler is just a name for the class, when i was initially working on it I just named it this for no specific or special reason

  • Welcome to SO. Please update your question with the responses to my following points: 1) Does all CSV files have the same header/structure? 2) Why do you use `pandas` and `tkinter` here? 3) Your code does not work because it is not a minimal working example. There is IMHO no need for having tkinter or pandas in it. 4) What is a `ExcelHandler`? No Excel involved here. 5) Depending on your details I would say you could just concat the files together independed from that fact that they are CSV. Maybe https://stackoverflow.com/q/13613336/4865723 – buhtz Mar 03 '22 at 13:42
  • 1
    Updated the question accordingly, and I have no idea how to just concat the files together, can you please provide me with any example where something like the thing am trying to do is being achieved? – Saqlain Shah Mar 03 '22 at 13:59
  • Your code misses a `__main__` part. Please read https://stackoverflow.com/help/minimal-reproducible-example . `tkinter` is still part of that code. This does not help your question. About concating I gave you a link in my first comment. Have you checked that? – buhtz Mar 03 '22 at 14:22
  • 1
    the main function is missing because of the fact that this is just a class which is supposed to be imported in other code and be used there like any library. – Saqlain Shah Mar 03 '22 at 14:41
  • 1
    Yes I did check the link you gave me, I couldn't understand how I'll change my code following that example, yk my requirements are a bit werid that I can not totally grasp my head around – Saqlain Shah Mar 03 '22 at 14:41
  • Please read again very carefully https://stackoverflow.com/help/minimal-reproducible-example . With all respect and looking onto my own learning history on StackOverflow: "Grasping your head around" is exactly what you have to do here. While creating a very good and easy to understand question with a minimal(!) working(!) example (not copied from your production code!) you will learn and possible solve your problem by yourself. If not we will pleased to help you but only if you ask "good" questions. Think and clear up your "werid requirements" then we can help you. – buhtz Mar 03 '22 at 14:48

1 Answers1

0

My solution does open two CSV files, read them line by line, "glue" the lines together and write them to a third result file. It also checks for an expected header and takes care that this header, if it appears in each input file, is not repeated in the result file.

#!/usr/bin/env python3
import pathlib

def generate_sample_data():
    """This will generate two csv files."""

    with pathlib.Path('one.csv').open('w') as f:
        f.write('colA;colB;colC\n1;2;3\n4;5;6\n')

    with pathlib.Path('two.csv').open('w') as f:
        f.write('colA;colB;colC\n7;8;9\n10;11;12\n')

if __name__ == '__main__':
    # Generate sample data in two csv files
    generate_sample_data()

    csv_header = 'colA;colB;colC\n'
    result = []

    # open each input csv file
    for input_filename in ['one.csv', 'two.csv']:
        with pathlib.Path(input_filename).open('r') as in_file:
            # read the whole file and store them as a list line by line
            content = in_file.readlines()

            # remove the header if there is one
            if content[0] == csv_header:
                content = content[1:]

            result.extend(content)

    # insert the header in the first line
    result.insert(0, csv_header)

    # open the csv to where the result should be stored
    with pathlib.Path('one_and_two.csv').open('w') as out_file:
        out_file.writelines(result)

    # check the result
    with pathlib.Path('one_and_two.csv').open('r') as result_file:
        print(result_file.read())

If you need to do more computation and checks on your file content I would suggest the csv package in python itself. I think pandas is a bit to much for your needs here.

buhtz
  • 10,774
  • 18
  • 76
  • 149