
I need to convert a folder of around 4,000 .txt files into a single .csv with two columns: (1) 'File Name' (the name of each file as it appears in the original folder); (2) 'Content' (all the text in the corresponding .txt file).

Here you can see some of the files I am working with.

The most similar question to mine is this one (Combine a folder of text files into a CSV with each content in a cell), but I could not get any of the solutions presented there to work.

The last one I tried was the Python code proposed in that question by Nathaniel Verhaaren, but I got exactly the same error as that question's author (even after implementing some of the suggestions):

import os
import csv

dirpath = 'path_of_directory'
output = 'output_file.csv'
with open(output, 'w') as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(['FileName', 'Content'])

    files = os.listdir(dirpath)

    for filename in files:
        with open(dirpath + '/' + filename) as afile:
            csvout.writerow([filename, afile.read()])
            afile.close()

    outfile.close()

Other questions that seemed similar to mine (for example, Python: Parsing Multiple .txt Files into a Single .csv File?, Merging multiple .txt files into a csv, and Converting 1000 text files into a single csv file) do not solve this exact problem, and I could not adapt their solutions to my case.

jcs

1 Answer


I had a similar requirement, so I wrote the following class:

import os
import pathlib
import glob
import csv
from collections import defaultdict

class FileCsvExport:
    """Generate a CSV file containing the name and contents of all files found"""
    def __init__(self, directory: str, output: str, header = None, file_mask = None, walk_sub_dirs = True, remove_file_extension = True):
        self.directory = directory
        self.output = output
        self.header = header
        self.pattern = '**/*' if walk_sub_dirs else '*'
        if isinstance(file_mask, str):
            self.pattern = self.pattern + file_mask
        self.remove_file_extension = remove_file_extension
        self.rows = 0

    def export(self) -> bool:
        """Return True if the CSV was created"""
        return self.__make(self.__generate_dict())

    def __generate_dict(self) -> defaultdict:
        """Finds all files recursively based on the specified parameters and returns a defaultdict"""
        csv_data = defaultdict(list)
        for file_path in glob.glob(os.path.join(self.directory, self.pattern), recursive = True):
            path = pathlib.Path(file_path)
            if not path.is_file():
                continue
            content = self.__get_content(path)
            name = path.stem if self.remove_file_extension else path.name
            csv_data[name].append(content)
        return csv_data

    @staticmethod
    def __get_content(file_path: pathlib.Path) -> str:
        with open(file_path) as file_object:
            return file_object.read()

    def __make(self, csv_data: defaultdict) -> bool:
        """
        Takes a defaultdict of {k, [v]} where k is the file name and v is a list of file contents.
        Writes out these values to a CSV and returns True when complete.
        """
        with open(self.output, 'w', newline = '') as csv_file:
            writer = csv.writer(csv_file, quoting = csv.QUOTE_ALL)
            if isinstance(self.header, list):
                writer.writerow(self.header)
            for key, values in csv_data.items():
                for duplicate in values:
                    writer.writerow([key, duplicate])
                    self.rows = self.rows + 1
        return True

It can be used like so:

...
myFiles = r'path/to/files/'
outputFile = r'path/to/output.csv'

exporter = FileCsvExport(directory = myFiles, output = outputFile, header = ['File Name', 'Content'], file_mask = '.txt')
if exporter.export():
    print(f"Export complete. Total rows: {exporter.rows}.")

In my example directory, this prints:

Export complete. Total rows: 6.

Note: rows does not count the header row, if one is written.

This generated the following CSV file:

"File Name","Content"
"Test1","This is from Test1"
"Test2","This is from Test2"
"Test3","This is from Test3"
"Test4","This is from Test4"
"Test5","This is from Test5"
"Test5","This is in a sub-directory"
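
Real .txt files usually span several lines, unlike the one-line test files above. Here is a minimal sketch, using only the standard csv module and a throwaway temporary file (the sample row and the demo.csv name are made up for illustration), of how QUOTE_ALL fields keep embedded newlines, commas and quotes intact when the CSV is read back:

```python
import csv
import os
import tempfile

# Hypothetical sample row: the content holds a newline, a comma and quotes.
rows = [["Test1", 'line one\nline two, with a comma and a "quote"']]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")

    # newline='' lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)

    # Read the file back with the same newline='' setting.
    with open(csv_path, newline="", encoding="utf-8") as f:
        read_back = list(csv.reader(f))

# True when the round trip preserved every field exactly.
print(read_back == rows)
```

A spreadsheet program may still display such a cell on several lines, but it remains a single CSV field.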

Optional parameters:

  • header: Takes a list of strings that will be written as the first line in the CSV. Default None.
  • file_mask: Takes a string that can be used to specify the file type; for example, .txt will cause it to only match .txt files. Default None.
  • walk_sub_dirs: If set to False, it will not search in sub-directories. Default True.
  • remove_file_extension: If set to False, it will cause the file name to be written with the file extension included; for example, File.txt instead of just File. Default True.
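
For the flat-folder case in the question, the same idea can probably be reduced to one function. This is only a sketch under assumptions: the name folder_to_csv is hypothetical, the files are assumed to be UTF-8, and errors="replace" is my choice so that an undecodable byte becomes a replacement character instead of raising. The question does not show its actual error, so this is not a diagnosis of it.

```python
import csv
from pathlib import Path


def folder_to_csv(directory, output, encoding="utf-8"):
    """Write one CSV row per .txt file in `directory`; return the row count.

    Hypothetical helper: the name, the default encoding and the
    errors="replace" policy are assumptions, not part of the class above.
    """
    rows = 0
    with open(output, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
        writer.writerow(["File Name", "Content"])
        # sorted() gives a stable row order; glob("*.txt") skips sub-directories.
        for path in sorted(Path(directory).glob("*.txt")):
            if not path.is_file():
                continue
            # Undecodable bytes are replaced with U+FFFD rather than raising.
            content = path.read_text(encoding=encoding, errors="replace")
            writer.writerow([path.name, content])
            rows += 1
    return rows
```

Calling folder_to_csv('path/to/files', 'path/to/output.csv') would write the header plus one row per file. Unlike the class, it keeps the .txt extension in the first column and does not descend into sub-directories.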
Lucan