
I have several txt files. I am reading them in and I want to store them in a single file. My code so far that works is:

import pandas as pd
import glob
import csv
import os

col_lengths = {'Column1': range(0, 2),
               'Column2': range(3, 11),
              }
col_lengths = {k: set(v) for k, v in col_lengths.items()}

path = r"C:\myfolder"
all_files = glob.glob(os.path.join(path, "*.txt"))

# open once in text mode; newline="" avoids extra blank lines in the CSV on Windows
with open(os.path.join(path, "output_file.csv"), "w", newline="") as f:
    for i, filename in enumerate(all_files):
        print(filename)
        df = pd.read_fwf(
            filename,
            colspecs=[(min(x), max(x) + 1) for x in col_lengths.values()],
            header=None,
            names=col_lengths,
            converters={'Column1': str,
                        'Column2': str,
                       }
        )
        df.to_csv(
            f,
            header=(i == 0),  # write the header only for the first file
            sep=";",
            decimal=",",
            quoting=csv.QUOTE_MINIMAL)

The code works so far. When I thought about it, though, I wasn't sure about the df.to_csv call. I open the file once with `with open(...)` before the loop, so why does each iteration actually append the data? I am not passing anything like mode='a'. Does it append because the file handle is still open from the `with open` before? And if I dropped the `with open` but kept the loop over the filenames, would to_csv just rewrite the CSV each time, so that in the end the file would only contain the data of the last file?
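A minimal sketch of both situations (the temp-file name is made up for the demo): writes to a handle that stays open accumulate one after another, while reopening with "w" inside the loop would truncate the file each time, leaving only the last iteration's data:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "append_demo.txt")  # hypothetical demo file

# One open handle: every write continues where the previous one ended.
with open(path, "w") as f:
    f.write("first\n")
    f.write("second\n")
with open(path) as f:
    print(f.read())   # first\nsecond\n  -> both writes are there

# Reopening with "w" on each iteration: every open() truncates the file,
# so only the data from the last iteration survives.
for line in ("first\n", "second\n"):
    with open(path, "w") as f:
        f.write(line)
with open(path) as f:
    print(f.read())   # second\n  -> earlier data was wiped
```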

And what about the difference between this approach (with open plus write mode) and dropping the `with open` and instead passing append mode to the df.to_csv statement? Especially with regard to very large files and limited RAM. Is it correct that append mode has to open the file again on every call, and once the file has become really large after some files were processed, this could throw a memory error? Whereas with `with open` and a single handle this would not occur? But why exactly would it not occur? Is it because the file itself is stored on disk, so not much RAM is needed, since the existing data is never read back into RAM? I have trouble picturing this, because with `with open` I also end up with a really large CSV file at some point, but presumably not inside the RAM?
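For comparison, a sketch of the append-mode variant, using two tiny stand-in DataFrames instead of the real fixed-width files. Note that appending does not read the existing CSV back into memory: the operating system simply writes at the end of the file on disk, so RAM usage depends only on the DataFrame currently being written, not on how large the output has grown:

```python
import pandas as pd

# Stand-in data for the per-file DataFrames from the question.
frames = [pd.DataFrame({"Column1": ["01"], "Column2": ["aaa"]}),
          pd.DataFrame({"Column1": ["02"], "Column2": ["bbb"]})]

out_path = "output_file.csv"
for i, df in enumerate(frames):
    df.to_csv(
        out_path,
        mode="w" if i == 0 else "a",  # truncate once, then append
        header=(i == 0),              # header only for the first chunk
        sep=";",
        decimal=",",
        index=False,
    )
```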

PSt
  • Yes, while a file is open, writes to it are appended to what has been written before. The mode parameter specifies what happens/is allowed when the file is initially opened – JonSG Jan 13 '23 at 16:31
  • Thanks and what about the differences compared to the approach of not using a with open, but instead the df.to_csv statement with append mode? – PSt Jan 13 '23 at 16:34
  • So, at a high level the `a` parameter to `open()` says that when the file is initially opened with the intention to potentially write to it, the file is not emptied out. The `w` mode says we want to potentially write, but we will initially start with an empty (or emptied-out) file. This is a simplification; you can read more about the various mode parameters here: https://stackoverflow.com/questions/1466000/difference-between-modes-a-a-w-w-and-r-in-built-in-open-function – JonSG Jan 13 '23 at 16:39
  • Ok, but what about the other approach I mentioned: What if I do not combine with open and writing to a csv file, but instead each time append the dataframe (using df_to_csv with append mode) to one file? – PSt Jan 13 '23 at 16:52
  • using `open()` in a loop tends to be expensive. Better in general to open once and stream data into it. – JonSG Jan 13 '23 at 16:55
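The mode difference described in the comments can be seen in a few lines (the file name is made up for the demo): "w" starts from an empty file each time it opens, while "a" keeps the existing content and continues writing at the end:

```python
path = "modes_demo.txt"      # hypothetical demo file

with open(path, "w") as f:   # "w": file is emptied out on open
    f.write("one\n")
with open(path, "a") as f:   # "a": existing content is kept, writes go to the end
    f.write("two\n")
with open(path) as f:
    print(f.read())          # one\ntwo\n

with open(path, "w") as f:   # "w" again: previous content is discarded
    f.write("three\n")
with open(path) as f:
    print(f.read())          # three\n
```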

0 Answers