I have several .txt files. I am reading them in and want to store them all in a single CSV file. My code so far, which works, is:
import pandas as pd
import glob
import csv
import os
col_lengths = {'Column1': range(0, 2),
               'Column2': range(3, 11),
               }
col_lengths = {k: set(v) for k, v in col_lengths.items()}
path = r"C:\myfolder"
all_files = glob.glob(os.path.join(path, "*.txt"))
with open(os.path.join(path, "output_file.csv"), "wb") as f:
    for i, filename in enumerate(all_files):
        print(filename)
        df = pd.read_fwf(
            filename,
            colspecs=[(min(x), max(x) + 1) for x in col_lengths.values()],
            header=None,
            names=col_lengths,
            converters={'Column1': lambda x: str(x),
                        'Column2': lambda x: str(x),
                        }
        )
        df.to_csv(
            f,
            header=(i == 0),
            sep=";",
            decimal=",",
            mode="wb",
            quoting=csv.QUOTE_MINIMAL)
The code works so far. When I thought about it, though, I wasn't sure about the df.to_csv statement. There is a with open before it, so why is it actually appending the data? There is nothing like mode='a'. Does it append because the file is still open due to the surrounding with open? If I dropped the with open but kept the loop over the filenames, would it just rewrite the CSV on each iteration, so that in the end only the data of the last file is stored?
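To make sure I am describing that variant correctly, this is roughly what I mean by "not having the with open" (an untested sketch, reusing path, all_files and col_lengths from above; out_path is just a placeholder name):

# No surrounding "with open": to_csv gets a path, not an open handle,
# and its default mode is 'w', so every iteration starts the file from
# scratch and overwrites whatever the previous iteration wrote.
out_path = os.path.join(path, "output_file.csv")
for i, filename in enumerate(all_files):
    df = pd.read_fwf(
        filename,
        colspecs=[(min(x), max(x) + 1) for x in col_lengths.values()],
        header=None,
        names=col_lengths,
        converters={'Column1': str, 'Column2': str},
    )
    df.to_csv(
        out_path,                     # path instead of the open handle f
        header=(i == 0),
        sep=";",
        decimal=",",
        quoting=csv.QUOTE_MINIMAL)    # mode='w' is the default here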
And what are the differences between this approach (a with open in 'wb' mode) and instead using append mode in the df.to_csv call, especially with regard to very large files and limited RAM? Is it correct that append mode opens the file again on every call, so once the file has become really large (after some files have already been processed) this could throw a memory error, because the append needs to open the file again? Whereas with with open and just writing to the handle this would not occur? But why exactly would it not occur: because the file itself is stored on disk and not much RAM is needed, since the existing data is never read back into RAM? I have trouble picturing this, because with with open I also end up with a really large CSV file at some point, just presumably on disk and not in RAM? For clarity, the append variant I mean looks roughly like the sketch below.
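By the append approach I mean something like this (again just an untested sketch, reusing the same variables as above):

# Append mode: no surrounding "with open"; pandas opens the output file
# again on every to_csv call, positioned at the end, and appends to it.
out_path = os.path.join(path, "output_file.csv")
for i, filename in enumerate(all_files):
    df = pd.read_fwf(
        filename,
        colspecs=[(min(x), max(x) + 1) for x in col_lengths.values()],
        header=None,
        names=col_lengths,
        converters={'Column1': str, 'Column2': str},
    )
    df.to_csv(
        out_path,
        mode="a",                     # append instead of overwrite
        header=(i == 0),              # write the header only once
        sep=";",
        decimal=",",
        quoting=csv.QUOTE_MINIMAL)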