I have quite a big data file, about 200% of available memory, and I want to rename its columns and save the result to a new file with a different name.
When I do the rename on a small sample, things work as expected, i.e.:
import pandas as pd

df = pd.read_csv(path, encoding="ISO-8859-1", engine='python', nrows=10)
print_columns(df)
rename_columns(df)
print_columns(df)
df.to_csv(path_to_save, index=False)  # index=False keeps the pandas row index out of the output file
That works and renames the columns as expected, but it only saves the ten sampled rows of the big file.
When loading very big files, there are a few options in Python:
1) read and process the big file line by line
I did this last time on another large file, but do I actually need that when just renaming columns, since only the header line actually changes?
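If that is the way to go, here is a minimal sketch of what I mean, assuming the rename amounts to swapping the header line for a fixed string (new_header is a placeholder for my real renamed column names):

import shutil

new_header = "col_a,col_b,col_c\n"  # placeholder: the actual renamed column names

with open(path, encoding="ISO-8859-1") as src, \
     open(path_to_save, "w", encoding="ISO-8859-1") as dst:
    next(src)                     # skip the original header line
    dst.write(new_header)         # write the renamed header instead
    shutil.copyfileobj(src, dst)  # stream the rest of the file unchanged

That would never hold more than a small buffer in memory.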
2) chunking in pandas:
chunksize = 100000
for chunk in pd.read_csv(path, chunksize=chunksize, encoding="ISO-8859-1", engine='python'):
print_columns(chunk)
rename_columns(chunk)
print_columns(chunk)
Obviously, I rename each chunk, but the big question is: how do I stitch all the chunks back together in the correct order and save the big file?
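What I have in mind is something like the sketch below: since read_csv yields the chunks in file order, appending each one to the output should preserve the row order, with the header written only for the first chunk (the mode/header juggling is my assumption of the usual pattern):

chunksize = 100000
first = True
for chunk in pd.read_csv(path, chunksize=chunksize, encoding="ISO-8859-1", engine='python'):
    rename_columns(chunk)
    # Write (mode="w") with header for the first chunk, then append (mode="a") without it;
    # chunks arrive sequentially, so the original row order is kept.
    chunk.to_csv(path_to_save, mode="w" if first else "a", header=first, index=False)
    first = False

Is that the right pattern, or is there a cleaner way to do it?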
3) Is there actually a good old shell command that would make the column rename a bit easier?
As background: I am preparing the data for import into a database but need to keep the source file as it is, hence saving under a different file name.