3

I have quite a big data file that is 200% available memory and I want to rename the columns and save it to a new file with a different name.

When I do the rename on a small sample, things work as expected i.e.

df = pd.read_csv(path, encoding="ISO-8859-1", engine='python', nrows=10)
print_columns(df)

rename_columns(df)
print_columns(df)
df.to_csv(path_to_save)

That works and renames the columns as expected but only saves the sampled ten lines of the big file.

When loading very big files, there a few options in Python:

1) read and process the big file line-by-line

I did this last time on another large file, but do I actually need that when just renaming columns?

2) chunking in pandas:

chunksize = 100000
for chunk in pd.read_csv(path, chunksize=chunksize, encoding="ISO-8859-1", engine='python'):
    print_columns(chunk)
    rename_columns(chunk)
    print_columns(chunk)

Obviously, I rename each chunk but the big question I have is how do I stitch all chunks back together in the correct order and save the big one?

3) Is there actually a good old shell command that would do the column rename a bit easier?

As background, I prepare the data for import into a database but need to keep the source file as it is, thus saving to a different file name.

smci
  • 32,567
  • 20
  • 113
  • 146
Marvin.Hansen
  • 1,436
  • 1
  • 15
  • 29
  • The header is typically just the first line, If all you need to do is edit the one-line header but not change the data, you can always do it outside Python, or manually, even an perl/awk/sed replacement. Do you need to do it programmatically? – smci Apr 24 '20 at 10:11
  • and if you want to write all the intermediate chunks without header and concatenate them together, use `df.to_csv(... , header=None, mode='a')` which does an **append, with no header**. – smci Apr 24 '20 at 10:23
  • Marvin.Hansen: if you're not requiring Python/pandas solutions then don't put 'Python/pandas' in the title. Also, don't [edit solutions into the question and change the title to 'SOLVED'](https://stackoverflow.com/revisions/61405879/4). – smci Apr 24 '20 at 10:52

2 Answers2

4

Could you write just a one row csv file with your column names and then just use a text editor/command line to place the other rows in?

import csv
columns = [['A','B','C','D','E']]
file_name = "column_file.csv"
with open(file_name, mode='w') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csv_writer.writerow(columns)

Now you have a column_file.csv with the column names. Delete the first row of the large_file.csv file and put it into an new file called new_large_file.csv.

sed '1d' large_file.csv > new_large_file.csv

Now append the new_large_file.csv to the file you created with just the column headers and save it in the file new_large_file.csv

cat column_file.csv new_large_file > final_file.csv

Should still have a copy of the original using this just in case anything goes wrong, hope it helps.

WK123
  • 620
  • 7
  • 18
  • Marvin.Hansen: yes I first posted that advice 25 minutes ago and recommended using perl/awk/sed replacement. If your question doesn't require Python/pandas solutions then please edit it accordingly to not be misleading. – smci Apr 24 '20 at 10:37
  • No worries! Have a look at the astype function for pandas aswell you might be able to save some memory if its numerical data by reading in as a 32 bit compared to a 64 bit int, may be useful in future but have to be careful https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html – WK123 Apr 24 '20 at 10:40
  • 1
    WK123 I marked your answer as the solution and updated the question accordingly. Essentially, the key is, I have to batch rename a lot of columns and that is done so nicely in Python when compared to sed. The simple rename function shown in the code is actually quite large... Then just concatenating the new headers with the no-headers big file is just the best solution around. Thank you so much for this hint. – Marvin.Hansen Apr 24 '20 at 10:51
3

For large files a simple command line solution with the stream editor sed might be faster than a python script:

sed -e '1 {/SFID/ {s/Id/IgnoreId/; s/SFID/Id/}}' -I myfile.csv

This changes Id to IgnoreId and SFID to Id in the first line if it contains SFID. If other column header also contain the string Id (e.g. ImportantId) then you'll have to refine the regexes in the s command accordingly.