
I know that with Pandas you can use the CSV writer in "append" mode to add new rows to a file, but I'm wondering: is there a way to add a new column to an existing file without having to first load the whole file, like:

df = pd.read_csv/excel/parquet("the_file.csv")

The reason I ask is that I'm sometimes dealing with huge datasets, and loading them into memory is expensive when all I'd like to do is add one column to the file.
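To be clear, adding rows that way is simple enough; a minimal sketch, with made-up data just to illustrate the row case:

import pandas as pd

# Hypothetical extra rows; appending rows only needs append mode and no header
extra_rows = pd.DataFrame({"col1": [0.1], "col2": [0.2]})
extra_rows.to_csv("the_file.csv", mode="a", header=False, index=False)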

As an example, I have a huge dataset stored already; I load one column from that dataset and perform a calculation on it, which gives me another column of data. Now I'd like to add that new column, same number of rows and everything, to the file without first importing it. Possible?

Here is some reproducible code if needed. I'm using this on much larger datasets, but the premise is exactly the same regardless:

from sklearn.datasets import make_classification
from pandas import DataFrame, read_csv

# Make a fake binary classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_classes=2)

# Turn it into a dataframe
df = DataFrame(X, columns=['col1','col2','col3','col4','col5','col6','col7','col8','col9','col10'])
df['label'] = y

# Save the file
df.to_csv("./the_file.csv", index=False)

# Now, load one column from that file
main_col = read_csv("./the_file.csv", usecols=["col1"])

# Perform some random calculation to get a new column
main_col['new_col'] = main_col['col1'] / 2

Now, how can you add main_col['new_col'] to ./the_file.csv, without first importing the entire file, adding the column, then resaving?

  • You need to give specifics on the format of your file and the calculations involved. Provide input/output examples – mozway Oct 17 '21 at 07:53
  • None of that matters. For ease, we can assume the stored file is in .CSV format. I read in one column of that file with the command `main_col = pd.read_csv("./the_file.csv", usecols=['main_col'])`. Now, for argument's sake, let's say I just make a direct copy of that column with `added_col = main_col`. How can I add `added_col` to the .CSV file without importing the entire file just to save it again? The calculations/format of the data are irrelevant to the question I'm asking. – wildcat89 Oct 17 '21 at 07:59
  • All of that matters, absolutely. Answering your question requires parsing the lines. It is impossible to read column by column, only line by line. – mozway Oct 17 '21 at 08:04
  • I'm not sure I follow, but I'll update my question with reproducible code to experiment with... – wildcat89 Oct 17 '21 at 08:07
  • Updated. Not sure if that helps much but maybe? – wildcat89 Oct 17 '21 at 08:16
  • This is fundamentally a duplicate of many other "can I add something in the middle of a file without rewriting the rest" questions. Generally no, you can't. – tripleee Oct 17 '21 at 08:18
  • @tripleee Feel free to cite some references if this question has already been asked? – wildcat89 Oct 17 '21 at 08:19
  • https://stackoverflow.com/questions/2398129/writing-to-the-middle-of-the-file-without-overwriting-data is specific to Windows and https://stackoverflow.com/questions/19584097/write-in-an-existing-file-without-overwriting-in-fortran asks about Fortran but those top search engine hits should give you the gist. It's not architecturally impossible to write a file system which supports this but I don't know of an OS which actually implements this feature, and it seems unlikely that the added complexity would make it worthwhile for this feature alone. – tripleee Oct 17 '21 at 08:25
  • @MattWilson In Python it's hardly possible; you can do this with the Linux command mentioned here: https://stackoverflow.com/questions/42052462/adding-a-new-column-to-a-csv-file – Rakesh Kumar Oct 17 '21 at 08:27
  • @MattWilson so this is a simple csv without quotes or nasty stuff that makes it difficult to parse. Now regarding the calculations, will the real use case be something that only depends on one row? If yes, loop over the lines, perform calculation, save line to new file, when finished replace file – mozway Oct 17 '21 at 08:28
  • @RakeshKumar What you do with sed you can do with Python. `sed -i` does **not** work in place; it makes a transparent copy of the file and renames it at the end. – mozway Oct 17 '21 at 08:29
  • 1
    I'm not super familiar with Parquet but based on quick scanning of documentation, it could _reduce_ the amount of information you have to rewrite rather drastically, compared to simple row-based formats. Another option might be to use a database, where the engine transparently takes care of any internal reorganization when you change the schema. – tripleee Oct 17 '21 at 08:31
  • @mozway Unsure, but I think it works in chunks rather than loading all the data into memory. – Rakesh Kumar Oct 17 '21 at 08:34
  • @MattWilson Maybe you can try reading this CSV in chunks and writing back each chunk after adding the column (roughly as sketched after this comment thread). – Rakesh Kumar Oct 17 '21 at 08:35
  • @mozway Ah yes, now I see what you're saying. Well, in the real use case, no, I would be incorporating many rows in the calculation; however, the methodology could be the same, although creating another dataset just as large as the current one isn't TOO appealing. I wonder if there's a way to read in sections of rows of the first file, write to the second file, then delete that chunk of rows from the original file? So in essence, as the new file size increases, the old one decreases? – wildcat89 Oct 17 '21 at 08:37
  • @RakeshKumar Yes, I actually saw this post regarding that, however when I looked into the chunksize feature, it said "Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks...." so I don't know if that would work?? https://stackoverflow.com/questions/59388917/how-to-read-a-large-csv-file-without-opening-it-to-have-a-sum-of-numbers-for-eac – wildcat89 Oct 17 '21 at 08:39
  • @MattWilson Then I am afraid what you want to do is not possible with flat files. It would be like wanting to drive only the even miles on a road. You can read only a column to save memory, but you'll have to read the file as many times as there are columns to process. Now if you're working with another format (like a database), that should be more efficient. – mozway Oct 17 '21 at 08:41
  • 1
    @MattWilson chunks actually works, i had same problem. When we apply chunk size, it provides file reader object to loop.i I supposed to read a csv & load to db. Due to memory constraints my 1.2 GB file was crashing my Docker. So i applied this technique. Here you can see https://github.com/rakeshkaswan/datatask-interesting-subreddits-challenge/blob/main/pipeline/ingestion.py line number 65. – Rakesh Kumar Oct 17 '21 at 08:50
  • @RakeshKumar Thanks! I'll look into that more! :D – wildcat89 Oct 17 '21 at 09:17
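For reference, the chunked approach Rakesh Kumar describes might look roughly like this (a sketch only; with `chunksize`, `read_csv` returns an iterator of DataFrames, so only one chunk is held in memory at a time):

import pandas as pd

# Process the source CSV one chunk at a time and append each processed chunk
# to a new file, so only `chunksize` rows are ever held in memory at once
first = True
for chunk in pd.read_csv("./the_file.csv", chunksize=10_000):
    chunk["added_col"] = chunk["col10"] / 2  # per-chunk calculation
    chunk.to_csv("new_file.csv", mode="w" if first else "a", header=first, index=False)
    first = False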

1 Answer


Working with some of the feedback I received in the comments, here is my hacky workaround to this problem. It's not efficient, but it shows what I want to accomplish. I'll look into the chunksize thing too, as per @RakeshKumar (roughly as sketched above):

# Idea 1
# Start a new file. The columns are known, so this is fine, though not very efficient
import csv
from pandas import read_csv

columns = ['col1','col2','col3','col4','col5','col6','col7','col8','col9','col10','label','added_col']

with open('new_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(columns)

    # A skip-rows counter
    skipRows = 0

    # As long as there are rows left in the first file....
    while True:

        # Read in one data row from the file, keeping the header (row 0) and
        # skipping the data rows we've already processed
        df = read_csv("./the_file.csv", nrows=1, skiprows=range(1, skipRows + 1), header=0)

        # Once we've run out of rows in the first file, stop the loop
        if df.empty:
            break

        # Perform your calculation
        df['added_col'] = df['col10'] / 2

        # Write the new row to the new file
        writer.writerow(df.iloc[0, :])

        # Do the next line
        skipRows += 1

So in effect, we're only reading in one line at a time from the first file and appending it to the new file, and when we're done we could just delete the first file. Not efficient, but it would keep the memory load down when using giant datasets!
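A leaner variant of the same idea, closer to what mozway suggested in the comments, streams the original file once with the csv module instead of re-reading it with read_csv for every row (a sketch, assuming a plain comma-separated file with a header and no quoting surprises):

import csv

# Stream the original file once: read each row, compute the new column,
# and write the widened row to a new file; swap the files afterwards if desired
with open("the_file.csv", newline="") as src, open("new_file.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)

    header = next(reader)
    col10_idx = header.index("col10")
    writer.writerow(header + ["added_col"])

    for row in reader:
        writer.writerow(row + [float(row[col10_idx]) / 2])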
