I know that with pandas you can use the CSV writer in "append" mode to add new rows to a file, but is there a way to add a new column to an existing file without first having to load the whole file, like:
df = pd.read_csv/excel/parquet("the_file.csv")
The reason I ask is that I sometimes deal with huge datasets, and loading them into memory is expensive when all I want to do is add one column to the file.
As an example, I have a huge dataset already stored. I load one column from it and run a calculation on it, which gives me another column of data. Now I'd like to add that new column, which has the same number of rows, to the file without first loading the whole file. Is that possible?
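For reference, the row-wise append I mean looks roughly like the sketch below. The temporary file name and the values are made up for illustration only; the point is that `mode="a"` with `header=False` adds rows to the end of the file without reading what is already there, and I'm looking for a column-wise equivalent.

```python
import os
import tempfile

import pandas as pd

# Hypothetical throwaway file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "append_demo.csv")

# Write the initial rows along with the header
pd.DataFrame({"col1": [1.0, 2.0], "label": [0, 1]}).to_csv(path, index=False)

# Append more rows: mode="a" writes at the end of the existing file, and
# header=False avoids writing the column names a second time
pd.DataFrame({"col1": [3.0], "label": [1]}).to_csv(
    path, mode="a", header=False, index=False
)

# Read the file back only to confirm that the rows were appended
result = pd.read_csv(path)
print(result)
```

This works for rows because new rows land at the end of the file, whereas a new column would have to touch every existing line.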
Here is some reproducible code if needed. I'm using this on much larger datasets, but the premise is exactly the same:
from sklearn.datasets import make_classification
from pandas import DataFrame, read_csv
# Make a fake binary classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_classes=2)
# Turn it into a dataframe
df = DataFrame(X, columns=['col1','col2','col3','col4','col5','col6','col7','col8','col9','col10'])
df['label'] = y
# Save the file
df.to_csv("./the_file.csv", index=False)
# Now, load one column from that file
main_col = read_csv("./the_file.csv", usecols=["col1"])
# Perform some random calculation to get a new column
main_col['new_col'] = main_col['col1'] / 2
Now, how can you add main_col['new_col'] to ./the_file.csv without first importing the entire file, adding the column, then resaving?