8

I'm trying to append a pandas DataFrame (single column) to an existing CSV, much like this post, but it's not working! Instead my column is added at the bottom of the csv, and repeated over and over (rows in csv >> size of column). Here's my code:

with open(outputPath, "a") as resultsFile:
    print len(scores)
    scores.to_csv(resultsFile, header=False)
    print resultsFile

Terminal output:4032 <open file '/Users/alavin/nta/NAB/results/numenta/artificialWithAnomaly/numenta_art_load_balancer_spikes.csv', mode 'a' at 0x1088686f0>

Thank you in advance!

Community
  • 1
  • 1
BoltzmannBrain
  • 5,082
  • 11
  • 46
  • 79
  • 1
    two additional bits of info would be helpful: 1) what do the existing contents of `resultsFile` look like? (confirm by inspecting the file manually) and 2) what does the `scores` DataFrame look like? (`scores.head(10)` should suffice) – Paul H Jan 08 '15 at 18:44
  • 1
    It's appending the `scores` data frame at the end of the file because that's how the pandas `to_csv` functionality works. If you want to append `scores` as a new column on the original csv data frame then you would need to read the csv into a data frame, append the `scores` column and then write it back to the csv. – alacy Jan 08 '15 at 18:45
  • `resultsFile` is a csv of 5 columns: 'timestamp', 'value', 'aaa', 'bbb', 'label'. I would like the 6th to be the `scores` DataFrame. I've verified all columns are the same length. `scores` has a column header 's'. – BoltzmannBrain Jan 08 '15 at 19:08
  • @aus_lacy I should've clarified, I'm trying to do this without reading in the csv... – BoltzmannBrain Jan 08 '15 at 19:19
  • @alavin89 I don't think it's possible to append the column to the original data frame within the csv without opening the file and parsing the data since python has no way of knowing that there is a data frame in the csv to append to. – alacy Jan 08 '15 at 19:23
  • @alavin89, as a side note, if you happen `df.to_csv` your data frame's to `f2.csv` file, and your existing file called `f1.csv`... also you're using `*nix`, then you can just use: **paste**: `paste -d", " f1.csv f2.csv > output.csv` – Anzel Jan 08 '15 at 21:59

2 Answers2

12

Like what @aus_lacy has already suggested, you just need to read the csv file into a data frame first, concatenate two data frames and write it back to the csv file:

supposed your existing data frame called df:

df_csv = pd.read_csv(outputPath, 'your settings here')

# provided that their lengths match
df_csv['to new column'] = df['from single column']

df_csv.to_csv(outputPath, 'again your settings here')

That's it.

Anzel
  • 19,825
  • 5
  • 51
  • 52
  • 2
    I'm trying to avoid opening and reading in all that data, but this does work :) – BoltzmannBrain Jan 08 '15 at 20:05
  • @alavin89, do you have to use python? – Anzel Jan 08 '15 at 20:29
  • @alavin89, then it will be difficult, provided that you still need to open+read each line of your csv to find the linefeed and append your new column. I don't like **DiskIO**-wise you can have an easy solution – Anzel Jan 08 '15 at 21:55
  • For some reason, to_csv() is adding data in new rows, I want to add the dataframe in new columns. Can you please help? http://pythonfiddle.com/copy-csv-and-dataframe – Veronica Aug 27 '20 at 00:30
  • @Veronica are you certain the length of two dataframes actually matched? – Anzel Aug 27 '20 at 00:34
  • Yes. I made a small mistake. I wasn't reading the new csv using 'pd.read_csv()' but created a new dataframe. Fixed it. Thank you for your time and solution above. Still helping newbies :) – Veronica Aug 27 '20 at 00:37
1

I find the solution problematic, if many columns are to be added to a large csv file iteratively.

A solution would be to accept the csv file to store a transposed dataframe. i.e. headers works as indices and vice versa.

The upside is that you don't waste computation power on insidious operations.

Here is operation times for regular appending mode, mode='a', and appending column approach for series with length of 5000 appended 100 times:

enter image description here

The downside is that you have to transpose the dataframe to get the "intended" dataframe when reading the csv for other purposes.

Code for plot:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

col = []
row = []
N = 100

# Append row approach
for i in range(N):
    t1 = dt.datetime.now()
    data = pd.DataFrame({f'col_{i}':np.random.rand(5000)}).T
    data.to_csv('test_csv_data1.txt',mode='a',header=False,sep="\t")
    t2 = dt.datetime.now()
    row.append((t2-t1).total_seconds())

# Append col approach
pd.DataFrame({}).to_csv('test_csv_data2.txt',header=True,sep="\t")
for i in range(N):
    t1 = dt.datetime.now()
    data = pd.read_csv('test_csv_data2.txt',sep='\t',header=0)
    data[f'col_{i}'] = np.random.rand(5000)
    data.to_csv('test_csv_data2.txt',header=True,sep="\t")
    t2 = dt.datetime.now()
    col.append((t2-t1).total_seconds())
    
t = pd.DataFrame({'N appendices':[i for i in range(N)],'append row':row,'append col':col})
t = t.set_index('N appendices')

DisabledWhale
  • 771
  • 1
  • 7
  • 15