
I have a function set up with Pandas that runs through a large number of rows in input.csv and collects the results into a Series. It then writes the Series to output.csv.

However, if the process is interrupted (for example, by a crash or a manual stop), the program terminates and all data that would have gone into the csv is lost.

Is there a way to write the data continuously to the csv, regardless of whether the function finishes for all rows?

Preferably, each time the program starts, a blank output.csv would be created and then appended to while the function is running.

import pandas as pd

df = pd.read_csv("read.csv")

def crawl(a):
    #Create x, y
    return pd.Series([x, y])

df[["Column X", "Column Y"]] = df["Column A"].apply(crawl)
df.to_csv("write.csv", index=False)
P A N
  • You can write in chunks as you go and append to the csv; use `mode='a', header=False` after the first write. – Padraic Cunningham Jun 27 '15 at 15:21
  • Also does the order matter? – Padraic Cunningham Jun 27 '15 at 15:34
  • Do you mean the order of the columns? If so, yes they need to be in a certain order. – P A N Jun 27 '15 at 15:38
  • No I meant the order of the rows, you could use multiprocessing and write to multiple files, then simply create a single file from all the files – Padraic Cunningham Jun 27 '15 at 15:45
  • That sounds interesting but I'm using `selenium` as a driver for writing variables to the csv so perhaps it would be a bit taxing to open multiple browsers and have them parse thru the csv simultaneously. But multiprocessing sounds interesting as a concept. The order of the rows doesn't matter though. – P A N Jun 27 '15 at 15:49
  • @PadraicCunningham If you have a working example of the code I'd be happy to take a look. I tried this solution out-of-the-box but nothing was written: http://stackoverflow.com/questions/28387002/keep-header-while-appending-to-pandas-dataframe-w-python – P A N Jun 27 '15 at 15:57
  • `if writeHeader is True: df.to_csv("file.csv", mode="a", header="True", index=False) writeHeader = False else: df.to_csv("file.csv", mode="a", header="False", index=False)` – P A N Jun 27 '15 at 15:58
  • You can use `if os.path.isfile()`. http://stackoverflow.com/questions/30991541/pandas-write-csv-append-vs-write/30991707#30991707 – Padraic Cunningham Jun 27 '15 at 16:10
  • How did you check if anything was written? The `if os.path.isfile()` is not going to cause much overhead and is easy to apply the logic – Padraic Cunningham Jun 27 '15 at 16:11
  • Actually the best way would be to create a new file each time the program starts, and then append for each pass in the function. – P A N Jun 27 '15 at 16:13
  • Yes but how long does the function run for? Do you repeatedly call it? – Padraic Cunningham Jun 27 '15 at 16:14
  • The function runs for each row in `read.csv`. Output goes to `write.csv`. – P A N Jun 27 '15 at 16:15
  • ok so easy to split it up into multiple files then, it will be trivial to create a single file once you store them all together. appending will work fine though as you won't lose data opening for appending as opposed to opening for writing – Padraic Cunningham Jun 27 '15 at 16:16
  • Another option would be to write to a NamedTemporaryFile then use shutil.move to continually replace a file on your file system every n rows or whatever period you want – Padraic Cunningham Jun 27 '15 at 16:32
  • @PadraicCunningham That might be an option. Will have to look at it. – P A N Jun 27 '15 at 16:41
  • Here is an example: http://stackoverflow.com/questions/30776900/python-how-to-add-a-space-string-to-the-end-of-every-line-in-a-text-conf/30777077#30777077 – Padraic Cunningham Jun 27 '15 at 16:49
  • You will need to do the writing as you crawl; creating the full columns first is not going to work – Padraic Cunningham Aug 23 '15 at 10:24
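
For reference, a minimal sketch of the `os.path.isfile()` pattern suggested in the comments above; the file name and helper name are illustrative, not from the question:

import os

import pandas as pd

OUT_CSV = "output.csv"  # illustrative output path

def append_row(df_row):
    # write the header only when the file does not exist yet;
    # every later call appends data rows without repeating it
    if os.path.isfile(OUT_CSV):
        df_row.to_csv(OUT_CSV, mode="a", header=False, index=False)
    else:
        df_row.to_csv(OUT_CSV, index=False)

To get the fresh output.csv on every start that the question asks for, remove the file once at startup (e.g. with `os.remove`) before the first call.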

4 Answers


This is a possible solution that appends the data to a new file as it reads the csv in chunks. If the process is interrupted, the new file will contain all the information read up until the interruption.

import pandas as pd

# csv file to be read in
in_csv = '/path/to/read/file.csv'

# csv file to write data to
out_csv = '/path/to/write/file.csv'

# get the number of lines of the csv file to be read
with open(in_csv) as f:
    number_lines = sum(1 for row in f)

# size of the chunks of data to write to the csv
chunksize = 10

# loop through the data, appending each chunk to the new file
# (starting at 1 skips the header row of the input file)
for i in range(1, number_lines, chunksize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=chunksize,  # number of rows to read at each loop
                     skiprows=i)       # skip the rows already read

    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a')  # append each chunk to the csv file
Tom Patel

In the end, this is what I came up with. Thanks for helping out!

import pandas as pd

df1 = pd.read_csv("read.csv")

run = 0

def crawl(a):

    global run
    run = run + 1

    #Create x, y

    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])

    if run == 1:
        # first pass: create the file and write the header
        df2.to_csv("output.csv")
    else:
        # later passes: append the row without repeating the header
        df2.to_csv("output.csv", header=None, mode="a")

df1["Column A"].apply(crawl)
P A N
  • If you have suggestions for improvements, please post a full answer and I will change my selected answer accordingly. – P A N Aug 24 '15 at 17:57
  • This won't write the data if your program crashes; you will still lose everything – Padraic Cunningham Aug 26 '15 at 13:39
  • @PadraicCunningham It will write the data for successful passes of `crawl(a)`. But if there's a crash in the current pass, that data will be lost. Not sure how to prevent that except writing to the csv instantly after `x` and `y` have been attained. – P A N Aug 26 '15 at 18:45
  • You can catch the exception and write in the except block. Also, I think `df2.to_csv(f, header=None, mode="a") if os.path.isfile(f) else df2.to_csv(f)` is what your ifs are basically doing; I am not seeing how the global fits into the whole process either – Padraic Cunningham Aug 26 '15 at 18:47
  • What can happen to constitute an unsuccessful crawl? I think writing to a tempfile and doing a shutil.move after every `n` iterations would be a good approach, .move is atomic either all or nothing at least on linux. Lastly are you sequentially crawling based on some data, how do you decide where to start again after a crash? – Padraic Cunningham Aug 26 '15 at 18:50
  • @PadraicCunningham Conceptually, if there's something wrong in the code or data source for deriving `x, y`. But more practically for me, because I sometimes interrupt the program and then want to continue later. – P A N Aug 26 '15 at 18:54
  • If you want to interrupt, catch the keyboard interrupt and move whatever has been written to the current tempfile to your csv – Padraic Cunningham Aug 26 '15 at 18:56
  • @PadraicCunningham Thanks for the suggestion! At the moment I simply check manually where it left off, and delete previous rows from the source spreadsheet that the program is crawling sequentially through. Not an automatic system unfortunately :) – P A N Aug 26 '15 at 18:59
  • No worries, if you do catch the exceptions, you can also output where you are in your input file so at least you don't have to go comparing manually; you could do it all programmatically – Padraic Cunningham Aug 26 '15 at 19:01
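
Putting the comments' tempfile suggestion together, a possible sketch; the `crawl` stub and column names are placeholders for the asker's real scraping logic:

import shutil
import tempfile

import pandas as pd

def crawl(a):
    # stand-in for the real logic that produces x and y
    return a, len(str(a))

def crawl_all(df, out_csv="output.csv"):
    # buffer rows in a named temporary file created next to the output,
    # so the final move stays on one filesystem (where it is atomic)
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".csv",
                                      dir=".", delete=False)
    try:
        for i, value in enumerate(df["Column A"]):
            x, y = crawl(value)
            pd.DataFrame([[x, y]], columns=["X", "Y"]).to_csv(
                tmp, header=(i == 0), index=False)
            tmp.flush()  # push buffered rows to disk after every pass
    except KeyboardInterrupt:
        pass  # keep whatever has been written so far
    finally:
        tmp.close()
        shutil.move(tmp.name, out_csv)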

I would suggest this:

with open("write.csv", "a") as f:
    df.to_csv(f, header=False, index=False)

The argument "a" appends the new df to the existing file, and the file is closed once the with block finishes, so you keep all of your intermediate results.
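
Combined with chunked reading, that pattern might be used like this; `process_chunk` is a hypothetical stand-in for the real per-chunk work:

import pandas as pd

def process_chunk(chunk):
    # hypothetical stand-in: derive the two output columns per row
    return pd.DataFrame({"Column X": chunk["Column A"],
                         "Column Y": chunk["Column A"]})

# write the header once, then append each chunk as it completes
pd.DataFrame(columns=["Column X", "Column Y"]).to_csv("write.csv", index=False)

for chunk in pd.read_csv("read.csv", chunksize=100):
    result = process_chunk(chunk)
    with open("write.csv", "a") as f:
        result.to_csv(f, header=False, index=False)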

Ben K.

I've found a solution to a similar problem by looping over the dataframe with iterrows() and saving each row to the csv file, which in your case could look something like this:

for ix, row in df.iterrows():
    # assign the result back to the frame; modifying `row` alone would
    # not update df, since iterrows() yields copies of each row
    df.at[ix, 'Column A'] = crawl(row['Column A'])

    # if you wish to maintain the header, write it only for the first row
    # (assumes a default RangeIndex, so labels match positions)
    if ix == 0:
        df.iloc[ix:ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8')
    else:
        df.iloc[ix:ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8', header=False)
tmsss