
I'm saving a lot of CSVs in parallel, without column names, so I can quickly concatenate them with cat in Unix rather than re-reading them into pandas. These CSVs have 1000+ columns, so I'd like to save a single copy of the column headers for when I re-aggregate them, but because the files are created in parallel, I'd like only the first job to pickle the pandas Index object.
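
For concreteness, a minimal sketch of what each parallel job writes (the part path and job_id are hypothetical, not my actual code):

import pandas as pd

# Each job writes its own headerless chunk; job_id is a hypothetical identifier.
df.to_csv("/dir/part_%d.csv" % job_id, header=False, index=False)

The pieces can then be combined later with something like cat /dir/part_*.csv > /dir/combined.csv.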

Normally, to avoid the race condition when writing a file, I'd use os.open() as per this question. However, when I tried to combine that with pickling:

cPickle.dump(df.columns, os.open("/dir/my_columns.pkl", os.O_WRONLY | os.O_CREAT | os.O_EXCL), -1)

it failed: the second argument to pickle.dump needs an object with a write attribute, and the integer file descriptor returned by os.open() does not have one. Is there an alternative that would allow me to safely save this object if multiple threads are racing to create it?

  • To avoid race conditions, lock the file for writing and release the lock when done. And don't open files like that, especially when multiple threads are trying to open the same file. Use the `with` statement. – msvalkon Apr 30 '14 at 07:53
  • I think your question is how to create a file object from a file descriptor. Or how to pass flags to the `open` function. – User Apr 30 '14 at 13:38

1 Answer


You can create the file using os.open, close it with os.close, and then reopen it with open:

import os
import cPickle

fileName = "/dir/my_columns.pkl"
try:
    # O_EXCL makes the creation fail if the file already exists,
    # so only one job continues beyond this point.
    os.close(os.open(fileName, os.O_CREAT | os.O_EXCL))
except OSError:
    raise RuntimeError(fileName + " could not be created.")

# Reopen in binary mode, since protocol -1 produces a binary pickle.
with open(fileName, 'wb') as f:
    cPickle.dump(df.columns, f, -1)
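
A related alternative (not part of the original answer, just a sketch) is to wrap the descriptor returned by os.open in a file object with os.fdopen, which gives pickle the write attribute it needs without closing and reopening the file:

import os
import cPickle

fileName = "/dir/my_columns.pkl"
try:
    # O_EXCL makes os.open fail with OSError if the file already exists,
    # so only the first job to get here receives a valid descriptor.
    fd = os.open(fileName, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
except OSError:
    raise RuntimeError(fileName + " could not be created.")

# os.fdopen turns the raw descriptor into a file object with a write method.
with os.fdopen(fd, 'wb') as f:
    cPickle.dump(df.columns, f, -1)

Either way, the atomicity comes from O_CREAT | O_EXCL: the operating system guarantees that only one of the racing jobs sees the open succeed.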