I'm saving a lot of CSVs in parallel, without column names, so that I can use cat to quickly concatenate them in Unix. These CSVs have 1000+ columns, so I'd like to save a single copy of the column headers for when I re-aggregate them. Because I am creating these files in parallel, I'd like only the first job to pickle the pandas index object.
Normally, to avoid the race condition when writing a file, I'd use os.open(), as per this question. However, I tried to combine that with pickling:
cPickle.dump( df.columns, os.open( "/dir/my_columns.pkl", os.O_WRONLY|os.O_CREAT|os.O_EXCL ),-1 )
But the second argument to pickle.dump needs to be an object with a write attribute, and os.open() returns a plain integer file descriptor, which has no such attribute. Is there an alternative that would allow me to safely save this object if multiple threads are racing to create it?
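For reference, here is a minimal sketch of one direction that might work: keep the os.open() call with O_EXCL so that only one job can create the file, then wrap the returned descriptor with os.fdopen() to get a file object that has a write method. The helper name dump_once is hypothetical, and the sketch assumes Python 3 (pickle and FileExistsError rather than cPickle and OSError).

```python
import os
import pickle


def dump_once(obj, path):
    """Pickle obj to path only if path does not exist yet.

    Returns True if this caller created the file, False if it already existed.
    (dump_once is a hypothetical helper name, not a library function.)
    """
    try:
        # O_CREAT | O_EXCL makes creation atomic: only one caller succeeds,
        # every other caller gets FileExistsError.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    # os.fdopen wraps the integer descriptor in a file object with .write(),
    # which is what pickle.dump expects.
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(obj, fh, pickle.HIGHEST_PROTOCOL)
    return True
```

Each job could then call something like dump_once(df.columns, "/dir/my_columns.pkl") and ignore the return value. One caveat with this sketch: the file exists as soon as os.open() succeeds, so a reader that opens it before the winning job has finished writing could see a partial pickle.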