
I have an XML object that needs to be written to a file, and writing 10,000 records takes more than an hour to complete. I tried converting the column first with df_merge['xml'] = df_merge['xml'].astype(str), but the total time is still over an hour; the astype(str) call only adds more time. So whatever I try, to_csv takes more than an hour to complete. How can I write a large XML object to a file quickly? The 10,000 XMLs together are around 600 MB.

df_merge.to_csv(settings.OUTPUT_XML, encoding='utf-8', index=False,
                columns=['xml'])

Later I tried np.savetxt, which also takes a similar amount of time:

import numpy as np
np.savetxt('output_xml.txt', df_merge['xml'], encoding='utf-8', fmt="%s")

1 Answer


You might consider using serialization. A good library for that is joblib; pickle and other common serialization tools also work.

A good Stack Overflow post outlining the differences between them and when to use each is here.

In your case, you should be able to serialize your object in much less time, using the example code below:

# Import joblib's dump function
from joblib import dump

# For speed, leave compression at its default of 0
dump(df_merge, 'df_merge.joblib')

# For smaller file size, you can increase compression, though it will slow your write time
# dump(df_merge, 'df_merge.joblib', compress=9)

You can then use joblib to load the file, like so:

# Import joblib's load function
from joblib import load

# Note: if you used compress=n when dumping, loading will take longer
df_merge = load('df_merge.joblib')
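
If you'd rather not add the joblib dependency, a minimal sketch of the same idea using pandas' built-in pickle support (to_pickle / read_pickle) would look like this; note that, like joblib, it produces a binary serialized DataFrame rather than a plain text file, and the file path is just an example:

# Pickle-based alternative using pandas' built-in serialization
import pandas as pd

# Write the DataFrame to a pickle file (example path; no compression, for speed)
df_merge.to_pickle('df_merge.pkl')

# Read it back later
df_merge = pd.read_pickle('df_merge.pkl')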