I have to generate dozens of .csv files, each with millions of rows and dozens of columns. I am currently grouping by columns A and B and looping over the groups, writing each one out with to_csv. Below is a minimal example of what I am doing. Is there a faster technique? On my actual dataframe this takes more than 10 minutes, which is becoming quite painful, and a faster approach would be useful on several projects.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 3, size=(10000,3)), columns=list('ABC'))
%timeit for (a,b), x in df.groupby(['A', 'B']): x.to_csv(f'{a}_Invoice_{b}.csv', index=False)
Time elapsed:
45.2 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Also, I wrapped the same loop in a function. It has similar timing, but I am posting it so that people can more easily adapt it for %timeit if an answer is more than one line of code.
import pandas as pd
import numpy as np
def generate_invoices(df):
    for (a, b), x in df.groupby(['A', 'B']):
        x.to_csv(f'{a}_Invoice_{b}.csv', index=False)
df = pd.DataFrame(np.random.randint(0, 3, size=(10000,3)), columns=list('ABC'))
%timeit generate_invoices(df)
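One idea I have considered, sketched below under the assumption that the per-group to_csv calls dominate the runtime: serialize the whole frame to CSV text once, then write each group's slice of pre-formatted lines. The function name generate_invoices_presplit is my own; groupby(...).indices gives the positional row indices for each group, which line up with the rows of the single to_csv output.

```python
import numpy as np
import pandas as pd


def generate_invoices_presplit(df, out_dir='.'):
    # Format the entire frame once instead of calling to_csv per group.
    lines = df.to_csv(index=False).splitlines()
    header, body = lines[0], lines[1:]
    # .indices maps each (A, B) key to the positional row indices of its group,
    # which match the order of the pre-formatted body lines.
    for (a, b), idx in df.groupby(['A', 'B']).indices.items():
        with open(f'{out_dir}/{a}_Invoice_{b}.csv', 'w') as f:
            f.write(header + '\n')
            f.write('\n'.join(body[i] for i in idx) + '\n')


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 3, size=(10000, 3)),
                      columns=list('ABC'))
    generate_invoices_presplit(df)
```

Whether this actually beats the plain loop would need to be confirmed with %timeit on the real data; it trades repeated CSV formatting overhead for Python-level line slicing, so the payoff likely grows with the number of groups.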