
I have to generate dozens of .csv files with millions of rows and dozens of columns. I am currently doing a groupby on columns A and B and looping over the groups to write each one out dynamically with to_csv. Below is an example of what I am trying to do. Is there a faster technique? My actual dataframe takes more than 10 minutes to run, which is becoming quite painful, and a speed-up would be beneficial on several projects.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 3, size=(10000,3)), columns=list('ABC'))
%timeit for (a,b), x in df.groupby(['A', 'B']): x.to_csv(f'{a}_Invoice_{b}.csv', index=False)

Time elapsed:

45.2 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Also, I created a function, which had a similar time, but I wanted to post it so people can more easily adapt it for %timeit if an answer is more than one line of code.

import pandas as pd
import numpy as np

def generate_invoices(df):
    for (a,b), x in df.groupby(['A', 'B']):
        x.to_csv(f'{a}_Invoice_{b}.csv', index=False)
    return

df = pd.DataFrame(np.random.randint(0, 3, size=(10000,3)), columns=list('ABC'))
%timeit generate_invoices(df)
  • I think the main bottleneck here is writing to a huge number of files. – jezrael Apr 12 '20 at 07:00
  • Thanks for your input @jezrael. These all have to go to individual customers, so it is a requirement. – David Erickson Apr 12 '20 at 07:02
  • By the way, does `for (a,b), x in df.groupby(['A', 'B'], sort=False):` give better performance with the real data? – jezrael Apr 12 '20 at 07:03
  • I'll have to test on the other computer when I have it in front of me next week. I'll try that and let you know if it helped. – David Erickson Apr 12 '20 at 07:06
  • Tested with the sample and it is slower :( – jezrael Apr 12 '20 at 07:06
  • Just out of curiosity, why are you creating these individual files? And for semantics, is it any different if you say ```for a,b in df.groupby(['A','B']): b.to_csv bla bla```? – sammywemmy Apr 12 '20 at 07:09
  • From the real dataset, column A is the customer (e.g. Company A, Company B, Company C) and column B is the different offices of each customer (e.g. London, New York, San Francisco). Each customer needs a different .csv for each of their offices. The semantics of the code don't matter to me as long as the right data gets output faster. – David Erickson Apr 12 '20 at 07:16
  • If I am doing it in a way that is already optimal, then that is okay with me; I was just wondering if I was doing something very sub-optimal. If I save 10% or 20% of the time, it doesn't really matter. I was wondering if there was something fundamentally faster that could increase the performance 10x or 100x. – David Erickson Apr 12 '20 at 07:19
  • I suspect there might be a way to do this quickly with awk, though it's beyond my skill level. Since you aren't doing any aggregation in the `groupby`, it would be sufficient to just sort by `A` and `B`, then split into new files whenever either value changes, right? (A sketch of this sort-then-split idea is shown after the comments.) Maybe try the resident Unix boffins on here. – Josh Friedlander Apr 12 '20 at 07:25
  • @JoshFriedlander for this particular project, I am not doing any aggregations, but for others I am. Thanks, that is helpful. If it's easy for you, then please post what your suggestion would be as an answer. Otherwise, I'll have a play with it myself. – David Erickson Apr 12 '20 at 07:31
  • If you're aggregating, you can do it in Pandas and export to csv, then you're able to do the split in any language. I don't have any ideas but I posted a question [here](https://stackoverflow.com/questions/61171088/split-csv-file-thousands-of-times-based-on-groupby). – Josh Friedlander Apr 12 '20 at 12:02
  • @DavidErickson I think Alex's awk answer on that question will be close to what you want - just need to add headers afterwards. Any way you could time it on your data and let me know if it works/is faster? – Josh Friedlander Apr 12 '20 at 12:28
  • @JoshFriedlander I've added in another example that does a second pass, adding the header in. – Alex Apr 12 '20 at 12:40
  • Does this answer your question? [Split csv file thousands of times based on groupby](https://stackoverflow.com/questions/61171088/split-csv-file-thousands-of-times-based-on-groupby) – Alex Apr 12 '20 at 13:39
  • @Alex yeah, looks perfect. Waiting on OP for feedback – Josh Friedlander Apr 12 '20 at 14:05
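
Below is a minimal sketch of the sort-then-split idea Josh Friedlander suggests in the comments above, assuming the same toy dataframe and `{a}_Invoice_{b}.csv` filename pattern as in the question. It sorts once by the grouping keys and streams through the rows, opening a new file whenever the (A, B) pair changes; it has not been benchmarked against the real data, so treat it as a starting point rather than a proven speed-up.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 3, size=(10000, 3)), columns=list('ABC'))

# Sort once by the grouping keys, then stream through the rows and start a
# new file whenever the (A, B) pair changes. No aggregation happens, so the
# output rows are the same as those from the groupby loop in the question.
header = ','.join(df.columns) + '\n'
current_key = None
out = None
for row in df.sort_values(['A', 'B']).itertuples(index=False):
    key = (row.A, row.B)
    if key != current_key:
        if out is not None:
            out.close()
        out = open(f'{row.A}_Invoice_{row.B}.csv', 'w')
        out.write(header)
        current_key = key
    out.write(','.join(map(str, row)) + '\n')
if out is not None:
    out.close()

Whether this beats the groupby/to_csv loop depends on where the 10 minutes actually go; with millions of rows the Python-level loop may well lose to pandas' C-backed writer, in which case the awk approach from the linked question is the more promising direction.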

0 Answers