
Recently I had an interview task where I had to read a huge CSV file, aggregate some columns, and write the result to a new CSV file. I wrote the code in pandas using chunks, but they said the method was not good and that I needed to use chunks (which I did). Now I am confused about what the problem with my method is:

df = pd.read_csv(file, usecols=['Department Name', 'Number of sales'], chunksize=100)
pieces = [x.groupby('Department Name')['Number of sales'].agg(['sum']) for x in df]
result = pd.concat(pieces).groupby(level=0).sum().rename(columns={'sum': 'Total Number of Sale'})
result.to_csv('output.csv')
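
For reference, here is a self-contained version of the same code on a tiny made-up CSV, so it can be run end to end (the department names and sales numbers are invented, just for illustration):

import pandas as pd

# Tiny invented sample file; the real data is of course much larger.
pd.DataFrame({
    'Department Name': ['Toys', 'Toys', 'Books', 'Books', 'Books'],
    'Number of sales': [10, 5, 3, 7, 2],
}).to_csv('sales_sample.csv', index=False)

reader = pd.read_csv('sales_sample.csv', usecols=['Department Name', 'Number of sales'], chunksize=2)
pieces = [chunk.groupby('Department Name')['Number of sales'].agg(['sum']) for chunk in reader]
result = pd.concat(pieces).groupby(level=0).sum().rename(columns={'sum': 'Total Number of Sale'})
print(result)  # Books -> 12, Toys -> 15
result.to_csv('output.csv')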
  • Nothing wrong with this code. Maybe they wanted you to evaluate the `chunksize` parameter (i.e. maybe increase it a bit) and tune it for optimal performance. – khan Nov 08 '21 at 12:23
  • [read huge csv file](https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file) - it depends on the problem, but here I can say: with a chunk size of 100 and a file of, say, 1 million rows, you end up doing about 10,000 read-and-aggregate passes, which makes the approach less efficient. Increasing the chunk size, or reading the file line by line and building the result as you go, might help (see the sketch below). As an interviewee, I think you should also ask about the constraints, e.g. how much memory (RAM) is available and whether there is a time limit. – sahasrara62 Nov 08 '21 at 12:26
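
A minimal sketch of what the second comment describes: a much larger chunksize plus a running per-department total, so only one small Series is kept in memory instead of a list of per-chunk results. It assumes the same `file` path and column names as in the question; the chunksize of 100_000 is just an example and would need to be tuned to the available RAM.

import pandas as pd

totals = pd.Series(dtype='float64')  # running total per department

reader = pd.read_csv(file, usecols=['Department Name', 'Number of sales'], chunksize=100_000)
for chunk in reader:
    chunk_sums = chunk.groupby('Department Name')['Number of sales'].sum()
    # Align on department name; departments not seen yet count as 0.
    totals = totals.add(chunk_sums, fill_value=0)

# The alignment step makes the totals float; cast with .astype(int) if integer output is required.
totals.rename('Total Number of Sale').rename_axis('Department Name').to_csv('output.csv')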

0 Answers