3

I am trying to loop through multiple excel files in pandas. The structure of the files are very much similar, the first 10 column forms a key and rest of the columns have the values. I want to group by first 10 columns and sum the rest.

I have searched and found solutions online for similar cases but my problem is that

  • I have large number of columns with values ( to be aggregate as sum) and

  • Number / names of columns(with values) is different for each file(dataframe)

    #Key columns are same across all the files.

I can't share the actual data sample but here is the format sample of the file structure

enter image description here

and here is the desired output from the above data

enter image description here

It is like a groupby operation but with uncertain large number of columns and header name makes it difficult to use groupby or pivot. Can Any one suggest me what is the best possible solution for it in python.

Edited:

df.groupby(list(df.columns[:11])).agg(sum)

is working but for some reason it is taking 25-30 mins. the same thing MS Access is done in 1-2 mins . Am I doing something wrong here or is there any other way to do it in python itself

LOKE2707
  • 312
  • 1
  • 5
  • 19
  • Why not `df.groupby([f'id_{i}' for i in range(1,11)]).agg(sum)`? Can you be a bit more precise about what difficulty you are having? – ALollz Feb 08 '20 at 04:46
  • what file is that? Excel, csv? You're not helping us helping you – deadvoid Feb 08 '20 at 04:54
  • @ALollz In the actual data set the column names are strings(names) and not integer so can't loop.. but thanks you gave me an idea I Can have a list of 1st 10 column headers which I can Loop. I'll try that and come back on it. – LOKE2707 Feb 08 '20 at 05:02
  • @deadvoid It's an .xlsx file imported as pandas dataframe – LOKE2707 Feb 08 '20 at 05:03
  • If having trouble getting no results with a large number of dataframes, this could be useful: https://stackoverflow.com/questions/62303505/group-by-returns-empty-dataframe-and-no-error – Josiah Yoder Aug 09 '22 at 16:03

1 Answers1

1

Just use df.columns which has the list of columns, you can then use a slice on that list to get the 10 leftmost columns.

This should work:

df.groupby(df.columns[:10].to_list()).sum()
filbranden
  • 8,522
  • 2
  • 16
  • 32