I'm new to python and currently undertaking a workplace project where we're moving excel based analysis to python for future proofing and i have hit a bit of a hurdle i'm struggling to resolve.
i have data with 3 hierarchical levels, in the below format, where the 1st level is a key internal metric when analysis business performance, 2nd level is the member/client's gender and 3rd is the year.
Suggested code to generate input dataframe:
col = pd.MultiIndex.from_product([['A', 'B'], ['F', 'M', 'DNC'],[2016,2017]])
df = pd.DataFrame([[1,2,6,16,20,19,17,9,4,2,8,19]], columns=col)
table markdown formatting doesnt appear to have pulled through correctly
| A | B
----------------------------------------------------------------------------
| | F | M | DNC | F | M | DNC
----------------------------------------------------------------------------
| |2016 |2017 |2016 |2017 |2016 |2017 |2016 |2017 |2016 |2017 |2016 |2017
----------------------------------------------------------------------------
|Div |
| 0 | 1 | 2 | 6 | 16 | 20 | 19 | 17 | 9 | 4 | 2 | 8 | 10
i currently divisions (Div.) set as my index and is 59 entries in length. There are 6 different metrics and years span from 2010 to present. the data frame will be outputted into a xlsx file with each metric on its own tab and in turn this is to be used to generate line plots via seaborn.
i have the code every for the importing of the data once formatted and to generate the graphs.
i'm trying to add in an additional column into the 2nd level hierarchy to provide a total per year. columns i am struggling to add into the dataframe anotated by "*"
Again table markdown formatting challenge
| Metric | A |
------------------------------------------------------------
| Gender | F | M | DNC | *Total* |
------------------------------------------------------------
| Year | 2016 | 2017| 2016 | 2017| 2016 | 2017| *2016* | *2017*|
------------------------------------------------------------
| Div. | |
| 1 | 1 | 2 | 6 | 16 | 20 | 19 | *27* | *37* |
i have tried using get_level_values, .groupby(level =[0,2] and a few other avenues where i am able to display the values however i havent been able to append them into the original dataframe.
df.groupby(level=[0,2], axis = 1).sum():
| Metric | A |
---------------------
| Year | 2016 | 2017|
---------------------
| Div. | |
| 1 | 27 | 37 |
I've also tried flattening the columns into the format Metric.Gender.Year e.g. (A.F.2016). However that then causes an error with the output when writing to multi-tabs based on the Metric. I'm not to familiar with how this code works as to how it can be amended to split the data based on the substring. I am thinking of of running a list of the divisions through a for loop utilising str.startswith:
for x in div_list:
for group, data in df.groupby(df.columns.str.startswith(x), axis=1):
data.to_excel (pd.ExcelWriter(filepath), group, merge_cells = False)
writer.close()
however have yet to test the above (am suspecting there to be some error in the code above given my knowledge level of python)
Once i have the above added, i will also be normalising the results by client count, i have this information by division and year. and assigning these results to a different dataframe to then be used to generate the same graphs as the original quey.
Div. | Count.F.2016 | Count.M.2016 | Count.Total.2016 | .. |
---|---|---|---|---|
1 | 100 | 90 | 190 | .... |
i've only added the further two points (writing to excel and normalisation) for greater context of the query. however any advice you may have will be greatly appreciated.
I also fully suspect some of my workings/endeavours to not be the most pythonic and suspect my existing code has a lot of inefficiencies, however i'm keen to get the core work progressed so that the output can be used by key stakeholders while I then try to make the code more efficient, as this is a piece of work that is run semi-regularly (i have set up the script to import the underlying data and key parameters to allow for easy modification, (as few key parameters being hardcoded as possible)