
The use case of my project is to display each author's commits per day, along with the size of each commit. This is how I need to represent my data:

[image: desired table layout of per-author, per-day commits]

To get there, this is what I did:

timed_commits = commit_data.set_index('Date')
grouped = timed_commits.groupby(by=["Author"])
resampled = grouped.resample("D").agg(
    {"SHA": "size", "Insertion": "sum", "Deletion": "sum"}
)  # per day: total count of commits with total insertions and deletions

This gave me output like the one below:

[image: resampled output with Author and Date as the index]

Here Author and Date are the index, while SHA, Insertion and Deletion are the columns. Author and Date are the index because I want the per-day commits of each author, and I also want the size of each commit (through insertions).

From this object I was not able to produce the format below (or any other format, ideally one with field names for the author and date values) that would let me display the data in a table like the one in the image above:

{
    'author1': {
        '2017-10-18': {'SHA': 1, 'Insertion': 1.0, 'Deletion': 3.0},
        '2017-10-19': {'SHA': 2, 'Insertion': 1.0, 'Deletion': 3.0},
        '2017-10-20': {'SHA': 6, 'Insertion': 1.0, 'Deletion': 3.0},
        '2017-10-21': {'SHA': 9, 'Insertion': 1.0, 'Deletion': 3.0},
    },
    'author2': {
        '2017-10-18': {'SHA': 3, 'Insertion': 8.0, 'Deletion': 3.0},
        '2017-10-19': {'SHA': 19, 'Insertion': 10.0, 'Deletion': 3.0},
        '2017-10-20': {'SHA': 23, 'Insertion': 1.0, 'Deletion': 3.0},
        '2017-10-21': {'SHA': 44, 'Insertion': 1.0, 'Deletion': 3.0},
    }
}

I played with to_dict but could not get it to produce this structure.

This is the dataframe (the commit hash, i.e. SHA, is repeated once per file changed in that particular commit). It is taken from git logs:

    SHA        Timestamp                       Date                       Author        Insertion  Deletion  Churn  File path
1   cae635054  Sat Jun 26 14:51:23 2021 -0400  2021-06-26 18:51:23+00:00  Andrew Clark  31.0       0.0       31.0   packages/react-reconciler/src/__tests__/ReactI...
2   cae635054  Sat Jun 26 14:51:23 2021 -0400  2021-06-26 18:51:23+00:00  Andrew Clark  1.0        1.0       0.0    packages/react-test-renderer/src/ReactTestRend...
3   cae635054  Sat Jun 26 14:51:23 2021 -0400  2021-06-26 18:51:23+00:00  Andrew Clark  24.0       14.0      10.0   packages/react/src/ReactAct.js
5   e2453e200  Fri Jun 25 15:39:46 2021 -0400  2021-06-25 19:39:46+00:00  Andrew Clark  50.0       0.0       50.0   packages/react-reconciler/src/__tests__/ReactI...
7   73ffce1b6  Thu Jun 24 22:42:44 2021 -0400  2021-06-25 02:42:44+00:00  Brian Vaughn  4.0        5.0       -1.0   packages/react-devtools-shared/src/__tests__/F...
8   73ffce1b6  Thu Jun 24 22:42:44 2021 -0400  2021-06-25 02:42:44+00:00  Brian Vaughn  4.0        4.0       0.0    packages/react-devtools-shared/src/__tests__/c...
9   73ffce1b6  Thu Jun 24 22:42:44 2021 -0400  2021-06-25 02:42:44+00:00  Brian Vaughn  12.0       12.0      0.0    packages/react-devtools-shared/src/__tests__/c...
10  73ffce1b6  Thu Jun 24 22:42:44 2021 -0400  2021-06-25 02:42:44+00:00  Brian Vaughn  7.0        6.0       1.0    packages/react-devtools-shared/src/__tests__/e...
11  73ffce1b6  Thu Jun 24 22:42:44 2021 -0400  2021-06-25 02:42:44+00:00  Brian Vaughn  47.0       42.0      5.0    packages/react-devtools-shared/src/__tests__/i...
12  73ffce1b6  Thu Jun 24 22:42:44 2021 -0400  2021-06-25 02:42:44+00:00  Brian Vaughn  7.0        6.0       1.0    packages/react-devtools-shared/src/__tests__/o...
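
For a reproducible example, here is a trimmed sample built from the first few rows above, together with the resampling step from earlier (the real frame is parsed from git logs, so the exact dtypes here are my best guess):

import pandas as pd

# Trimmed sample of commit_data, using values from the rows shown above
commit_data = pd.DataFrame({
    "SHA": ["cae635054", "cae635054", "cae635054", "e2453e200", "73ffce1b6"],
    "Date": pd.to_datetime([
        "2021-06-26 18:51:23+00:00",
        "2021-06-26 18:51:23+00:00",
        "2021-06-26 18:51:23+00:00",
        "2021-06-25 19:39:46+00:00",
        "2021-06-25 02:42:44+00:00",
    ]),
    "Author": ["Andrew Clark"] * 4 + ["Brian Vaughn"],
    "Insertion": [31.0, 1.0, 24.0, 50.0, 4.0],
    "Deletion": [0.0, 1.0, 14.0, 0.0, 5.0],
})

# Same steps as above: one row per (Author, day) with the commit count
# (size of the SHA column) and summed insertions/deletions
resampled = (
    commit_data.set_index("Date")
               .groupby("Author")
               .resample("D")
               .agg({"SHA": "size", "Insertion": "sum", "Deletion": "sum"})
)
print(resampled)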
  • can you add the code to create the sample df? use `df.head().to_dict()`. Additionally, check -> https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Nk03 Jul 14 '21 at 04:36
  • Do you mean to say `resampled.head().to_dict()` in my case? – milan Jul 14 '21 at 04:39
  • yup use the name of your dataframe. I guess it's `commit_data` – Nk03 Jul 14 '21 at 04:40
  • updated. Let me know if this is the expected one. – milan Jul 14 '21 at 04:44

1 Answer


Maybe I completely misunderstood.

I'm assuming that you want to transform a DataFrame df

                    SHA  Insertion  Deletion
Author  Date                                
author1 2017-10-18    1        1.0       3.0
        2017-10-19    2        1.0       3.0
        2017-10-20    6        1.0       3.0
        2017-10-21    9        1.0       3.0
author2 2017-10-18    3        8.0       3.0
        2017-10-19   19       10.0       3.0
        2017-10-20   23        1.0       3.0
        2017-10-21   44        1.0       3.0

into the dict format you have provided?

If so, then try this:

# For each author: drop the Author index level, then turn the remaining
# Date-indexed rows into a {date: {column: value}} mapping
result = {
    key: group.reset_index(level=0, drop=True).to_dict(orient='index')
    for key, group in df.groupby('Author')
}

or this

# Same idea via apply: one inner dict per author, collected into a Series,
# then converted into the outer {author: ...} dict
result = (df.groupby('Author')
            .apply(lambda sdf: sdf.reset_index(level=0, drop=True).to_dict(orient='index'))
            .to_dict())

Result for the sample:

{'author1': {'2017-10-18': {'Deletion': 3.0, 'Insertion': 1.0, 'SHA': 1},
             '2017-10-19': {'Deletion': 3.0, 'Insertion': 1.0, 'SHA': 2},
             '2017-10-20': {'Deletion': 3.0, 'Insertion': 1.0, 'SHA': 6},
             '2017-10-21': {'Deletion': 3.0, 'Insertion': 1.0, 'SHA': 9}},
 'author2': {'2017-10-18': {'Deletion': 3.0, 'Insertion': 8.0, 'SHA': 3},
             '2017-10-19': {'Deletion': 3.0, 'Insertion': 10.0, 'SHA': 19},
             '2017-10-20': {'Deletion': 3.0, 'Insertion': 1.0, 'SHA': 23},
             '2017-10-21': {'Deletion': 3.0, 'Insertion': 1.0, 'SHA': 44}}}
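
One caveat, based on my assumption that in your real data the Date level is an actual DatetimeIndex (which is what resample("D") produces): in that case the inner keys of to_dict(orient='index') come out as pandas Timestamp objects rather than the plain strings shown above. If you need string keys (e.g. for JSON), a variant like this should work:

result = {
    author: {
        date.strftime("%Y-%m-%d"): row  # Timestamp key -> '2017-10-18'
        for date, row in group.reset_index(level=0, drop=True)
                              .to_dict(orient="index")
                              .items()
    }
    for author, group in df.groupby("Author")
}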

EDIT: Another version used by @milan:

result = [
    {
        "author": key,
        # one record (dict of column values) per resampled day for this author
        "commit_activity": group.to_dict(orient="records"),
        # the Date part of each (Author, Date) MultiIndex entry
        "timestamp": [index[1] for index in list(group.index)]
    }
    for key, group in df.groupby("Author")
]

Result of this version would look like:

[
    {'author': 'Aaron Pettengill',
     'commit_activity': [{'SHA': 2,
                          'Insertion': 156.0,
                          'Deletion': 8.0,
                          'File path': 2}],
     'timestamp': [Timestamp('2020-05-01 00:00:00+0000', tz='UTC')]},
    {'author': 'Alex Rohleder',
     'commit_activity': [{'SHA': 5,
                          'Insertion': 5.0,
                          'Deletion': 5.0,
                          'File path': 5}],
     'timestamp': [Timestamp('2019-09-06 00:00:00+0000', tz='UTC')]},
    {'author': 'Alex Taylor',
     'commit_activity': [{'SHA': 2,
                          'Insertion': 30.0,
                          'Deletion': 3.0,
                          'File path': 2}],
     'timestamp': [Timestamp('2020-04-29 00:00:00+0000', tz='UTC')]}
]
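
If you need this list to be JSON-ready for the client, note that the Timestamp objects are not JSON-serializable out of the box. A minimal sketch of one way to convert them before serializing (assuming result is the list built in the snippet above):

import json

# Replace each Timestamp with its ISO-8601 string so json.dumps can handle it
payload = [
    {
        "author": entry["author"],
        "commit_activity": entry["commit_activity"],
        "timestamp": [ts.isoformat() for ts in entry["timestamp"]],
    }
    for entry in result
]
json_string = json.dumps(payload)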
  • Thank you very much for your help. This is exactly what I wanted. I want the response to be JSON-ready so that I can send it to the client for display in the tables. I will mark this as solved. Also, can you suggest whether this is a good approach for sending to the client? Any suggestion would be appreciated. – milan Jul 14 '21 at 12:51
  • @milan Thanks for the feedback. Regarding your question: unfortunately, that's not an area in which I have reliable expertise :( My gut feeling is that sending the JSON directly would be better. But as I said, this could be a misjudgement on my part. – Timus Jul 14 '21 at 13:35
  • I am using GraphQL for the API part, and it expects the field names ahead of time when sending to the client so that it can validate which fields will be present and of what types. I think I need to format this into an array. – milan Jul 15 '21 at 03:10
  • I am thinking of something like this `[ {author: 'author1', commit_activity:[[1, 1, 0], [1, 10, 0], [4, 40, 10]], date: [2016, 2017, 2018]}, {author: 'author2', commit_activity:[[1, 1, 0], [1, 10, 0], [4, 40, 10]], date: [2016, 2017, 2018]},` – milan Jul 15 '21 at 03:15
  • I did it in another way as well: `result = [ { "author": key, "commitInfo": group.to_dict(orient="records"), "timestamp": [index[1] for index in list(group.index)], } for key, group in work_logs.groupby("author") ]` If it's okay with you, you can add this case as well. Thank you again for your help. – milan Jul 15 '21 at 05:49
  • @milan Sure! I've added your solution to the answer. Feel free to edit it directly if I made a mistake or you want an adjustment. – Timus Jul 15 '21 at 06:37
  • Updated with the result. – milan Jul 15 '21 at 08:18