I started out with a deeply nested JSON file (a sample of one of the records can be found here) and was able to turn it into the result below:
TopicId,CallID,"BudgetTopicActionMap,12345","BudgetTopicActionMap,12348","BudgetTopicActionMap,12350","BudgetTopicActionMap,17659"
COMPET-09-2014,H2020-COMPET-2014,,"[{'action': 'COMPET-09-2014 - CSA Coordination and support action', 'plannedOpeningDate': '11 December 2013', 'deadlineModel': 'single-stage', 'deadlineDates': ['26 March 2014'], 'budgetYearMap': {'0': 53040000}, 'budgetTopicActionMap': {}}, {'action': 'COMPET-06-2014 - RIA Research and Innovation action', 'plannedOpeningDate': '11 December 2013', 'deadlineModel': 'single-stage', 'deadlineDates': ['26 March 2014'], 'budgetYearMap': {'0': 53040000}, 'budgetTopicActionMap': {}}, {'action': 'COMPET-05-2014 - CSA Coordination and support action', 'plannedOpeningDate': '11 December 2013', 'deadlineModel': 'single-stage', 'deadlineDates': ['26 March 2014'], 'budgetYearMap': {'0': 53040000}, 'budgetTopicActionMap': {}}]",,
COMPET-11-2014,H2020-COMPET-2014,,"[{'action': 'COMPET-02-2014 - RIA Research and Innovation action', 'plannedOpeningDate': '11 December 2013', 'deadlineModel': 'single-stage', 'deadlineDates': ['26 March 2014'], 'budgetYearMap': {'0': 53040000}, 'budgetTopicActionMap': {}}, {'action': 'COMPET-11-2014 - CSA Coordination and support action', 'plannedOpeningDate': '11 December 2013', 'deadlineModel': 'single-stage', 'deadlineDates': ['26 March 2014'], 'budgetYearMap': {'0': 53040000}, 'budgetTopicActionMap': {}}, {'action': 'COMPET-01-2014 - RIA Research and Innovation action', 'plannedOpeningDate': '11 December 2013', 'deadlineModel': 'single-stage', 'deadlineDates': ['26 March 2014'], 'budgetYearMap': {'0': 53040000}, 'budgetTopicActionMap': {}}]",,
COMPET-1-2016,H2020-COMPET-2016,,,"[{'action': 'COMPET-3-2016-a - IA Innovation action', 'plannedOpeningDate': '10 November 2015', 'deadlineModel': 'single-stage', 'deadlineDates': ['03 March 2016'], 'budgetYearMap': {'2016': 18000000}, 'budgetTopicActionMap': {}}]","[{'action': 'COMPET-2-2016 - RIA Research and Innovation action', 'plannedOpeningDate': '10 November 2015', 'deadlineModel': 'single-stage', 'deadlineDates': ['03 March 2016'], 'budgetYearMap': {'2016': 7000000}, 'budgetTopicActionMap': {}}]"
COMPET-2-2016,H2020-COMPET-2016,,,"[{'action': 'COMPET-3-2016-a - IA Innovation action', 'plannedOpeningDate': '10 November 2015', 'deadlineModel': 'single-stage', 'deadlineDates': ['03 March 2016'], 'budgetYearMap': {'2016': 18000000}, 'budgetTopicActionMap': {}}]","[{'action': 'COMPET-2-2016 - RIA Research and Innovation action', 'plannedOpeningDate': '10 November 2015', 'deadlineModel': 'single-stage', 'deadlineDates': ['03 March 2016'], 'budgetYearMap': {'2016': 7000000}, 'budgetTopicActionMap': {}}]"
I got to this point by dropping some columns and using:
df = pd.json_normalize(list, sep = ",")
The problem is that this created roughly 200 "BudgetTopicActionMap" columns which, for each record, either contain a list of dictionaries or are empty. I have not been able to flatten these further: I need to create a new column for every key, but a single cell often holds multiple dictionaries with the same keys. In addition, some records have multiple non-empty BudgetTopicActionMap columns, which would require a melt operation of some sort to turn the data into long format instead (and I guess I'd need a melt/explode for the repeated keys per cell as well).
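Conceptually, I have something like the sketch below in mind, but I haven't managed to make it work on the real data (here `wide` stands for the dataframe shown at the top, the column names come from that sample, and I'm assuming every non-empty "BudgetTopicActionMap,…" cell holds a list of dictionaries, stored as strings if read back from CSV):

import ast
import pandas as pd

# melt the ~200 "BudgetTopicActionMap,<id>" columns into long format
value_cols = [c for c in wide.columns if c.startswith("BudgetTopicActionMap")]
long = (wide.melt(id_vars=["TopicId", "CallID"], value_vars=value_cols,
                  value_name="actions")
            .drop(columns="variable")
            .dropna(subset=["actions"]))  # keep only the non-empty cells

# parse the stringified lists (only needed if the data was round-tripped through CSV)
long["actions"] = long["actions"].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# one row per dictionary in each list, then flatten the dictionaries into columns
long = long.explode("actions", ignore_index=True)
flat = pd.concat([long.drop(columns="actions"),
                  pd.json_normalize(long["actions"].tolist())], axis=1)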
This would be my desired result:
identifier,callIdentifier,action,plannedOpeningDate,deadlineModel,deadlineDates,budgetYearMap,budgetTopicActionMap
COMPET-09-2014,H2020-COMPET-2014,COMPET-09-2014 - CSA Coordination and support action,2013-12-11,single-stage,['26 March 2014'],{'0': 53040000},{}
COMPET-09-2014,H2020-COMPET-2014,COMPET-06-2014 - RIA Research and Innovation action,2013-12-11,single-stage,['26 March 2014'],{'0': 53040000},{}
COMPET-09-2014,H2020-COMPET-2014,COMPET-05-2014 - CSA Coordination and support action,2013-12-11,single-stage,['26 March 2014'],{'0': 53040000},{}
COMPET-11-2014,H2020-COMPET-2014,COMPET-02-2014 - RIA Research and Innovation action,2013-12-11,single-stage,['26 March 2014'],{'0': 53040000},{}
COMPET-11-2014,H2020-COMPET-2014,COMPET-11-2014 - CSA Coordination and support action,2013-12-11,single-stage,['26 March 2014'],{'0': 53040000},{}
COMPET-11-2014,H2020-COMPET-2014,COMPET-01-2014 - RIA Research and Innovation action,2013-12-11,single-stage,['26 March 2014'],{'0': 53040000},{}
COMPET-1-2016,H2020-COMPET-2016,COMPET-3-2016-a - IA Innovation action,2015-11-10,single-stage,['03 March 2016'],{'2016': 18000000},{}
COMPET-1-2016,H2020-COMPET-2016,COMPET-2-2016 - RIA Research and Innovation action,2015-11-10,single-stage,['03 March 2016'],{'2016': 7000000},{}
COMPET-2-2016,H2020-COMPET-2016,COMPET-3-2016-a - IA Innovation action,2015-11-10,single-stage,['03 March 2016'],{'2016': 18000000},{}
COMPET-2-2016,H2020-COMPET-2016,COMPET-2-2016 - RIA Research and Innovation action,2015-11-10,single-stage,['03 March 2016'],{'2016': 7000000},{}
The dictionaries are flattened into their own columns (based on their keys), and where a cell contains multiple dictionaries with the same keys, additional rows are created. When, for a given identifier and callIdentifier, there are multiple non-empty BudgetTopicActionMap columns, each of those columns gets its own rows as well.
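As a minimal illustration of that row expansion (toy data, not my real columns):

import pandas as pd

toy = pd.DataFrame({
    "identifier": ["T1"],
    "col_a": [[{"action": "a1"}, {"action": "a2"}]],  # two dicts in one cell
    "col_b": [[{"action": "b1"}]],                    # second non-empty column
})

long = (toy.melt(id_vars="identifier", value_name="actions")
           .drop(columns="variable")
           .explode("actions", ignore_index=True))
flat = pd.concat([long.drop(columns="actions"),
                  pd.json_normalize(long["actions"].tolist())], axis=1)
# flat has three rows (a1, a2, b1), all keeping identifier "T1"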
I have tried a lot of approaches mentioned here on SO (like this one or this one), but somehow I can't seem to combine them to get to the desired result: either my dataframe does not change at all, or it changes for the worse. It seems I need to both flatten and melt, and I found several answers for that, but they were all geared toward one or two columns rather than 200, or did not take the repeated keys per cell into account.
Any ideas on how to approach this?