6

I have a dataframe with LISTS(with dicts) as column values . My intention is to normalize entire column(all rows). I found way to normalize a single row . However, I'm unable to apply the same function for the entire dataframe or column.

data = {'COLUMN': [ [{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'}, {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae', }]}], [{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4,'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'}, {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae', }]}], [{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6,'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'}, {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae', }]}] ] }

source_df = pd.DataFrame(data)

source_df looks like below :

enter image description here

As per https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html I managed to get output for one row.

Code to apply for one row:

Target_df = json_normalize(source_df['COLUMN'][0], 'volumes', ['name','id','state','nodes'], record_prefix='volume_')

Output for above code :

enter image description here

I would like to know how we can achieve desired output for the entire column

Expected output:

enter image description here

EDIT: @lostCode , below is the input with nan and empty list

enter image description here

Tad
  • 811
  • 8
  • 16

2 Answers2

3

You can do:

Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index]).reset_index(drop=True)

Output:

    volume_state    volume_id   volume_name  name   id     state     nodes
0   available       330172      q_-4144d4e   WAG 01 105F    available   3
1   available       275192      p_3089d821ae WAG 01 105F    available   3
2   unavailable     830172      w_-4144d4e   FEC 01 382E    available   4
3   unavailable     223192      g_3089d821ae FEC 01 382E    available   4
4   unavailable     930172      e_-4144d4e   ASD 01 303F    available   6
5   unavailable     245192      h_3089d821ae ASD 01 303F    available   6

concat, is used to concatenate a dataframe list, in this case the list that is generated using json_normalize is concatenated on all rows of source_df

You can use to check type of source_df:

Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index if isinstance(source_df['COLUMN'][key],list)]).reset_index(drop=True)
ansev
  • 30,322
  • 5
  • 17
  • 31
  • Thanks a lot. it worked for this input.For input with 'nan' values I'm getting the below error "TypeError: list indices must be integers or slices, not str" – Tad Aug 30 '19 at 19:34
  • edited my question with new input..thanks for the reply!! – Tad Aug 30 '19 at 19:47
  • 1
    The error is generated when it is not a `list`, that is, with np.nan, so a condition is added to check if it is a list. If it is not, do not try to perform `json_normalize`, avoiding the error. I have uploaded this code at the end of my answer. Tell me if you can accept the answer – ansev Aug 30 '19 at 20:22
3

Target_df=source_df.apply(json_normalize)

PySeeker
  • 818
  • 8
  • 12