
I have a problem. I want to normalize a list of dicts with pd.json_normalize(...), but I get a MemoryError. Is there a way to work around this error? It works with pd.json_normalize(my_data[:2000000], sep="_"), but not with the complete data (2549150 entries).

I already looked at "MemoryError: Unable to allocate MiB for an array with shape and data type, when using anymodel.fit() in sklearn" and "Unable to allocate array with shape and data type".

my_data = [
{'_id': 'orders/213123',
 'contactEditor': {'name': 'Max Power',
  'phone': '1234567',
  'email': 'max@power.com'},
 'contactSoldToParty': {'name': 'Max Not',
  'phone': '123456789',
  'email': 'maxnot@power.com'},
 'isCompleteDelivery': False,
 'metaData': {'dataOriginSystem': 'Goods',
  'dataOriginWasCreatedTime': '10:12:12',},
 'orderDate': '2021-02-22',
 'orderDateBuyer': '2021-02-22',
},
{'_id': 'orders/12323',
 'contactEditor': {'name': 'Max Power2',
  'phone': '1234567',
  'email': 'max@power.com'},
 'contactSoldToParty': {'name': 'Max Not',
  'phone': '123456789',
  'email': 'maxnot@power.com'},
 'isCompleteDelivery': False,
 'metaData': {'dataOriginSystem': 'Goods',
  'dataOriginWasCreatedTime': '10:12:12',},
 'orderDate': '2021-02-22',
 'orderDateBuyer': '2021-02-22',
 },
]

df = pd.json_normalize(my_data, sep="_")
[OUT]
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11136/3519902863.py in <module>
----> 1 df= pd.json_normalize(my_data, sep='_')
MemoryError: Unable to allocate 1.88 GiB for an array with shape (2549150, 99) and data type object

What I want

id             contactEditor_name contactEditor_phone contactEditor_email ...
orders/213123  Max Power          ...                 ...                 ...
orders/12323   Max Power2         ...                 ...                 ...

len(my_data) is 2549150.

Test
  • kindly post the expected output dataframe. Also, what is the size (MB/GB) of the json file? – sammywemmy May 05 '22 at 07:22
  • @sammywemmy I added the expected output. And I can't tell the size of the JSON file, because I import it from an ArangoDB database. – Test May 05 '22 at 07:24
  • Does this answer your question? [Conver dict into dataframe with nested keys](https://stackoverflow.com/questions/72122866/conver-dict-into-dataframe-with-nested-keys) – Laurent May 07 '22 at 08:06
  • @Laurent the linked question is different. – Test May 09 '22 at 06:38
  • The error message tells you that pandas needs 1.88 GiB (~2.02 GB) of memory. Whichever approach you use to convert the JSON to a pd.DataFrame, this memory is required to store the result of the transformation, so you have almost no options to decrease this value. But it's actually not a lot. If you are doing this on a local machine, try restarting it and avoid running memory-hungry apps. If that doesn't work, you should do any further processing of the DataFrame chunk-wise. – Boris Silantev May 12 '22 at 01:34
  • Does the list of dicts come from a file, or is it constructed within your program? – BeRT2me May 14 '22 at 02:33

3 Answers


You can try this simple workaround.

split_index = 1500000

# normalize each half separately, then stitch the two results together
df = pd.concat([
    pd.json_normalize(my_data[:split_index], sep="_"),
    pd.json_normalize(my_data[split_index:], sep="_"),
], ignore_index=True)

If the memory restriction is very tight, you may need to split the data into three or more segments.
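
To generalize this to an arbitrary number of segments, a minimal sketch could look like the following (the chunk_size of 500,000 is an arbitrary assumption; lower it until the individual chunks fit into memory):

import pandas as pd

chunk_size = 500_000  # assumption: tune this to your available RAM
df = pd.concat(
    (
        # normalize one slice of the list at a time
        pd.json_normalize(my_data[i:i + chunk_size], sep="_")
        for i in range(0, len(my_data), chunk_size)
    ),
    ignore_index=True,
)

Note that the final concatenated DataFrame still has to fit in memory; splitting mainly avoids the large intermediate allocations inside json_normalize.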

ken

I have faced similar issues before. You should split the data, as you already mentioned in your question: normalize X rows first, then the next X, and so on, and finally combine the results into a single DataFrame.
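
If even the combined DataFrame is too large to keep in memory at once, one variation of this idea (a sketch, not part of the answer above) is to persist each normalized chunk to disk and work with the files afterwards. The chunk size and file names below are assumptions, and to_parquet requires pyarrow or fastparquet to be installed:

import pandas as pd

chunk_size = 500_000  # assumption: tune to your available RAM
for part, start in enumerate(range(0, len(my_data), chunk_size)):
    chunk_df = pd.json_normalize(my_data[start:start + chunk_size], sep="_")
    # write each chunk out so it does not have to stay in memory
    chunk_df.to_parquet(f"orders_part_{part}.parquet", index=False)
    del chunk_df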

Haakh

As @ken pointed out, splitting into smaller sections may solve the problem.

However, in the example data that you posted, there are around 12 columns, not 99. Could it be that one of the entries is corrupted and messing up your parsing?

What do your loaded 2M rows look like if you call df.info()?
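
For reference, one way to inspect that is shown below; memory_usage="deep" makes info() count the Python string objects as well, and df_partial is just an illustrative name for the 2M-row frame that did load:

df_partial = pd.json_normalize(my_data[:2000000], sep="_")
df_partial.info(memory_usage="deep")  # column dtypes, non-null counts, real memory footprint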

You could also do a kind of "lazy loading" using max_level=1 and then expand items later when needed. It may help if the problem comes from many nested levels.
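
A rough sketch of that lazy-loading idea follows. Note that with the sample data, which has only one level of nesting, max_level=0 is the setting that actually leaves the nested dicts unexpanded; expanding contactEditor afterwards is just an example:

import pandas as pd

# keep nested dicts as single object columns for now
df = pd.json_normalize(my_data, sep="_", max_level=0)

# expand only the columns you actually need, when you need them
editor = pd.json_normalize(df["contactEditor"].tolist()).add_prefix("contactEditor_")
df = df.drop(columns=["contactEditor"]).join(editor)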

Zaero Divide