
This is the link to the GeoJSON. It is a FeatureCollection which I later convert into flat JSON; it has 66,153 records and is 174 MB in size.

https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson

I fetch the data using requests in Python.

After getting the response, I pass it to a function and try to load the data with json.loads(). This takes quite some time, and even then the process is killed by Manjaro. I have 12 GB of RAM.

import json
from shapely.geometry import shape  # shape() is used below, so this import is implied

def getJson(document):
    a = json.loads(document.text)    # <-- gets stuck here
    del document
    try:
        p = json.loads('[]')
        for i in a['features']:
            print('Detected GeoJson of type FeatureCollection')
            g = json.loads('{}')
            for key, value in i.items():
                if key == 'properties':
                    for k, v in value.items():
                        g.update({k:v})
                elif key != 'type':
                    try:
                        x = shape(value)
                        g.update({key:x.wkt})
                    except Exception:
                        g.update({key:value})

                p.append(g)
        return p
    except Exception:
        pass  # the original except clause is omitted here; it calls an unrelated function (see comments)
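
For context, the calling side is nothing more elaborate than a plain requests.get (as mentioned in the comments below):

import requests

URL = 'https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson'
response = requests.get(URL)
result = getJson(response)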
Potato
  • This might be helpful to you: https://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python – Harshana Nov 22 '20 at 14:17
  • Thanks, but for my use case I can't stream it; I have to process the whole block. I'm using this function in an API that converts GeoJSON to JSON. It works with small GeoJSON, but doesn't work in this case. – Potato Nov 22 '20 at 17:49
  • The issue seems to be somewhere else. I `wget` the file and your code loads 198459 features. Note: (1) you never closed the outside `try` (2) you can create a list with `[]` and a dict with `{}` - no need for `json.loads('{}')`. Can you also show us the http/loading code? – urban Nov 23 '20 at 11:20
  • Well, oh yes, sorry, I missed the except part because it contains another function, which would be irrelevant to this question. And I make a simple request, like requests.get(URL). – Potato Nov 24 '20 at 01:10

3 Answers


I am not sure how you load the document, but requests supports .json() on the response. The following code works for me:

import requests
from shapely.geometry import shape  # needed for the geometry-to-WKT conversion

URL = "https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson"

def getJson(a):
    p = []
    for i in a['features']:
        print('Detected GeoJson of type FeatureCollection')
        g = {}
        for key, value in i.items():
            if key == 'properties':
                # flatten the properties dict into the record
                for k, v in value.items():
                    g.update({k:v})
            elif key != 'type':
                try:
                    # convert the geometry to WKT where possible
                    x = shape(value)
                    g.update({key:x.wkt})
                except Exception:
                    g.update({key:value})

            p.append(g)  # note: still inside the key loop (see the note below the output)
    return p

# Load it as JSON
print("Downloading")
r = requests.get(URL)

# Process the response as JSON
print("Processing")
tmp = getJson(r.json())

print(f"Loaded {len(tmp)} features")

output:

...
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Loaded 198459 features
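
A side note on that count: 198,459 is exactly three times the 66,153 records in the dataset. Because p.append(g) sits inside the for key, value in i.items(): loop, each feature (which has the three keys type, properties and geometry) is appended once per key; moving the append out one level should give one record per feature.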
urban

Here's a solution that also downloads the data from that URL (streaming it to disk). If you then load the JSON file from disk, there shouldn't be a problem:

import json
import shutil

import requests


def store_geojson():
    url = 'https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson'
    local_filename = url.split('/')[-1]
    # stream the response body straight to disk instead of holding it in memory
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename


def load_geojson(path):
    # parse the saved file from disk in one go
    with open(path, 'r') as f:
        data = json.load(f)
    print(data.keys())


path = store_geojson()
load_geojson(path)

will print:

dict_keys(['type', 'name', 'crs', 'features'])

meaning you have access to the loaded JSON.
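
One caveat: copying r.raw bypasses requests' content decoding, so if the server serves the file gzip-compressed, the bytes written to disk stay compressed. A sketch of an alternative that writes decoded chunks via iter_content (the local filename here is made up):

import requests

url = 'https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson'
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('dataset.geojson', 'wb') as f:
        # iter_content() yields decoded (un-gzipped) chunks
        for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MiB at a time
            f.write(chunk)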

dh762
  • Oh great, thanks for the solution, I will definitely try it; shutil is also new to me! But aren't in-memory operations supposed to be faster than loading from a file? – Potato Nov 24 '20 at 01:16
  • Sure, but when developing you do not need to download the file over and over again, potentially risking a lockout or rate limit. So in dev it is faster. – dh762 Nov 24 '20 at 05:33

I would propose two solutions for the job:

  1. You may not want to use json.loads (or, really, the json module at all) for this job. Instead, look at something like pandas, which has a read_json function that loads JSON data into what's called a DataFrame. Kaggle is a good place to learn pandas.

  2. There is a library called ijson, which was created to address exactly this problem: it loads a large JSON dataset one part at a time (see the sketch after this list). There are more solutions for this problem in this answer, but I would prefer the first one because pandas is much more efficient at handling large data sets.
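
A minimal sketch of the ijson approach, assuming the dataset has already been saved to disk (the filename here is made up; see the streaming download in the other answer):

import ijson

count = 0
with open('features.geojson', 'rb') as f:  # assumed local copy of the dataset
    # 'features.item' matches each element of the top-level "features" array,
    # so only one feature needs to be in memory at a time
    for feature in ijson.items(f, 'features.item'):
        count += 1
print(f'Streamed {count} features')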

Lastly, you may want to study asynchronous programming: this is a disk-reading operation, so you could build solutions for this type of problem with it. It will also make you look good in front of your fellow developers, because it is a fairly advanced topic.

Hyperx837
  • Oh, alright! Got it. The response is already JSON, so I think I can use response.json() (as urban suggested). I will definitely try both of them, because I encounter fairly large datasets in my use case, sometimes millions of records. Thanks for the reply. Looking into asynchronous processing now :) – Potato Nov 24 '20 at 01:19