
This is the link to the GeoJSON. It is a FeatureCollection which I later convert into flat JSON; it has 66,153 records and is 174 MB in size.

https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson

I fetch the data using requests in Python.

After getting the response, I pass it to a function and try to load the data with json.loads(). This takes quite some time, and even then the process is killed by Manjaro. I have 12 GB of RAM.

import json
from shapely.geometry import shape  # shape() is used below, so this import is implied

def getJson(document):
    a = json.loads(document.text)    # <-- gets stuck here
    del document
    try:
        p = json.loads('[]')
        for i in a['features']:
            print('Detected GeoJson of type FeatureCollection')
            g = json.loads('{}')
            for key, value in i.items():
                if key == 'properties':
                    for k, v in value.items():
                        g.update({k:v})
                elif key != 'type':
                    try:
                        x = shape(value)
                        g.update({key:x.wkt})
                    except Exception:
                        g.update({key:value})

                p.append(g)
        return p
    except Exception:
        pass  # the original except clause is omitted here; it calls an unrelated function (see comments)
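
For context, the calling side is nothing more elaborate than a plain requests.get (as mentioned in the comments below):

import requests

URL = 'https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson'
response = requests.get(URL)
result = getJson(response)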
Potato
  • This might be helpful to you: https://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python – Harshana Nov 22 '20 at 14:17
  • Thanks, but for my use case I can't stream it; I have to process the whole block. I'm using this function in an API that converts GeoJSON to JSON. It works with small GeoJSON, but doesn't work in this case. – Potato Nov 22 '20 at 17:49
  • The issue seems to be somewhere else. I `wget` the file and your code loads 198459 features. Note: (1) you never closed the outside `try` (2) you can create a list with `[]` and a dict with `{}` - no need for `json.loads('{}')`. Can you also show us the http/loading code? – urban Nov 23 '20 at 11:20
  • Well, oh yes, sorry, I missed the except part because it contains another function, which would be irrelevant to this question. And I make a simple request, like requests.get(URL). – Potato Nov 24 '20 at 01:10

3 Answers


I am not sure how you load the document, but requests supports .json() on the response. The following code works for me:

import requests
from shapely.geometry import shape  # needed for the geometry-to-WKT conversion

URL = "https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson"

def getJson(a):
    p = []
    for i in a['features']:
        print('Detected GeoJson of type FeatureCollection')
        g = {}
        for key, value in i.items():
            if key == 'properties':
                # flatten the properties dict into the record
                for k, v in value.items():
                    g.update({k:v})
            elif key != 'type':
                try:
                    # convert the geometry to WKT where possible
                    x = shape(value)
                    g.update({key:x.wkt})
                except Exception:
                    g.update({key:value})

            p.append(g)  # note: still inside the key loop (see the note below the output)
    return p

# Load it as JSON
print("Downloading")
r = requests.get(URL)

# Process the response as JSON
print("Processing")
tmp = getJson(r.json())

print(f"Loaded {len(tmp)} features")

output:

...
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Detected GeoJson of type FeatureCollection
Loaded 198459 features
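
A side note on that count: 198,459 is exactly three times the 66,153 records in the dataset. Because p.append(g) sits inside the for key, value in i.items(): loop, each feature (which has the three keys type, properties and geometry) is appended once per key; moving the append out one level should give one record per feature.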
urban

Here's a solution that also downloads the data from that URL (streaming it to disk). If you then load the JSON file from disk, there shouldn't be a problem:

import json
import shutil

import requests


def store_geojson():
    url = 'https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson'
    local_filename = url.split('/')[-1]
    # stream the response body straight to disk instead of holding it in memory
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename


def load_geojson(path):
    # parse the saved file from disk in one go
    with open(path, 'r') as f:
        data = json.load(f)
    print(data.keys())


path = store_geojson()
load_geojson(path)

will print:

dict_keys(['type', 'name', 'crs', 'features'])

meaning you have access to the loaded JSON.
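
One caveat: copying r.raw bypasses requests' content decoding, so if the server serves the file gzip-compressed, the bytes written to disk stay compressed. A sketch of an alternative that writes decoded chunks via iter_content (the local filename here is made up):

import requests

url = 'https://opendata.arcgis.com/datasets/a779d051865f461eb2a1f50f10940ec4_161.geojson'
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('dataset.geojson', 'wb') as f:
        # iter_content() yields decoded (un-gzipped) chunks
        for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MiB at a time
            f.write(chunk)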

dh762
  • Oh great, thanks for the solution, I will definitely try it; shutil is also new to me! But aren't in-memory operations supposed to be faster than loading from a file? – Potato Nov 24 '20 at 01:16
  • Sure, but when developing you do not need to download the file over and over again, potentially risking a lockout or rate limit. So in dev it is faster. – dh762 Nov 24 '20 at 05:33

I would propose two solutions for the job:

  1. You may not want to use json.loads (or, really, the json module at all) for this job. Instead, look at something like pandas, which has a read_json function that loads JSON data into what's called a DataFrame. Kaggle is a good place to learn pandas.

  2. There is a library called ijson, which was created to address exactly this problem: it loads a large JSON dataset one part at a time (see the sketch after this list). There are more solutions for this problem in this answer, but I would prefer the first one because pandas is much more efficient at handling large data sets.
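
A minimal sketch of the ijson approach, assuming the dataset has already been saved to disk (the filename here is made up; see the streaming download in the other answer):

import ijson

count = 0
with open('features.geojson', 'rb') as f:  # assumed local copy of the dataset
    # 'features.item' matches each element of the top-level "features" array,
    # so only one feature needs to be in memory at a time
    for feature in ijson.items(f, 'features.item'):
        count += 1
print(f'Streamed {count} features')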

Lastly, you may want to study asynchronous programming: this is a disk-reading operation, so you could build solutions for this type of problem with it. It will also make you look good in front of your fellow developers, because it is a fairly advanced topic.

Hyperx837
  • Oh, alright! Got it. The response is already JSON, so I think I can use response.json() (as urban suggested). I will definitely try both of them, because I encounter fairly large datasets in my use case, sometimes millions of records. Thanks for the reply. Looking into asynchronous processing now :) – Potato Nov 24 '20 at 01:19