0

I have a geoJSON file, which contains a breakdown of a certain geographical area into ca. 7000 cells. I'd like to a) open this geoJSON b) modify some data (see code bellow) and c) write this modified geoJSON to the disk. Now, my problem is, that since there's a lot of cells, this takes almost a minute. Do you see any way to improve the speed of this function? Thank you!

def writeGeoJSON(param1, param2, inputdf):
    with open('ingeo.geojson') as f:
        data = json.load(f)
    for feature in data['features']: 
        currentfeature = inputdf[(inputdf['SId']==feature['properties']['cellId']) & (inputdf['param1']==param1) & (inputdf['param2']==param2)]
        if (len(currentfeature) > 0):
            feature['properties'].update({"style": {"opacity": currentfeature.Opacity.item()}})
        else:
            feature['properties'].update({"style": {"opacity": 0}})
    end = time.time()
    with open('outgeo.geojson', 'w') as outfile:
        json.dump(data, outfile)
Cœur
  • 37,241
  • 25
  • 195
  • 267
lte__
  • 7,175
  • 25
  • 74
  • 131
  • 1
    Maybe calculate `inputdf[(inputdf['param1']==param1) & (inputdf['param2']==param2)]` outside of the loop. – Alex Hall Feb 28 '17 at 10:37
  • First, [profile](http://stackoverflow.com/a/582337/3005167) your function to find out if the bottleneck is in i/o or processing. Anything else is just fishing in the dark – MB-F Feb 28 '17 at 10:43
  • @Alex Excellent! I've mentioned the same point in my answer. I noticed it the very minute the question was posted... I've taken 9 minutes to answer this question from my mobile! xD – varun Feb 28 '17 at 10:55
  • 1
    Did you try to store this `{"style": {"opacity": 0}}` in a constant ? Not sure if you will gain some perf with that, but if Python has to recreate each time this object, you can definetely be faster. Also, `feature['properties']` store this one into a variable could help. But again, I'm not sure. Python might access to the value faster than the variable creation, plus the access. Just make some tests. – romain-aga Feb 28 '17 at 11:42
  • @romain-aga I think you cannot store `{"style": {"opacity": 0}}` in a constant, since that's not a variable, but rather you're telling the program which fields of the JSON to update. – lte__ Feb 28 '17 at 11:49
  • @romain-aga So the only constant there is the 0. – lte__ Feb 28 '17 at 11:50
  • @lte__ `{ ... }` is a dictionary, it's an object. You can store it as a constant – romain-aga Feb 28 '17 at 13:16

2 Answers2

2

There is a serial code optimization possible in your code. You have the line:

currentfeature = inputdf[(inputdf['SId']==feature['properties']['cellId']) & (inputdf['param1']==param1) & (inputdf['param2']==param2

Notice that the last two checks can be put outside the for loop. It is a redundant check which takes up many CPU clock cycles for each iteration in the for loop!!! You can modify the same as:

paramMatch=inputdf['param1']==param1 & inputdf['param2']==param2
for feature in data['features']: 
    currentfeature = inputdf[(inputdf['SId']==feature['properties']['cellId']) & paramMatch]

That must make your program run much faster!

That said, if you need better execution times(most probably not necessary), try using the multiprocessing module to parallelize the processing part of the code. You can try to split the work load in the for loop.

Try using apply_async or map_async to a block of iterations to speed things up!

varun
  • 2,027
  • 1
  • 15
  • 17
  • Thanks, that cut running time by 60%! I'll try to make it run in parallel, too. :) – lte__ Feb 28 '17 at 11:31
  • Awesome! Make sure you avoid the `Threading` module and use the `multiprocessing` module if you're on Cpython (the default Python distribution). Otherwise, that can be an option too! – varun Feb 28 '17 at 11:41
  • 1
    Yeah I've been working with `multiprocessing` before in a few uni assignments, so I'm familiar :) As much as I know, `threading` doesn't do real parallel computing bc of the GIL. :) – lte__ Feb 28 '17 at 11:46
  • Perfectly put, m8! All the best! :D – varun Feb 28 '17 at 11:50
1

[In addition to @varun optimization, and including a @romain-aga suggestion.]

Add this at the beginning of the function:

zero_style = {"opacity": 0}

And change the conditional to become:

if (len(currentfeature) > 0):
    feature['properties']['style'] = {"opacity": currentfeature.Opacity.item()}
else:
    feature['properties']['style'] = zero_style

I have the impression that knowing more about inputdf type would lead to better optimization (maybe direct if currentfeature: is enough? maybe is better?)


Assuming CPython, I expect this to be better (better ask for forgiveness than for permission):

try:
    value = {"opacity": currentfeature.Opacity.item()}
except NotSureWhatExceptionMaybeAttributeError:
    value = zero_style
feature['properties']['style'] = value
MariusSiuram
  • 3,380
  • 1
  • 21
  • 40
  • +1 for that detailed observation. That's an interesting insight into the Python dictionary creation overhead in each iteration. I'll experiment and check if it makes a difference. I think it will make it a bit faster! :) – varun Mar 06 '17 at 16:27