
I'm working on a vehicle routing problem, and by far the most time-consuming part is generating a distance/time matrix over a real road network. For this project I use a localhost GraphHopper server, which I used to query for pair-wise distances and times between locations; everything was straightforward to parse. It returns individual driving instructions along with a total journey distance and time. But for 2000 locations, that would be nearly 4 million calls.

On a hunch, I wondered whether it would be faster to instead ask it to plan a single route that goes via 100 locations at a time, then break the JSON response down into 100 blocks as though I'd made 100 pair-wise calls. It turns out to be nearly 3 times faster for 100 locations. Apart from the small headache of keeping track of which parts of my matrix are built, the response gives no obvious structural boundary between the sub-routes; the only marker is an instruction telling you that you are at a "Stopover", e.g.

{
    "hints": {
        "visited_nodes.average": "243.0",
        "visited_nodes.sum": "3402"
    },
    "paths": [{
        "instructions": [{
            "distance": 315.637,
            "sign": 0,
            "interval": [0, 7],
            "text": "Continue",
            "time": 29848
        }, {
            "exit_number": 3,
            "distance": 1460.234,
            "sign": 6,
            "turn_angle": -0.01,
            "interval": [7, 27],
            "text": "At roundabout, take exit 3",
            "time": 96167
        }, {
            "distance": 1258.763,
            "sign": 0,
            "interval": [27, 30],
            "text": "Continue onto Sheikh Zayed bin Sultan Street",
            "time": 50350
        }, {
            "distance": 116.7,
            "sign": 0,
            "interval": [30, 31],
            "text": "Continue",
            "time": 4668
        }, {
            "distance": 3543.556,
            "sign": 0,
            "interval": [31, 57],
            "text": "Continue onto Sheikh Zayed bin Sultan Street",
            "time": 144812
        }, {
            "distance": 0,
            "sign": 5,
            "interval": [87, 88],
            "text": "Stopover 1",
            "time": 0
        },
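
For context, here is a minimal sketch of the kind of multi-point request that produces a response like the one above. It assumes a local GraphHopper server on its default port (8989) and the standard /route endpoint with repeated point parameters; adjust to your own setup.

import requests

def route_via(points):
    # points: a list of (lat, lon) tuples, visited in order.
    # Each point becomes a repeated 'point' query parameter.
    params = [('point', '%f,%f' % (lat, lon)) for lat, lon in points]
    params.append(('instructions', 'true'))
    resp = requests.get('http://localhost:8989/route', params=params)
    resp.raise_for_status()
    return resp.json()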

My current working prototype keeps summing the individual distances (metres) of each instruction, but uses regex to look for "Stopover" before starting to record distances afresh:

import re

distance_matrix = []
sub_distance = []
for unit in response['paths'][0]['instructions']:
    if bool(re.search('Stopover', unit['text'])) == False:
        sub_distance.append(float(unit['distance'])/1000)
    else:
        distance_matrix.append(sum(sub_distance))
        sub_distance = []

My background is not in programming at all, and I've been self-taught in Python for less than a year. Whilst this works, I'm wondering if I'm missing a more natural (and faster) way to do this, but I can't find anything related via Google because I'm not quite sure what to search for. At our current scale the distance/time matrix can already take ~8 minutes to compute, so I'm open to any suggestions that shave off time. There appears to be a limit on the number of points in a route, so this function will still be called many thousands of times, and any saving will add up. Thanks in advance.

roganjosh

3 Answers

2

Is there a reason why you store the list of sub distances? If all you are doing is calculating a sum, it seems to me it would be faster to do something like this:

distance_matrix = []
sub_distance = 0.0
for unit in response['paths'][0]['instructions']:
    if not unit['text'].startswith('Stopover'):
        sub_distance += unit['distance']
    else:
        distance_matrix.append(sub_distance / 1000.0)
        sub_distance = 0.0

A small performance test:

import re
import timeit
import random

# Creates a list of 1,000,000 items with random distances in [0, 1).
# On average, roughly 1 in 11 elements is a stopover (randint(0, 10) is inclusive).
data = [{"distance": random.random(),
         "text": "Stopover" if random.randint(0, 10) == 0 else "ASDF"}
        for i in range(1000000)]

def test1():
  distance_matrix = []
  sub_distance = 0.0
  for unit in data:
    if not unit['text'].startswith('Stopover'):
      sub_distance += unit['distance']
    else:
      distance_matrix.append(sub_distance / 1000.0)
      sub_distance = 0.0

  return distance_matrix

def test2():
  distance_matrix = []
  sub_distance = []
  for unit in data:
    if bool(re.search('Stopover', unit['text'])) == False:
      sub_distance.append(float(unit['distance'])/1000)
    else:
      distance_matrix.append(sum(sub_distance))
      sub_distance = []

  return distance_matrix

print "Test 1: %.2f s for 10 runs" % timeit.timeit(test1, number=10)
print "Test 2: %.2f s for 10 runs" % timeit.timeit(test2, number=10)

Yields the following on my machine:

$ python test.py
Test 1: 3.50 s for 10 runs
Test 2: 16.67 s for 10 runs

As an aside, I should mention that a pre-compiled regular expression is almost as fast as using startswith(), which could come in handy at some other time.

Pre-compile outside the loop like this:

stop = re.compile("Stopover")

And use in the loop like this:

if not stop.search(unit['text']):
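
If you want to verify that claim on your own machine, here is a sketch of a third test in the same style as the benchmark above, reusing its data list and imports:

# Compile the pattern once, outside the loop.
stop = re.compile("Stopover")

def test3():
  distance_matrix = []
  sub_distance = 0.0
  for unit in data:
    if not stop.search(unit['text']):
      sub_distance += unit['distance']
    else:
      distance_matrix.append(sub_distance / 1000.0)
      sub_distance = 0.0

  return distance_matrix

print("Test 3: %.2f s for 10 runs" % timeit.timeit(test3, number=10))
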
Henrik Gustafsson
  • That is a good observation, thanks. Indeed, I think I added the `sub_distance` list so I could concentrate on how I was going to split the list. Before I test: do you know why this gives a performance boost? – roganjosh Feb 17 '16 at 18:30
  • 1
    Lists are expensive to create relative to updating a value, and you will be creating one new list for each stopover. Also, the extra iteration over the list when doing the sum is not without cost. Finally I moved the division into the else statement to avoid doing divisions on each iteration that were not stopovers (divisions are not as expensive as creating a list, but significantly more expensive than a simple addition) – Henrik Gustafsson Feb 17 '16 at 18:31
  • Really great feedback, it is much appreciated. I have upvoted both the answer and the comment, but I feel the answer from Daniel addresses the title of the thread, so I should accept that to close the question, as it deals with breaking up the JSON string. – roganjosh Feb 17 '16 at 18:48
  • Certainly! I added a performance test showing the improvement of the accumulation method over appending to a list and summing. – Henrik Gustafsson Feb 17 '16 at 18:52
  • The peripheral info in this post is truly fantastic. I am so grateful, seriously. – roganjosh Feb 17 '16 at 19:35
  • 1
    No problem :) I'm curious what the performance impact is when running on your data, so if you chose to go this route and you have the time, please share the updated numbers once you have incorporated my changes – Henrik Gustafsson Feb 17 '16 at 20:06
  • For sure I will. I'm not at my computer currently so it may take up to a day or two to get you accurate benchmarks. It's great to have someone who will elaborate on their point around the main question; sometimes it can seem that some people don't appreciate there are people thrown right in at the deep end and might be missing other fundamental parts :) The least I can do is give quantitative feedback on your methods. – roganjosh Feb 17 '16 at 20:15
1

There's not much point using regex here as you are not using any of its features. It's much simpler, and probably will be faster, to use startswith here.

if unit['text'].startswith('Stopover'):

Note that you shouldn't compare booleans with False explicitly; just do if <condition>:.
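
Dropped into the loop from the question, those two changes look like this (a sketch; everything else stays the same):

distance_matrix = []
sub_distance = []
for unit in response['paths'][0]['instructions']:
    if not unit['text'].startswith('Stopover'):
        sub_distance.append(float(unit['distance'])/1000)
    else:
        distance_matrix.append(sum(sub_distance))
        sub_distance = []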

Daniel Roseman
  • Many thanks for the feedback. I just tested with 5 runs of each loop (to get feedback quickly). The average time for my setup was 97600ms; yours gave 92800ms. I'll do some further tests and leave this open for a bit, just in case, before accepting :) – roganjosh Feb 17 '16 at 18:17
1

I agree with Daniel: you don't need regex for what you want to do.

If your instruction starts with the text you are searching for (e.g. Stopover), use the .startswith method.

If the text you are looking for is not at the start of the string, though, you can use unit['text'].find('Stopover'), which, like the regex, matches anywhere in the string (it returns -1 when nothing is found). To simplify the search further, use:

distance_matrix = []
sub_distance = []
for unit in response['paths'][0]['instructions']:
    if 'Stopover' in unit['text']:
        distance_matrix.append(sum(sub_distance))
        sub_distance = []
    else:
        sub_distance.append(float(unit['distance'])/1000)

String searching can't get much faster than this, though; you won't see any major speedup from this part of the procedure.
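
If you're curious about the relative costs anyway, here's a quick sketch with timeit (the exact numbers will vary with your machine and the strings involved):

import timeit

setup = "s = 'Continue onto Sheikh Zayed bin Sultan Street'"
print(timeit.timeit("'Stopover' in s", setup=setup))
print(timeit.timeit("s.find('Stopover') != -1", setup=setup))
print(timeit.timeit("s.startswith('Stopover')", setup=setup))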

Edit: I just found this question, which explains some performance considerations for string searching. Check it out.

Greg K.
  • Thanks for the elaboration. I'm not sure of the relevance of the linked answer, because I'm looking to split a JSON string multiple times in a complete way, so I already know it contains the things I want to split by... but splitting on `in` would leave me with invalid JSON? – roganjosh Feb 17 '16 at 18:42
  • Not at all, I'm just proposing an alternative method of searching for the instruction within the JSON struct. I'll update my code to make it clearer. The JSON is not manipulated at all. The question I linked just contains more information about gaining some performance on the string search. – Greg K. Feb 17 '16 at 18:52
  • My apologies, I misunderstood the intended application of the link to another answer. The benchmarks there are very interesting. Upvoted. I thought `re` would be slower, but not to that extent! – roganjosh Feb 17 '16 at 20:42