
I'm trying to create a Flask web app that can query a NetworkX graph and return the shortest paths (using Dijkstra's algorithm, plus Yen's algorithm for the k shortest paths).

Here is my sample code (stripped of error handling etc., and only showing the Dijkstra implementation; a sketch of the k-shortest-paths endpoint follows the code), which returns the path if you submit a GET request to localhost:3000/d?source=a&target=b:

import csv
import json
import networkx as nx
from flask import Flask, request

app = Flask(__name__)
G = nx.Graph()  # module-level graph, shared by all requests

@app.route('/d')
def dijkstra():

  # G is only read here, so no global declaration is needed
  source = request.args['source'].lower().strip()
  target = request.args['target'].lower().strip()

  path = nx.dijkstra_path(G, source, target, weight='weight')

  return json.dumps(path)


if __name__ == '__main__':

  # text mode with newline='' is what the csv module expects on Python 3
  with open('nodes.csv', newline='') as f:
    reader = csv.reader(f)
    node_list = list(reader)

  for row in node_list:
    G.add_node(row[0])

  with open('edges.csv', newline='') as f:
    reader = csv.reader(f)
    edge_list = list(reader)

  for row in edge_list:
    # cast the weight to float so Dijkstra compares numbers, not strings
    G.add_edge(row[0], row[1], weight=float(row[2]))

  app.run(host='0.0.0.0', port=3000, use_reloader=False)
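
For reference, here is a minimal sketch of the k-shortest-paths endpoint I have in mind. It relies on networkx.shortest_simple_paths, which the NetworkX docs describe as based on Yen's algorithm; the /y route and the default for the k parameter are just placeholders:

from itertools import islice

@app.route('/y')
def yen():

  source = request.args['source'].lower().strip()
  target = request.args['target'].lower().strip()
  k = int(request.args.get('k', 3))  # placeholder default of 3 paths

  # shortest_simple_paths returns a generator ordered from the shortest
  # path upwards, so islice stops the search after the first k paths
  paths = list(islice(nx.shortest_simple_paths(G, source, target, weight='weight'), k))

  return json.dumps(paths)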

The graph is very large (one million nodes, twenty million edges), so I am currently populating it from the CSVs as the app loads, rather than rebuilding it for every request. This takes about five minutes when run locally on my MacBook Pro (3GHz, 16GB RAM). Lookup times vary, but are generally around fifteen seconds, and performance degrades after a certain amount of use, meaning I have to restart the app.

Question one: as this is slow and pretty memory-intensive, is there a way to store this graph so that I don't have to generate it on every startup and then hold it all in memory?

Question two: am I approaching it the right way, by using G as a global variable?

I haven't worked with graphs this large before, so I am aware my implementation is probably far from ideal, and I would be grateful for any thoughts on how to make this perform better/more reliably!

Chris
  • PS here are sample timings from a version of the script that logs each stage, so you can see where the time is being spent: **3.6** seconds from start to load 1,014,896 nodes; 99.0 seconds from start (∴ **95.4** seconds) to load 20,025,902 edges; 101.3 seconds from start (∴ **5.9** seconds) to add nodes to graph; 330.9 seconds from start (∴ **229.6** seconds) to add edges to graph. – Chris May 11 '18 at 14:37

1 Answer


Did you try using any of the off-the-shelf caches available in the market?

Create a background script which updates these caches or pickles, so that your request handler only loads and sends the data inside them.

EDIT: As the OP has also found, NetworkX has a built-in pickle method: https://networkx.github.io/documentation/networkx-1.9.1/reference/readwrite.gpickle.html
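
A rough sketch of that split, assuming a NetworkX version that ships write_gpickle/read_gpickle (per the docs linked above); build_graph.py and graph.gpickle are placeholder names:

# build_graph.py -- run offline or in the background, not per request
import csv
import networkx as nx

G = nx.Graph()

with open('edges.csv', newline='') as f:
  for row in csv.reader(f):
    G.add_edge(row[0], row[1], weight=float(row[2]))

nx.write_gpickle(G, 'graph.gpickle')

Then the Flask app replaces all the CSV parsing with a single load at startup:

G = nx.read_gpickle('graph.gpickle')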

To compare the speed of pickling against dealing with CSV directly, check this link: https://stackoverflow.com/a/37012035/6198978

EDIT 2: When you pickle the entire data structure in one go, you are limited by system RAM. You can, however, do it in chunks.

streaming-pickle can be a solution for pickling objects larger than the memory on board. Problems can occur while unpickling, but you can use this approach to overcome that as well; see the sketch below.
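
A rough sketch of the chunked idea using only the standard pickle module (not the streaming-pickle package itself): repeated pickle.dump calls append frames to one file, and repeated pickle.load calls read them back one at a time. The chunk size is a placeholder to tune against your RAM:

import pickle
import networkx as nx
from itertools import islice

CHUNK = 1000000  # placeholder: edges per chunk

def dump_edges(G, path):
  with open(path, 'wb') as f:
    edges = iter(G.edges(data='weight'))  # (u, v, weight) tuples
    while True:
      chunk = list(islice(edges, CHUNK))
      if not chunk:
        break
      pickle.dump(chunk, f)  # one pickle frame per chunk

def load_edges(path):
  G = nx.Graph()
  with open(path, 'rb') as f:
    while True:
      try:
        chunk = pickle.load(f)  # reads only the next frame
      except EOFError:
        break
      G.add_weighted_edges_from(chunk)
  return G

This avoids building one giant pickle in memory (the likely cause of the Killed: 9 mentioned in the comments), although the assembled graph itself still has to fit in RAM.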

Or, with the Redis approach: if you want the data to be held after the request is finished, you can look at Redis Cluster.
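
As an illustration, a hypothetical adjacency layout in plain Redis hashes (using redis-py; the adj: key prefix and connection details are made up), so that neighbour lookups happen on demand instead of holding the whole graph in the app's memory:

import redis

r = redis.Redis(host='localhost', port=6379)  # placeholder connection

def add_edge(u, v, w):
  # undirected edge: record the weight under both endpoints' hashes
  r.hset('adj:' + u, v, w)
  r.hset('adj:' + v, u, w)

def neighbors(u):
  # {neighbour: weight} for a single node, fetched on demand
  return {k.decode(): float(v) for k, v in r.hgetall('adj:' + u).items()}

Note that NetworkX's algorithms expect an in-memory graph, so you would also need a shortest-path routine that walks this structure; it is a trade-off, not a drop-in replacement.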

This is one hell of a big question :D It depends on what you actually need.

NoorJafri
  • Thanks for the quick reply! Looks like NetworkX has a function for reading/writing pickles (https://networkx.github.io/documentation/networkx-1.9.1/reference/readwrite.gpickle.html). Is that what you meant? Will this not create the same memory problems when the whole graph has to be loaded? – Chris May 11 '18 at 13:41
  • Wunderbar, if it has the pickle write built in :D. You can compare the response times here: https://stackoverflow.com/a/37012035/6198978. Pickles and HDF5 are much faster than CSV. – NoorJafri May 11 '18 at 13:50
  • Trying it out now—will let you know. Thanks again! :) – Chris May 11 '18 at 14:16
  • No luck :( it spent about 20 minutes trying to create the pickle, then I got a `Killed: 9` message. Any other ideas? – Chris May 11 '18 at 14:43
  • StackOverflow or OutOfMem? What error does it show? – NoorJafri May 11 '18 at 14:48
  • Literally just `Killed: 9`, which is a SIGKILL that "cannot be caught or ignored" (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html). My guess would be out of memory :( – Chris May 11 '18 at 14:50
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/170863/discussion-between-noor-ali-jafri-and-chris). – NoorJafri May 11 '18 at 14:53
  • Summary: this was a huge help for reducing the initial load time, as `networkx.read_gpickle()` loads the graph in about 75 seconds, vs the current average of 350. So thank you very much indeed! I still have an outstanding question about whether my approach is correct or not, though: holding a huge graph like this in memory makes things run pretty slowly, and performance degrades after a while. Is there a better way to work with such a large graph? – Chris May 11 '18 at 15:51
  • Thank you! You have definitely answered question one for me. I now have a pickle that loads in a quarter of the time. Next up: what do you think about the approach of loading the whole graph into memory? Is this the only way to go, or is there a better way that will preserve continuous performance? – Chris May 11 '18 at 19:48
  • Yup, check the edit; I have mentioned the streaming pickle there. – NoorJafri May 11 '18 at 23:02
  • Finally got a solution that works pretty well (still needs some optimisation, but definitely "good enough" for now), thanks to your very kind help. Thank you so much! :D – Chris May 14 '18 at 22:14
  • Anytime ;) I love such problems – NoorJafri May 14 '18 at 22:16