
I have been doing some investigation to find a package to install and use for geospatial analytics.

The closest I got was https://github.com/harsha2010/magellan - this, however, has only a Scala interface and no documentation on how to use it with Python.

I was hoping someone knows of a package I can use?

What I am trying to do is analyse Uber's data, map it to the actual postcodes/suburbs, and run it through SGD to predict the number of trips to a particular suburb.

There is already lots of info here - http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/#comment-606532 - and I am looking for ways to do it in Python.

GreenThumb
  • Python has a google maps api that could probably get what you need (https://github.com/googlemaps/google-maps-services-python) – wgwz Oct 30 '15 at 00:48

2 Answers


In Python I'd take a look at GeoPandas. It provides a data structure called GeoDataFrame: it's a list of features, each one having a geometry and some optional attributes. You can join two GeoDataFrames together based on geometry intersection, and you can aggregate the numbers of rows (say, trips) within a single geometry (say, postcode).

  1. I'm not familiar with Uber's data, but I'd try to find a way to get it into a GeoPandas GeoDataFrame.
  2. Likewise postcodes can be downloaded from places like the U.S. Census, OpenStreetMap[1], etc, and coerced into a GeoDataFrame.
  3. Join #1 to #2 based on geometry intersection. You want a new GeoDataFrame with one row per Uber trip, but with the postcode attached to each. Another Stack Overflow post discusses how to do this, and it's currently harder than it ought to be.
  4. Aggregate this by postcode and count the trips in each. The code will look like joined_dataframe.groupby('postcode').count().
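The four steps above can be sketched in plain Python so the logic is visible (in real code, GeoPandas' sjoin and groupby do the heavy lifting against actual polygon boundaries; the postcode "shapes" and coordinates below are made up, with axis-aligned boxes standing in for real polygons):

```python
from collections import Counter

# Step 2: hypothetical postcode "geometries" as (xmin, ymin, xmax, ymax) boxes.
postcodes = {
    "2000": (0.0, 0.0, 1.0, 1.0),
    "2010": (1.1, 0.0, 2.0, 1.0),
}

# Step 1: hypothetical trips as (lon, lat) pickup points.
trips = [(0.2, 0.5), (0.8, 0.3), (1.5, 0.5)]

def locate(point, boxes):
    """Step 3: return the postcode whose box contains the point, else None."""
    x, y = point
    for code, (xmin, ymin, xmax, ymax) in boxes.items():
        if xmin <= x <= xmax and ymin <= y <= ymax:
            return code
    return None

# Step 4: aggregate trips per postcode -- the groupby('postcode').count() step.
counts = Counter(locate(p, postcodes) for p in trips)
print(counts)  # Counter({'2000': 2, '2010': 1})
```

This linear scan over every (trip, postcode) pair is exactly what gets slow at scale, which is why the spatial-index-backed join in GeoPandas (or Spark, below) matters.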

My fear for the above process is if you have hundreds of thousands of very complex trip geometries, it could take forever on one machine. The link you posted uses Spark and you may end up wanting to parallelize this after all. You can write Python against a Spark cluster(!) but I'm not the person to help you with this component.

Finally, for the prediction component (e.g. SGD), check out scikit-learn: it's a pretty fully featured machine learning package, with a dead simple API.
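For a feel of what the SGD part amounts to, here is a toy regression in plain Python; scikit-learn's SGDRegressor implements the same per-sample update rule with regularization, learning-rate schedules, and much more (the data here is invented and noiseless):

```python
import random

random.seed(0)

# Hypothetical training pairs following trips = 3 * feature + 1 exactly.
data = [(x / 10.0, 3 * (x / 10.0) + 1.0) for x in range(20)]

w, b, lr = 0.0, 0.0, 0.1   # weight, intercept, learning rate
for epoch in range(200):
    random.shuffle(data)
    for x, y in data:          # one gradient step per sample = "stochastic"
        err = (w * x + b) - y  # prediction error on this sample
        w -= lr * err * x      # squared-loss gradient w.r.t. w
        b -= lr * err          # squared-loss gradient w.r.t. b

print(round(w, 2), round(b, 2))  # converges to roughly 3.0 1.0
```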

[1]: There is a separate package called geopandas_osm that grabs OSM data and returns a GeoDataFrame: https://michelleful.github.io/code-blog/2015/04/27/osm-data/

Jeff G
  • Thanks @Jeff G, I'll take a look at GeoPandas. Uber's data has only lat and long (a point data structure). The Google Maps API is good, I think, but the drawback is that I need to be connected, and the cluster is not exposed to the internet, so I'll have to go with something offline like GeoPandas or Magellan. Yes, I have SGD ready with scikit-learn for this. If I get this working with Python, I can use hadoop-streaming, which will help with MapReduce (using the cluster). – GreenThumb Oct 30 '15 at 12:28
  • I have been trying to install geopandas for 3 hours, stuck at Fiona – GreenThumb Oct 30 '15 at 19:46
  • The Python Geospatial ecosystem relies on a lot of C libs and brings the possibility of conflicts with your system. Though not a 100% sure bet, I tend to have luck keeping my Python libs separate from system libs using Conda. This should let you install fiona/geopandas in an isolated environment for your project. – Jeff G Nov 01 '15 at 19:57

I realize this is an old question, but I want to build on Jeff G's answer.

If you arrive at this page looking for help putting together a suite of geospatial analytics tools in Python, I would highly recommend this tutorial.

https://geohackweek.github.io/vector

It really picks up steam in the 3rd section.

It shows how to integrate:

  1. GeoPandas
  2. PostGIS
  3. Folium
  4. rasterstats

Add in scikit-learn, NumPy, and SciPy and you can really accomplish a lot. You can grab information from this nDarray tutorial as well.

BDHudson