In Python I'd take a look at GeoPandas. It provides a data structure called a GeoDataFrame: a table of features, each with a geometry and some optional attributes. You can join two GeoDataFrames based on geometry intersection, and you can aggregate the number of rows (say, trips) within a single geometry (say, a postcode).
- I'm not familiar with Uber's data, but I'd try to find a way to get it into a GeoPandas GeoDataFrame.
- Likewise, postcodes can be downloaded from places like the U.S. Census, OpenStreetMap[1], etc., and coerced into a GeoDataFrame.
- Join the trips from the first step to the postcodes from the second, based on geometry intersection. You want a new GeoDataFrame with one row per Uber trip, but with the postcode attached to each. Another StackOverflow post discusses how to do this, and it's currently harder than it ought to be.
- Aggregate this by postcode and count the trips in each. The code will look like `joined_dataframe.groupby('postcode').count()`.
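To make the steps above concrete, here is a minimal sketch on toy data. Everything about the data is an assumption for illustration: the trips are reduced to single points, the two square "postcodes" and the column names `trip_id` and `postcode` are made up, and real Uber data will need its own loading step. The join uses `geopandas.sjoin`, whose default behaviour is to match on geometry intersection.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical toy data: three trips (as points) and two square "postcodes".
trips = gpd.GeoDataFrame(
    {"trip_id": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(0.2, 0.8), Point(1.5, 0.5)],
    crs="EPSG:4326",
)
postcodes = gpd.GeoDataFrame(
    {"postcode": ["10001", "10002"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

# Spatial join: one row per trip, with the intersecting postcode attached.
joined = gpd.sjoin(trips, postcodes, how="inner")

# Count the trips that fall in each postcode.
trips_per_postcode = joined.groupby("postcode").size()
print(trips_per_postcode)
```

I've used `.size()` here rather than `.count()` because it returns a single count per group instead of one count per column; either works for this purpose.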
My fear with the above process is that if you have hundreds of thousands of very complex trip geometries, it could take forever on one machine. The link you posted uses Spark, and you may end up wanting to parallelize this after all. You can write Python against a Spark cluster(!), but I'm not the person to help you with that component.
Finally, for the prediction component (e.g. SGD), check out scikit-learn: it's a pretty fully featured machine learning package, with a dead simple API.
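As a rough idea of what the scikit-learn side could look like, here is a sketch that fits an `SGDRegressor` to synthetic data. The features (hour of day, day of week) and the target (trip count) are assumptions I made up for illustration, not anything from the Uber data; in practice you'd build them from the aggregated counts. Features are standardized first because SGD is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: hour of day (0-23) and day of week (0-6).
X = np.column_stack(
    [rng.integers(0, 24, 200), rng.integers(0, 7, 200)]
).astype(float)

# Synthetic target standing in for trips per postcode, with some noise.
y = 5.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.5, 200)

# Scale the features, then fit a linear model with stochastic gradient descent.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X, y)

# Predict the trip count for 18:00 on day 4 of the week.
predicted = model.predict(np.array([[18.0, 4.0]]))
print(predicted)
```

The same `fit` / `predict` pattern applies to most other scikit-learn estimators, so swapping in a different model is usually a one-line change.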
[1]: There is a separate package called geopandas_osm that grabs OSM data and returns a GeoDataFrame: https://michelleful.github.io/code-blog/2015/04/27/osm-data/