
I am currently working on a private project that is going to use Google's GTFS spec to get information about hundreds of public transit agencies: their routes, stations, times, and other related information. I will be getting my information from here and from the Google Code wiki page with similar info. There is a lot of data, and it's partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80 to 100 MB of data.
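To give a sense of scale, here is a minimal Node.js sketch of streaming one of these files line by line instead of loading 80-100 MB into memory. The file name and columns assume the standard GTFS stops.txt layout, and the naive comma split is only for illustration; a real importer should use a proper CSV parser that handles quoted fields.

    const fs = require('fs');
    const readline = require('readline');

    const rl = readline.createInterface({
      input: fs.createReadStream('stops.txt'), // standard GTFS file name
      crlfDelay: Infinity,
    });

    let header = null;
    rl.on('line', (line) => {
      const fields = line.split(','); // naive split; breaks on quoted commas
      if (!header) {
        header = fields; // e.g. stop_id,stop_name,stop_lat,stop_lon,...
        return;
      }
      // Build a { column_name: value } record for this stop.
      const stop = {};
      header.forEach((col, i) => { stop[col] = fields[i]; });
      // ...hand `stop` off to whichever database ends up being chosen.
    });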

With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.

My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.

I am currently leaning towards MongoDB or CouchDB for their geospatial support, which can really optimize getting small datasets. But I also need to be sure to link all the stops on a route, because I will be propagating information along a transit route for that line. For this I have found that I could benefit from a graph DB like Neo4j or OrientDB, but from what I know, neither has geospatial support, nor am I 100% sure that a graph DB is what I need.

The perfect solution might not exist, but I come here asking for help finding the best possible fit for my situation. I know I will possibly have to work around limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.

It has also been suggested that I split the data into multiple DBs, but that could get very messy, because all the information is very tightly interconnected through IDs.

Any help would be appreciated.

Daegalus
  • Why are you discounting the possibility of using relational databases, especially if "the information is very tightly interconnected"? – JeremyP Oct 03 '11 at 09:54
  • Because of the way the data is set up, a normal relational database would require a lot of queries just to get a set of data. Plus, I have been leaning towards Neo4j because a graph DB seems to fit my problem nicely. – Daegalus Oct 06 '11 at 08:00

3 Answers


Obviously a graph database fits your problem 100%. My advice here is to go for a geospatial module on top of Neo4j or OrientDB, although there are other free and open-source implementations as well.

I think the best one right now, with all the geospatial functionality implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
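For example, here is a minimal sketch of that do-it-yourself option in plain JavaScript: a haversine great-circle distance used to filter stops near a GPS fix. The sample stops and the 500 m radius are made up for illustration.

    // Haversine great-circle distance between two lat/lon points, in km.
    function haversineKm(lat1, lon1, lat2, lon2) {
      const toRad = (deg) => (deg * Math.PI) / 180;
      const R = 6371; // mean Earth radius in km
      const dLat = toRad(lat2 - lat1);
      const dLon = toRad(lon2 - lon1);
      const a =
        Math.sin(dLat / 2) ** 2 +
        Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
      return 2 * R * Math.asin(Math.sqrt(a));
    }

    // Illustrative data; in practice these come from the stops file/store.
    const stops = [
      { stop_id: 'S1', stop_lat: 37.7793, stop_lon: -122.4193 },
      { stop_id: 'S2', stop_lat: 37.8044, stop_lon: -122.2712 },
    ];
    const userLat = 37.7790, userLon = -122.4200;

    // All stops within 500 m of the user's position.
    const nearby = stops.filter(
      (s) => haversineKm(userLat, userLon, s.stop_lat, s.stop_lon) <= 0.5
    );
    console.log(nearby);

This scan is O(n) per query, which is exactly the work a spatial index such as neo4j-spatial's saves you.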

By the way, regarding splitting: if the amount of data/queries will be high, I strongly recommend you share the load and design your model in those terms. You can surely make something work.

purbon
  • I have come to the same conclusion and have decided to use Neo4j. I did not know about their spatial plugin, but after it was pointed out to me, Neo4j is clearly what I need, so I am going with that. I am not sure which answer to accept, because Martin's answer is equally acceptable and might be a viable approach for my project. – Daegalus Oct 18 '11 at 06:18

I've used Mongo's geospatial features and can offer some guidance if you need help with a C# or JavaScript implementation - I would recommend it to start with, because it's super easy to use. I'm learning all about Neo4j right now, and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross-reference the documents in Mongo with the nodes in Neo4j using the Mongo object id.
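To illustrate the Mongo side, here is a hedged sketch using the official mongodb Node driver: index the stop locations and pull the stops closest to a GPS fix. The database, collection, and field names ("transit", "stops", "loc") are assumptions for the example, not something from the question.

    const { MongoClient } = require('mongodb');

    async function stopsNear(lat, lon, maxMeters) {
      const client = await MongoClient.connect('mongodb://localhost:27017');
      try {
        const stops = client.db('transit').collection('stops');
        // One-time setup: `loc` holds GeoJSON points, covered by a 2dsphere index.
        await stops.createIndex({ loc: '2dsphere' });
        return await stops
          .find({
            loc: {
              $near: {
                $geometry: { type: 'Point', coordinates: [lon, lat] },
                $maxDistance: maxMeters, // meters
              },
            },
          })
          .toArray();
      } finally {
        await client.close();
      }
    }

    // e.g. all stops within 500 m of a position in San Francisco.
    stopsNear(37.779, -122.42, 500).then((docs) => console.log(docs));

Note that GeoJSON coordinates are [longitude, latitude], in that order.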

For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), then by keeping that object id reference I can simply add some code instructing each node to go out and grab a list of movies from the related profile.
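A sketch of that cross-reference pattern, under invented names (the Person label, the FRIEND relationship, and the profiles collection are all assumptions): each Neo4j node carries only a mongoId property pointing at the heavy profile document in Mongo.

    const neo4j = require('neo4j-driver');
    const { MongoClient, ObjectId } = require('mongodb');

    async function moviesOfFriends(userId) {
      const driver = neo4j.driver('bolt://localhost:7687',
        neo4j.auth.basic('neo4j', 'password'));
      const mongo = await MongoClient.connect('mongodb://localhost:27017');
      const session = driver.session();
      try {
        // 1. The relationship query stays in the graph.
        const result = await session.run(
          'MATCH (me:Person {userId: $userId})-[:FRIEND]->(f:Person) ' +
          'RETURN f.mongoId AS mongoId',
          { userId }
        );
        const ids = result.records.map((r) => new ObjectId(r.get('mongoId')));
        // 2. Bulk-fetch the matching profile documents from Mongo.
        const profiles = await mongo.db('app').collection('profiles')
          .find({ _id: { $in: ids } }).toArray();
        return profiles.flatMap((p) => p.movies || []);
      } finally {
        await session.close();
        await driver.close();
        await mongo.close();
      }
    }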


Added 2012-02-12:

Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."

I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:

I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.

I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.

When I start moving entities from Mongo to SQL Server, the context is usually important. For instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that can be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but those are not relevant to the point). On the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL Server and build those reports as needed.
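As a concrete sketch of that nightly pre-aggregation (all collection and field names are invented for the example), one day's raw measurements collapse into a single report document that later reads can combine cheaply:

    const { MongoClient } = require('mongodb');

    async function buildDailyReport(day /* 'YYYY-MM-DD' */) {
      const client = await MongoClient.connect('mongodb://localhost:27017');
      try {
        const db = client.db('app');
        const start = new Date(`${day}T00:00:00Z`);
        const end = new Date(start.getTime() + 24 * 60 * 60 * 1000);
        // Collapse the day's raw measurements into one summary row.
        const [summary] = await db.collection('measurements').aggregate([
          { $match: { ts: { $gte: start, $lt: end } } },
          { $group: { _id: null, count: { $sum: 1 }, total: { $sum: '$value' } } },
        ]).toArray();
        // Upsert the daily report document, keyed by day.
        await db.collection('daily_reports').updateOne(
          { day },
          { $set: { day,
                    count: summary ? summary.count : 0,
                    total: summary ? summary.total : 0 } },
          { upsert: true }
        );
      } finally {
        await client.close();
      }
    }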

That said, and this deserves more intense thought, here are some considerations that may be helpful:

  • I generally try to store entities in a relational database if pulling an entity from the database (that is, in relational terms, querying the data required to generate an entity, or a list of entities, that fulfills the requested parameters) does not require significant processing (multiple joins, for instance)
  • Do you require ACID compliance? (Aside: if you have a graph problem, you can leverage Neo4j for this.) There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?

One use of Mongo I saw in the wild that I thought was worthy of mention: Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user-based customization in terms of targeting offers, advertising, etc.

Jordan
  • This is actually a very viable solution, and I might actually go this hybrid route. I have been backed up a bit with school, so I haven't had a chance to really set any of it up and start importing data, but this might be a good solution to use: MongoDB for the check-ins/social, Neo4j for the actual transit data and relationships. I will be doing this in JS, since it will be a web-based service with a Node.js backend and a bit in the front-end too. But I will be writing apps for WP7, iOS, and Android, so some C#, but that will all just be private REST API calls to get the data. – Daegalus Oct 18 '11 at 06:19
  • I am also working on porting the neo4js client that they have to a Node.js module. It seems simple enough to just change how the HTTP calls are made, using the Node HTTP library instead of Ajax. – Daegalus Oct 18 '11 at 06:24

NoSQL only exists because MySQL users assume that all databases suffer MySQL's performance problems when the database grows large and/or becomes complex.

I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.

http://postgis.refractions.net/
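For completeness, a hedged sketch of what the PostGIS route looks like from Node.js with the pg driver (table and column names are assumptions): ST_DWithin against a geography value finds all stops within a radius given in meters.

    const { Client } = require('pg');

    async function stopsNear(lat, lon, meters) {
      const client = new Client({ connectionString: 'postgres://localhost/transit' });
      await client.connect();
      try {
        const { rows } = await client.query(
          `SELECT stop_id, stop_name
             FROM stops
            WHERE ST_DWithin(
                    geom::geography,
                    ST_SetSRID(ST_MakePoint($1, $2), 4326)::geography,
                    $3)`,
          [lon, lat, meters] // PostGIS points are (longitude, latitude)
        );
        return rows;
      } finally {
        await client.end();
      }
    }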

Matt
  • I've worked with PostGIS. It's great too and feature-packed... It depends on what you need for the task at hand. – Jordan Nov 22 '11 at 01:46
  • Thank you for the suggestion. While I do know of PostGIS, and I recently found that I greatly enjoy using PostgreSQL over MySQL, I am preferring a NoSQL solution for other reasons as well. Per the CAP theorem, RDBMSes don't have great partitioning; you have to use replication and slaves, and it can get very messy. Most NoSQL stores sacrifice consistency for partition tolerance, while others sacrifice availability for it. I personally don't mind sacrificing a bit of consistency, because the data I store is not crucial to be exact, and usually the inconsistency is just a few seconds of discrepancy. – Daegalus Dec 01 '11 at 22:32