
I'm working on the Yelp Dataset Challenge. The data is made up of large JSON files (up to 1 GB, 1M+ lines). I'd like to do some data analytics on it, comparing data between files, e.g. linking a review in the review file with the business in the business file.

I have complete freedom as to what platform/programming language to use. What is the most efficient way to go about this, so I can do easy, fast lookups going forward?

The JSON format is very straightforward. Below is an example. Fields like "user_id" are unique and can be cross-referenced against entries in other files.

{"votes": {"funny": 0, "useful": 2, "cool": 1}, 
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg", 
"review_id": "15SdjuK7DmYqUAj6rjGowg", 
"stars": 5, "date": "2007-05-17", 
"text": "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
 "type": "review", 
 "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}
Adam_G
  • I know it doesn't help you to say this, but JSON is a really poor format for datasets that large. If the structure is simple, a CSV would make more sense; if complex, the plethora of tools for dealing with XML seem daft to ignore. – IMSoP Apr 27 '15 at 22:51
  • It's not my data. It's from Yelp. Will things go faster if I convert the files to csv's? – Adam_G Apr 27 '15 at 22:58
  • Yeah, bit of an unhelpful whinge, sorry. However, treating the files as though they're already CSVs might actually be a viable approach, depending on the exact format. It might be a good idea to edit an example of the format into the question, and be as specific as possible with what you want to achieve - open-ended questions tend to be poorly received here, because they don't fit the question & answer format well, and end up with extended discussion and no "final" answer. – IMSoP Apr 27 '15 at 23:11
  • good point. I've added an example. – Adam_G Apr 27 '15 at 23:18
  • If you need persistence I would check MongoDB. You can import JSON directly. There are drivers for Python, Node.js, Java, C++... I use PyMongo extensively and it works like a charm. – Nuno André Dec 29 '15 at 17:10
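For reference, here is a minimal sketch of the MongoDB route mentioned in the last comment, assuming pymongo is installed, a local mongod is running, the file is one JSON object per line, and the database/collection names are only illustrative:

    import json
    from pymongo import MongoClient

    db = MongoClient()["yelp"]  # connects to a local MongoDB instance

    with open("yelp_academic_dataset_review.json", encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == 10000:            # insert in batches to keep memory use bounded
                db.reviews.insert_many(batch)
                batch = []
        if batch:
            db.reviews.insert_many(batch)

    db.reviews.create_index("business_id")     # speeds up cross-file lookups by business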

1 Answer


Start by importing all the data into a database.

You have the option of completely flattening things into multiple tables (if there are "nested" objects in the JSON), or you could keep some parts as JSON, especially if you use a database that can parse and index it (like PostgreSQL).

The choice of database is entirely up to you. You could use a classic SQL database (PostgreSQL, MySQL, SQL Server, SQLite...), or you could use a document-oriented/NoSQL database such as MongoDB (which favours JSON-like data). It's all a matter of what you will be doing with the data (and what you're comfortable with).
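As an illustration, here is a minimal sketch of the SQLite route for the review file, assuming the data is newline-delimited JSON (one object per line, as in the example above); file, table and column names are only illustrative:

    import json
    import sqlite3

    conn = sqlite3.connect("yelp.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS reviews (
            review_id    TEXT PRIMARY KEY,
            user_id      TEXT,
            business_id  TEXT,
            stars        INTEGER,
            date         TEXT,
            text         TEXT,
            votes_funny  INTEGER,   -- the nested "votes" object flattened into columns
            votes_useful INTEGER,
            votes_cool   INTEGER
        )
    """)

    def rows(path):
        """Yield one flattened tuple per line of the reviews file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                r = json.loads(line)
                v = r.get("votes", {})
                yield (r["review_id"], r["user_id"], r["business_id"],
                       r["stars"], r["date"], r["text"],
                       v.get("funny"), v.get("useful"), v.get("cool"))

    conn.executemany(
        "INSERT OR REPLACE INTO reviews VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        rows("yelp_academic_dataset_review.json"),
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_reviews_business ON reviews(business_id)")
    conn.commit()

Loading the business file into a similar businesses table keyed on business_id then reduces the cross-file lookups from the question to indexed joins.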

You can then do whatever you like with the data...
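For example, continuing the SQLite sketch above and assuming a similar businesses table with a name column (again illustrative), linking a review to its business is a single indexed join:

    cur = conn.execute("""
        SELECT r.stars, r.date, b.name
        FROM reviews r
        JOIN businesses b ON b.business_id = r.business_id
        WHERE r.review_id = ?
    """, ("15SdjuK7DmYqUAj6rjGowg",))
    print(cur.fetchone())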

Note that if a single file is > 1 GB, you may have to use some custom tools to do the import, as loading everything at once into memory (through the usual JSON-decoding functions of your favourite language) will probably be a bit too much. Be careful, though: you still want to correctly parse all the data, so avoid simplistic splits or regexes. You may want to take a look at the solutions listed in this thread: Is there a streaming API for JSON?
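If a file turns out to be one huge JSON document (e.g. a single top-level array) rather than one object per line, a streaming parser avoids decoding everything at once. Here is a sketch using the third-party ijson library; process() is a hypothetical stand-in for whatever import step you use:

    import ijson  # third-party streaming JSON parser: pip install ijson

    with open("yelp_academic_dataset_review.json", "rb") as f:
        # "item" iterates over the elements of a top-level array
        # without materialising the whole document in memory
        for record in ijson.items(f, "item"):
            process(record)  # e.g. flatten and insert into the database as above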

jcaron
  • Just noticed this was an old question pushed to the top by an edit. The OP probably doesn't care anymore (especially as the challenge ends in 2 days now), but it might be helpful for others. – jcaron Dec 29 '15 at 17:11