I'm working on the Yelp Dataset Challenge. The data is made up of large JSON files (up to 1 GB, 1M+ lines each). I'd like to do some data analytics on it that compares data across files, e.g. linking a review in the review file with its business in the business file.
I have complete freedom as to which platform or programming language to use. What is the most efficient way to go about this, so that I can do fast, easy lookups going forward?
The JSON format is very straightforward. Below is an example record. Fields like "user_id" are unique identifiers and can be cross-referenced with entries in the other files.
{"votes": {"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5, "date": "2007-05-17",
"text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}
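To make the kind of lookup I have in mind concrete, here is a rough Python sketch of one possible approach: load each file into SQLite once, index the ID fields, and join on "business_id". It is only a sketch under assumptions. It assumes each file holds one JSON object per line, and that business records carry a "name" field next to "business_id" (I have not shown a business record above, so that field is a guess). The function names, the table layout, and the "Example Clinic" record are made up for illustration.

```python
import json
import sqlite3


def iter_records(lines):
    """Yield one dict per non-empty line (assumes one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def load_reviews(conn, lines):
    """Stream review records into a table, then index the join column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS review ("
        "review_id TEXT PRIMARY KEY, user_id TEXT, business_id TEXT, "
        "stars INTEGER, date TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO review VALUES (?, ?, ?, ?, ?, ?)",
        (
            (r["review_id"], r["user_id"], r["business_id"],
             r["stars"], r["date"], r["text"])
            for r in iter_records(lines)
        ),
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_review_business ON review (business_id)"
    )
    conn.commit()


def load_businesses(conn, lines):
    """Stream business records; the "name" field is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS business ("
        "business_id TEXT PRIMARY KEY, name TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO business VALUES (?, ?)",
        ((b["business_id"], b.get("name")) for b in iter_records(lines)),
    )
    conn.commit()


def reviews_with_business(conn, business_id):
    """Return (business name, review_id, stars) for every review of a business."""
    return conn.execute(
        "SELECT b.name, r.review_id, r.stars FROM review r "
        "JOIN business b ON b.business_id = r.business_id "
        "WHERE r.business_id = ?",
        (business_id,),
    ).fetchall()


# Small in-memory demo; the review is the example above with its text shortened.
conn = sqlite3.connect(":memory:")
sample_reviews = [
    '{"votes": {"funny": 0, "useful": 2, "cool": 1}, '
    '"user_id": "Xqd0DzHaiyRqVH3WRG7hzg", '
    '"review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, '
    '"date": "2007-05-17", "text": "(shortened)", "type": "review", '
    '"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}'
]
# Hypothetical business record, invented for this demo.
sample_businesses = [
    '{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "name": "Example Clinic"}'
]
load_reviews(conn, sample_reviews)
load_businesses(conn, sample_businesses)
result = reviews_with_business(conn, "vcNAWiLM4dR7D2nwwJ7nCA")
```

With the real data, the same loaders could take an open file handle instead of a list, since a file object iterates line by line, and the connection could point at an on-disk database file so the import only has to run once. Whether this is actually the most efficient option is exactly what I am unsure about.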