The files I will be getting have at least a million rows each, up to a maximum of 1.5 billion. The data is normalized when I receive it, and I need a way to store it as one document per member. For the most part I am not 100% sure how the data will be given to me; it could be CSV, fixed-width text, TSV, or something else.

Currently I have some collections that I imported from sample CSVs.

Below is a small representation of my data, with some fields omitted.

In my beneficiaries.csv the data is repeated:

beneficiaries.csv (over 6 million records)

record # 1
{"userid":"a9dk4kJkj",
 "gender":"male",
 "dob":20080514,
 "start_date":20000101,
 "end_date":20080227}

record # 2
{"userid":"a9dk4kJkj",
 "gender":"male",
 "dob":20080514,
 "start_date":20080201,
 "end_date":00000000}

Same user, different start and end dates.

claims.csv (over 200 million records)

{"userid":"a9dk4kJkj",
     "date":20080514,
     "code":"d4rd3",
     "blah":"data"}

lab.csv (over 10 million records)

{"userid":"a9dk4kJkj",
     "date":20080514,
     "lab":"mri",
     "blah":"data"}

From my limited knowledge I have three options:

  1. Sort the files, read X members at a time from the data files into our C++ Member objects, stop at Y, insert those members into MongoDB, then continue from Y for the next X members until done. This is tested and working, but sorting such massive files will tie up our machine for hours.

  2. Load the data into SQL, read it one row at a time into C++ Member objects, and bulk load the data into Mongo. Tested and works, but I would very much like to avoid this.

  3. Load the documents into Mongo in separate collections and perform a map-reduce with the out parameter set to write to a collection. I have the documents loaded (as shown above) in their own collections, one per file. Unfortunately I am new to Mongo and on a deadline; the map-reduce concept is difficult for me to wrap my head around and implement. I have read the docs and tried using this answer on Stack Overflow: MongoDB: Combine data from multiple collections into one..how?

The output member collection should look like this.

{"userid":"aaa4444",
 "gender":"female",
 "dob":19901225,
 "beneficiaries":[{"start_date":20000101,
                  "end_date":20080227},
                  {"start_date":20008101,
                  "end_date":00000000}],
"claims":[{"date":20080514,
         "code":"d4rd3",
         "blah":"data"},
        {"date":20080514,
         "code":"d4rd3",
         "blah":"data"}],
"labs":[{"date":20080514,
         "lab":"mri",
         "blah":"data"}]}

Would the performance of loading the data into SQL, reading it in C++, and inserting it into MongoDB beat the map-reduce approach? If so, I will stick with that method.


1 Answer


IMHO, your data is a good candidate for map-reduce, so it would be better to go with option 3: load the documents into Mongo in three separate collections (beneficiaries, claims, labs) and perform a map-reduce on the userid key on each collection. Finally, integrate the data from the three result collections into a single collection using find and insert on the userid key.
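
If the files do arrive as CSV with a header line, one way to do that initial load is mongoimport (the database name healthdb is just a placeholder here; adjust it and the collection/file names to your setup):

mongoimport --db healthdb --collection beneficiaries --type csv --headerline --file beneficiaries.csv
mongoimport --db healthdb --collection claims --type csv --headerline --file claims.csv
mongoimport --db healthdb --collection labs --type csv --headerline --file lab.csv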

Assuming you load beneficiaries.csv into a beneficiaries collection, this is sample code for the map-reduce on beneficiaries:

var mapBeneficiaries = function() {
    // emit the value already in its final shape, so a userid with a single
    // row (where reduce is never called) still ends up with an array
    emit(this.userid, {
        beneficiaries: [{ start_date: this.start_date, end_date: this.end_date }],
        gender: this.gender,
        dob: this.dob
    });
};

var reduce = function(k, values) {
    // reduce may receive raw mapped values, the output of an earlier reduce
    // pass, or the existing document when out: {reduce: ...} is used, so
    // every value already carries a beneficiaries array that can be merged
    var list = { beneficiaries: [], gender: '', dob: '' };
    values.forEach(function(v) {
        list.beneficiaries = list.beneficiaries.concat(v.beneficiaries);
        list.gender = v.gender;
        list.dob = v.dob;
    });
    return list;
};

db.beneficiaries.mapReduce(mapBeneficiaries, reduce, { out: { reduce: "mr_beneficiaries" } });

The output in mr_beneficiaries will be like this:

{
    "_id" : "a9dk4kJkj",
    "value" : {
        "beneficiaries" : [ 
            {
                "start_date" : 20080201,
                "end_date" : 0
            }, 
            {
                "start_date" : 20080201,
                "end_date" : 0
            }
        ],
        "gender" : "male",
        "dob" : 20080514
    }
}

Do the same thing to obtain mr_claims and mr_labs.
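As a rough sketch only (assuming the claims rows were imported into a collection named claims with the fields shown in the question; labs would follow the same pattern against a labs collection, writing to mr_labs):

// map: emit the value already holding a claims array, keyed by userid
var mapClaims = function() {
    emit(this.userid, { claims: [{ date: this.date, code: this.code, blah: this.blah }] });
};

// reduce: concatenate the per-userid arrays; the shape stays the same across re-reduce passes
var reduceClaims = function(k, values) {
    var list = { claims: [] };
    values.forEach(function(v) {
        list.claims = list.claims.concat(v.claims);
    });
    return list;
};

db.claims.mapReduce(mapClaims, reduceClaims, { out: { reduce: "mr_claims" } });

Then integrate the data from the three result collections into singledocuments: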

db.mr_beneficiaries.find().forEach(function(doc) {
    var id = doc._id;
    // a user may have no claims or labs at all, so guard against null lookups
    var claims = db.mr_claims.findOne({ "_id": id });
    var labs = db.mr_labs.findOne({ "_id": id });
    db.singledocuments.insert({
        "userid": id,
        "gender": doc.value.gender,
        "dob": doc.value.dob,
        "beneficiaries": doc.value.beneficiaries,
        "claims": claims ? claims.value.claims : [],
        "labs": labs ? labs.value.labs : []
    });
});
anhlc
  • I can't use map-reduce for inserting claims and labs? Wouldn't that run faster? – user1290942 Dec 05 '14 at 23:14
  • Also, I would think findOne won't do it; there are multiple claims and multiple labs. – user1290942 Dec 06 '14 at 00:08
  • After map-reduce you will have one claims/labs list per userid, so findOne is fine. – anhlc Dec 06 '14 at 01:11
  • You can use map-reduce to insert claims and labs into the same document in mr_beneficiaries by setting the same "out". But the output format won't be what you want. – anhlc Dec 06 '14 at 01:15
  • I think the best solution for me would be to put every CSV in one collection, then use map-reduce by id and form the new collection. Since the userid is used all across, this should work fine. Thank you for the working beneficiaries example and for demonstrating how I could use map-reduce on this data. I had a hard time grasping the concept from the docs until I ran your working example on data I was familiar with. – user1290942 Dec 06 '14 at 05:08