The files I will be getting have at least a million rows each, and at most 1.5 billion. The data is normalized when I get it. I need a way to store it in one document per member. For the most part I am not 100% sure how the data will be given to me; it could be CSV, fixed-width text, TSV, or something else.
Currently I have some collections that I imported from some sample CSVs.
Below is a small representation of my data, with some fields omitted.
In my beneficaries.csv the data is repeated.
beneficaries.csv (over 6 million records):
Record #1:
{"userid":"a9dk4kJkj",
"gender":"male",
"dob":20080514,
"start_date":20000101,
"end_date":20080227}
Record #2:
{"userid":"a9dk4kJkj",
"gender":"male",
"dob":20080514,
"start_date":20080201,
"end_date":00000000}
Same user, different start and end dates.
claims.csv (over 200 million records):
{"userid":"a9dk4kJkj",
"date":20080514,
"code":"d4rd3",
"blah":"data"}
lab.csv (over 10 million records):
{"userid":"a9dk4kJkj",
"date":20080514,
"lab":"mri",
"blah":"data"}
From my limited knowledge I have three options:
1. Sort the files, read x records from the data files into our C++ Member objects, stop at y, insert those members into MongoDB, then continue from y for the next x members until we are done. This is tested and working, but sorting such massive files will tie up our machine for hours.
2. Load the data into SQL, read it one row at a time into C++ Member objects, and bulk load the data into MongoDB. Tested and works, but I would like to avoid this very much.
3. Load the documents into MongoDB in separate collections and perform a map-reduce with the out parameter writing to a single collection. I have the documents loaded (as shown above) in their own collection for each file. Unfortunately I am new to Mongo and on a deadline. The map-reduce concept is difficult for me to wrap my head around and implement. I have read the docs and tried using this answer on Stack Overflow: MongoDB: Combine data from multiple collections into one..how?
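If I understand that answer correctly, the idea is to run one map-reduce per source collection and have every run reduce into the same output collection. Below is the rough sketch I have been trying to put together in the mongo shell. The source collection names (beneficaries, claims, lab) are just my imports named after the files, the output collection members_mr is a placeholder, and I have not run this against the full data set.

// maps: one per source collection; each emits by userid and wraps its
// fields in the array that should end up on the member document
var mapBeneficiaries = function () {
    emit(this.userid, {
        gender: this.gender,
        dob: this.dob,
        beneficiaries: [ { start_date: this.start_date, end_date: this.end_date } ],
        claims: [],
        labs: []
    });
};

var mapClaims = function () {
    emit(this.userid, {
        beneficiaries: [],
        claims: [ { date: this.date, code: this.code, blah: this.blah } ],
        labs: []
    });
};

var mapLabs = function () {
    emit(this.userid, {
        beneficiaries: [],
        claims: [],
        labs: [ { date: this.date, lab: this.lab, blah: this.blah } ]
    });
};

// one reduce shared by all three runs: concatenate the arrays and keep
// whichever scalar fields have been seen so far
var reduceMember = function (key, values) {
    var merged = { gender: null, dob: null, beneficiaries: [], claims: [], labs: [] };
    values.forEach(function (v) {
        if (v.gender) merged.gender = v.gender;
        if (v.dob)    merged.dob = v.dob;
        merged.beneficiaries = merged.beneficiaries.concat(v.beneficiaries || []);
        merged.claims = merged.claims.concat(v.claims || []);
        merged.labs   = merged.labs.concat(v.labs || []);
    });
    return merged;
};

// out: {reduce: ...} merges each run into the existing output collection
// instead of replacing it, so the three passes accumulate into one
// document per userid
db.beneficaries.mapReduce(mapBeneficiaries, reduceMember, { out: { reduce: "members_mr" } });
db.claims.mapReduce(mapClaims, reduceMember, { out: { reduce: "members_mr" } });
db.lab.mapReduce(mapLabs, reduceMember, { out: { reduce: "members_mr" } });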
The output member collection should look like this:
{"userid":"aaa4444",
"gender":"female",
"dob":19901225,
"beneficiaries":[{"start_date":20000101,
"end_date":20080227},
{"start_date":20008101,
"end_date":00000000}],
"claims":[{"date":20080514,
"code":"d4rd3",
"blah":"data"},
{"date":20080514,
"code":"d4rd3",
"blah":"data"}],
"labs":[{"date":20080514,
"lab":"mri",
"blah":"data"}]}
Would the performance of loading the data into SQL, reading it in C++, and inserting into MongoDB beat the map-reduce? If so, I will stick with that method.