
I am trying to combine the data from 9 CSVs that are all related to each other into one database.

The issue I'm having is that the base CSV has 5 million records, and each record needs information from the other 8 (also large) CSVs to create one complete record, but creating a single complete record currently takes over a minute.

Here is a simplified view of the problem.

The base CSV represents a Vehicle

Vehicle {
    vehicle_id,
    engine_id,
    maintenance_id,
    veh_eng_maintenance_id
}

Where maintenance_id is the primary key of a maintenance object, and there are also intermediate lookup steps.

Lookup {
    lookup_id,
    veh_eng_maintenance_id,
    schedule_id
}

Where schedule_id is the primary key of a schedule object from another CSV, and veh_eng_maintenance_id comes from the Vehicle.

My goal is to create a collection in my Mongo database that is made up of Vehicles:

Vehicle {
    vehicle_id,
    engine_id,
    maintenance {
        description,
        name
    },
    schedules [
        schedule {
            name,
            description,
            date
        }
    ]
}

Right now I'm loading the CSVs in with C#, creating collections for them in Mongo and classes for them in C#. Then I go through the vehicles collection (all 5 million records) and query each of the other collections to build the completed vehicle record.
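
Roughly, the build loop looks like the simplified sketch below (the collection names, row classes, and 2.x driver API are just for illustration; my real code is more involved):

    using System;
    using System.Linq;
    using MongoDB.Bson.Serialization.Attributes;
    using MongoDB.Driver;

    // Simplified row classes; the real ones have many more columns.
    [BsonIgnoreExtraElements]
    public class VehicleRow { public int vehicle_id { get; set; } public int engine_id { get; set; } public int maintenance_id { get; set; } public int veh_eng_maintenance_id { get; set; } }
    [BsonIgnoreExtraElements]
    public class MaintenanceRow { public int maintenance_id { get; set; } public string name { get; set; } public string description { get; set; } }
    [BsonIgnoreExtraElements]
    public class LookupRow { public int veh_eng_maintenance_id { get; set; } public int schedule_id { get; set; } }
    [BsonIgnoreExtraElements]
    public class ScheduleRow { public int schedule_id { get; set; } public string name { get; set; } public string description { get; set; } public DateTime date { get; set; } }

    class BuildVehicles
    {
        static void Main()
        {
            var db = new MongoClient("mongodb://localhost:27017").GetDatabase("fleet");
            var vehicles    = db.GetCollection<VehicleRow>("vehicles");
            var maintenance = db.GetCollection<MaintenanceRow>("maintenance");
            var lookups     = db.GetCollection<LookupRow>("lookups");
            var schedules   = db.GetCollection<ScheduleRow>("schedules");

            foreach (var v in vehicles.Find(FilterDefinition<VehicleRow>.Empty).ToEnumerable())
            {
                // Several queries per vehicle (8 collections in the real data), 5 million times over.
                var maint = maintenance.Find(m => m.maintenance_id == v.maintenance_id).FirstOrDefault();
                var scheduleIds = lookups.Find(l => l.veh_eng_maintenance_id == v.veh_eng_maintenance_id)
                                         .ToList()
                                         .Select(l => l.schedule_id);
                var scheds = schedules.Find(Builders<ScheduleRow>.Filter.In(s => s.schedule_id, scheduleIds)).ToList();
                // ...assemble the combined Vehicle document from maint/scheds and save it to the target collection.
            }
        }
    }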

But this takes way too long, and it also takes too long to query for an individual vehicle on the fly instead of building the full vehicle collection beforehand. I'm wondering if there is a fast way to combine extremely large collections, or a quicker way to query.

nastassiar
  • You could use Python or something similar to create a script for the database and store the vehicle records with the full set of details, as opposed to creating 9 different tables linked by IDs. If you really want the additional tables to exist in the database you can still build them in and use them to store unique maintenance data, etc., to make creating/editing vehicle records easier. – scurrie Mar 25 '15 at 17:30
  • The issue is that the data starts as 9 separate CSVs that are all related; I don't have control over this. My goal is to get them into one using C#, but without it taking a thousand hours. The additional tables aren't necessary; I am currently only saving them to Mongo so I don't run out of memory locally. – nastassiar Mar 25 '15 at 17:33
  • I like Python for doing this sort of thing; you could read each CSV in as a dictionary using DictReader and then use a big loop to make the database script. Since everything is in a dictionary already you can reference the maintenance data, vehicle data, etc., using the ID given in the base CSV. I'm sure you could probably do the same in C# too. Take a look at [this SO post](http://stackoverflow.com/questions/5282999/reading-csv-file-and-storing-values-into-an-array), perhaps using a dictionary instead of the list objects in Michael's answer. – scurrie Mar 25 '15 at 17:42
  • Probably there are a couple of things that can be improved; without seeing the code it would be really hard to tell. However, I would advise using a CSV reader/mapper (e.g. http://joshclose.github.io/CsvHelper/). It is very lightweight and it maps the objects for you (see the sketch after these comments). After that, all you need is a final class that references the objects, and to create a List. – Hozikimaru Mar 25 '15 at 17:50
  • This is basically exactly what I'm doing now. The problem is that it is really slow: it takes over a minute to create one vehicle object, because I have to query 8 different lists/Mongo collections to get all the information I need for one vehicle. I'm wondering if there is a faster way to join Mongo collections, or lists. – nastassiar Mar 25 '15 at 17:51
  • You could create a script in the Mongo shell to do this task: do a forEach on the main file's collection, then do a find by ID for the parts you need and save them to a new collection. – Petrus Prinsloo Mar 25 '15 at 21:51
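
For reference, a minimal CsvHelper sketch along the lines of the comment above (the file name, the MaintenanceRecord class, and its columns are made up for illustration; newer CsvHelper versions take a CultureInfo argument in the constructor):

    using System.Globalization;
    using System.IO;
    using System.Linq;
    using CsvHelper;

    public class MaintenanceRecord
    {
        public int maintenance_id { get; set; }
        public string name { get; set; }
        public string description { get; set; }
    }

    class CsvLoad
    {
        static void Main()
        {
            // CsvHelper maps each CSV row onto the class by header name, so a whole
            // file can be read into memory (or streamed) with a few lines of code.
            using (var reader = new StreamReader("maintenance.csv"))
            using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
            {
                var byId = csv.GetRecords<MaintenanceRecord>()
                              .ToDictionary(m => m.maintenance_id);
                // byId[maintenance_id] now gives O(1) access while building vehicles.
            }
        }
    }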

2 Answers


While I'm not familiar with MongoDB, I would hazard a guess that the problem lies with loading way too many objects in memory. My approach would be to first create classes to handle each unique piece of data:

public class engine
{
    public int id { get; set; }
    // other things...
}

public class maintenance
{
    public int id { get; set; }
    // other things...
}

public class Vehicle
{
    public int id { get; set; } // vehicle_id
    public engine engine { get; set; }
    public maintenance maint { get; set; }
}

Then I would load the ancillary data from the CSVs into dictionaries keyed by ID, using these classes:

Dictionary<int, engine> engine_list = new Dictionary<int, engine>();
Dictionary<int, maintenance> maint_list = new Dictionary<int, maintenance>();

Check out these SO posts for help populating dictionaries or lists from CSV data.
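
For illustration, one simple way to fill those dictionaries is to stream each file and parse it line by line. This assumes a header-less, comma-only format (engines.csv and its column order are made up); a CSV library would handle quoting and headers more robustly:

    using System.IO;

    // Assumes engines.csv columns are: id,<other fields>, with no embedded commas or quotes.
    foreach (var line in File.ReadLines("engines.csv"))
    {
        var fields = line.Split(',');
        var e = new engine { id = int.Parse(fields[0]) /*, other things... */ };
        engine_list[e.id] = e;   // keyed by id for O(1) lookups later
    }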

Finally, you need to load data from the base CSV with tons of records, construct a composite record using the IDs in the base file, e.g. engine = engine_list[id_from_csv], and store it in the db. To do this efficiently you'll have to break it up into chunks: load 1000 or so records from the base CSV, create and save the records, free your memory, and process the next chunk. Check out this SO post for a nice explanation of memory usage.
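
A sketch of that chunked pass, building on the classes and dictionaries above (the column order is assumed, and SaveBatch is a placeholder for whatever bulk insert your target database provides):

    using System.Collections.Generic;
    using System.IO;

    const int chunkSize = 1000;
    var batch = new List<Vehicle>(chunkSize);

    // Stream the 5-million-row base file instead of loading it all into memory.
    foreach (var line in File.ReadLines("vehicles.csv"))
    {
        var f = line.Split(',');   // assumed column order: vehicle_id,engine_id,maintenance_id,...
        batch.Add(new Vehicle
        {
            id = int.Parse(f[0]),
            engine = engine_list[int.Parse(f[1])],   // O(1) dictionary lookups instead of a query per record
            maint = maint_list[int.Parse(f[2])]
        });

        if (batch.Count == chunkSize)
        {
            SaveBatch(batch);   // placeholder for a bulk insert into the target collection/table
            batch.Clear();      // release the chunk before reading the next one
        }
    }
    if (batch.Count > 0)
        SaveBatch(batch);       // flush the final partial chunk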

Sorry I can't give you an answer specific to MongoDB; this answer assumes that you're starting from CSVs and creating DB records, as opposed to starting from MongoDB records and creating a new table/records. Hopefully it's helpful nonetheless. Good luck!

scurrie

I found the quickest solution was to create a SQL database with different tables for each type (Vehicles, Lookups, etc.). When I just loaded all the data in and queried with a bunch of joins, it took over a minute. But by creating foreign key references and indexes, and adding primary keys to the tables, I got it down to less than a second. So I create the query from my C# code and get back an object that contains all the information I need from each table.
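
Something along these lines, assuming SQL Server; the table names, column names, and connection string are illustrative, not my exact schema:

    using System;
    using System.Data.SqlClient;

    class VehicleQuery
    {
        static void Main()
        {
            using (var conn = new SqlConnection("Server=.;Database=Fleet;Integrated Security=true"))
            {
                conn.Open();

                // One-time setup: primary keys plus indexes on the join columns are what
                // took the join from over a minute down to under a second.
                var ddl = @"
                    ALTER TABLE Vehicles    ADD CONSTRAINT PK_Vehicles    PRIMARY KEY (vehicle_id);
                    ALTER TABLE Maintenance ADD CONSTRAINT PK_Maintenance PRIMARY KEY (maintenance_id);
                    ALTER TABLE Lookups     ADD CONSTRAINT PK_Lookups     PRIMARY KEY (lookup_id);
                    ALTER TABLE Schedules   ADD CONSTRAINT PK_Schedules   PRIMARY KEY (schedule_id);
                    ALTER TABLE Vehicles ADD CONSTRAINT FK_Vehicles_Maintenance
                        FOREIGN KEY (maintenance_id) REFERENCES Maintenance (maintenance_id);
                    ALTER TABLE Lookups  ADD CONSTRAINT FK_Lookups_Schedules
                        FOREIGN KEY (schedule_id) REFERENCES Schedules (schedule_id);
                    CREATE INDEX IX_Lookups_VehEngMaint ON Lookups (veh_eng_maintenance_id);";
                using (var cmd = new SqlCommand(ddl, conn))
                    cmd.ExecuteNonQuery();

                // Join everything needed for one vehicle in a single round trip.
                var sql = @"
                    SELECT v.vehicle_id, v.engine_id,
                           m.name AS maint_name, m.description AS maint_description,
                           s.name AS sched_name, s.description AS sched_description, s.date
                    FROM Vehicles v
                    JOIN Maintenance m ON m.maintenance_id = v.maintenance_id
                    JOIN Lookups     l ON l.veh_eng_maintenance_id = v.veh_eng_maintenance_id
                    JOIN Schedules   s ON s.schedule_id = l.schedule_id
                    WHERE v.vehicle_id = @vehicleId;";
                using (var cmd = new SqlCommand(sql, conn))
                {
                    cmd.Parameters.AddWithValue("@vehicleId", 12345);
                    using (var reader = cmd.ExecuteReader())
                        while (reader.Read())
                            Console.WriteLine($"{reader.GetInt32(0)}: {reader.GetString(2)} / {reader.GetString(4)}");
                }
            }
        }
    }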

nastassiar