
I am trying to design an ETL pipeline between SQL and MongoDB. The SQL server acts as the transactional database; new/updated data from SQL is moved to MongoDB to provide faster validation for the software.

In MongoDB, is there anything similar to SQL's MERGE operation? I haven't been able to find an example of bulk inserting/updating thousands of MongoDB documents based on newly loaded documents related by primary key. The MongoDB data would look similar to the following, with dbID being the primary key in SQL:

{
    _id: 1,
    dbID: 1,
    val1: "xx",
    val2: "aa"
},
{
    _id: 2,
    dbID: 2,
    val1: "y",
    val2: "bb"
},
{
    _id: 3,
    dbID: 3,
    val1: "x",
    val2: "cc"
},
{
    _id: 4,
    dbID: 4,
    val1: "z",
    val2: "bb"
}

In the same collection, I load any changes from the transactional database to be processed. These can consist of inserts and updates. How can I bulk overwrite/insert the existing documents with their new information?

{
    _id: 5,
    dbID: 1,
    val1: "x",
    val2: "UpdatedValue",
    Type: "ETL",
    SubType: "Update"
},
{
    _id: 6,
    dbID: 3,
    val1: "x",
    val2: "UpdatedValue",
    Type: "ETL",
    SubType: "Update"
},
{
    _id: 7,
    dbID: 7,
    val1: "qwert",
    val2: "xyz",
    Type: "ETL",
    SubType: "Insert"
}

In SQL, I would join on dbID and update/insert anything that changed. Is there an equivalent in MongoDB? Currently my solution loops through all Type: "ETL" records one at a time and applies each change to the existing MongoDB records. It would be great to bulk import the changes instead of looping over thousands of documents.

The final result would look like this:

{
    _id: 1,
    dbID: 1,
    val1: "xx",
    val2: "UpdatedValue"
},
{
    _id: 2,
    dbID: 2,
    val1: "y",
    val2: "bb"
},
{
    _id: 3,
    dbID: 3,
    val1: "x",
    val2: "UpdatedValue"
},
{
    _id: 4,
    dbID: 4,
    val1: "z",
    val2: "bb"
},
{
    _id: 7,
    dbID: 7,
    val1: "qwert",
    val2: "xyz"
}
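To make the intended merge concrete, here is the join-on-dbID logic sketched in plain Python over in-memory documents (an illustration of the semantics only, not a MongoDB call; the handling of the Type/SubType markers is assumed from the examples above):

```python
def merge_changes(existing, changes):
    """Merge staged ETL change documents into existing documents,
    joining on dbID like a SQL MERGE: overwrite on a match, insert
    otherwise. The Type/SubType markers are stripped from the payload."""
    by_dbid = {doc["dbID"]: dict(doc) for doc in existing}
    for change in changes:
        payload = {k: v for k, v in change.items()
                   if k not in ("_id", "Type", "SubType")}
        current = by_dbid.get(change["dbID"])
        if current is not None:          # SubType: "Update"
            current.update(payload)
        else:                            # SubType: "Insert"
            payload["_id"] = change["_id"]
            by_dbid[change["dbID"]] = payload
    return sorted(by_dbid.values(), key=lambda d: d["dbID"])
```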

Thanks!

  • I ended up using pymongo replace_one with upsert=True to accomplish this task. I batched the replace_one calls in the bulk method, and sent them to the server every 250 requests. That seems reasonably fast, and satisfies my overall goal with the added bonus of fewer writes/deletes. – ehoel Jun 22 '17 at 21:25

1 Answer

I ended up using pymongo's replace_one with upsert=True to accomplish this task. I batched the replace_one calls with the bulk write API and sent them to the server every 250 requests. That is reasonably fast and satisfies my overall goal, with the added bonus of fewer writes/deletes.

ehoel
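That approach can be sketched as follows. The pure helpers below build one (filter, replacement) pair per staged document, keyed on dbID, and slice them into batches of 250; the commented-out pymongo calls (the database and collection names there are hypothetical) show how each batch would be submitted with bulk_write:

```python
def build_replace_specs(changes):
    """One (filter, replacement) pair per staged Type: "ETL" document.
    The replacement drops _id and the ETL markers so the live document
    keeps its own _id and stays free of Type/SubType fields."""
    specs = []
    for change in changes:
        replacement = {k: v for k, v in change.items()
                       if k not in ("_id", "Type", "SubType")}
        specs.append(({"dbID": change["dbID"]}, replacement))
    return specs

def chunked(items, size=250):
    """Yield successive batches of at most `size` operations."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Submitting the batches with pymongo (requires a running server; the
# database/collection names below are made up for illustration):
#
# from pymongo import MongoClient, ReplaceOne
# coll = MongoClient().etl_db.records
# for batch in chunked(build_replace_specs(changes)):
#     coll.bulk_write(
#         [ReplaceOne(f, r, upsert=True) for f, r in batch],
#         ordered=False,
#     )
```

With upsert=True, a replacement whose filter matches nothing is inserted, so the SubType: "Insert" case rides along in the same batch as the updates.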