
I have a Mongo database that contains an imported flat-file CSV. In SQL, this file would undoubtedly be normalized: the file contains one line per period, and the periods contain repetitive information. I have created a query that uses the $push operator to aggregate (some of) the repetitive information into a single sub-object within each row, which mimics normalization. What I would like to do now is restructure the output so that the sub-object's key-value pairs become top-level fields. In SQL this is called a pivot or crosstab query; in Excel it is referred to as transposition. Regardless of the name, what I'm looking for is the ability to take key-value pairs and use them as 'columns' in Mongo.

Since Mongo and other NoSQL databases are aimed at denormalized implementations, I'm surprised this is so hard.

I'm trying to put the following JSON object into Mongo:

[{ "_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
    "Sub_data": [{"sub_id" : 1}, {"sub_value": 2}]  },
 { "_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
    "Sub_data": [{"sub_id" : 2}, {"sub_value": 5}]  },
 { "_id": {"Date": "1/2/2018", "Type": "Green", "client_id": 1},
    "Sub_data": [{"sub_id" : 2}, {"sub_value": 4}]  },
 { "_id": {"Date": "1/1/2018", "Type": "Orange", "client_id": 1},
    "Sub_data": [{"sub_id" : 6}, {"sub_value": 7}]  }]

And get the following out:

[{ "_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
    "1" : 2, "2":5},
 { "_id": {"Date": "1/2/2018", "Type": "Green", "client_id": 1},
    "2" : 4},
 { "_id": {"Date": "1/1/2018", "Type": "Orange", "client_id": 1},
    "6" : 7}]

Note that the result needs to support an arbitrary number of columns. I have looked at a number of solutions that seem to approach the problem (array to object, $addFields, $replaceRoot, something like a pivot using static columns), and I've read multiple versions of this 'do it afterwards' code. Is post-processing the only way to do this?
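For reference, the 'do it afterwards' approach can be sketched in a few lines of client-side JavaScript. This is only a sketch against the sample shape above (grouping keys in _id, one pair split across Sub_data[0].sub_id and Sub_data[1].sub_value), not a claim about what the right Mongo-side answer is:

```javascript
// Client-side pivot: merge documents that share the same grouping keys
// and hoist each Sub_data pair into a top-level "<sub_id>": <sub_value> entry.
function pivot(docs) {
    var groups = {};  // rows keyed by the stringified grouping keys
    docs.forEach(function (doc) {
        var groupKey = JSON.stringify(doc._id);
        var row = groups[groupKey] || Object.assign({}, doc._id);
        row[String(doc.Sub_data[0].sub_id)] = doc.Sub_data[1].sub_value;
        groups[groupKey] = row;
    });
    return Object.keys(groups).map(function (k) { return groups[k]; });
}

var rows = pivot([
    { _id: { Date: "1/1/2018", Type: "Green", client_id: 1 },
      Sub_data: [{ sub_id: 1 }, { sub_value: 2 }] },
    { _id: { Date: "1/1/2018", Type: "Green", client_id: 1 },
      Sub_data: [{ sub_id: 2 }, { sub_value: 5 }] },
    { _id: { Date: "1/2/2018", Type: "Green", client_id: 1 },
      Sub_data: [{ sub_id: 2 }, { sub_value: 4 }] }
]);
// rows holds two merged documents: the 1/1 Green row with "1": 2 and
// "2": 5, and the 1/2 Green row with "2": 4.
```

The drawback, of course, is that the whole result set has to come over the wire before it can be pivoted.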

NOTE: This is an attempt to mimic SQL Server (and Excel and so on) functionality described in this Stack Overflow question and this TechNet article.

Rolled up, my total pipeline using the first answer's second option looks like this:

db.rate_cards.aggregate([
    {
        "$group": {
            "_id": {
                "date": "$date",
                "start_date": "$start_date",
                "end_date": "$end_date"
            },
            "code_data": {
                "$push": {
                    "code_str": {"$substr": ["$code", 0, -1]},
                    "cpm": "$cpm"
                }
            }
        }
    },
    {
        "$group": {
            "_id": "$_id",
            "data": {
                "$mergeObjects": {
                    "$arrayToObject": [[
                        {
                            "k": {"$let": {"vars": {"sub_id_elem": {"$arrayElemAt": ["$code_data", 0]}}, "in": "$$sub_id_elem.code_str"}},
                            "v": {"$let": {"vars": {"sub_value_elem": {"$arrayElemAt": ["$code_data", 1]}}, "in": "$$sub_value_elem.cpm"}}
                        }
                    ]]
                }
            }
        }
    },
    {"$replaceRoot": {"newRoot": {"$mergeObjects": ["$_id", {"$arrayToObject": "$data"}]}}}
])

Note that this is a little more complicated and performance-intensive than I'd hoped: it declares a local variable, uses an in-clause, and so on. When attempting to run the (working) implementations of the two answers, NoSQLBooster chokes trying to expand the results around row 600.

A slightly edited version of the original dataset is below. Note that it contains a few extra fields that are not used in the query above, and a few other fields have been omitted:

{
    "_id" : ObjectId("5a578d5c57d33b197004beed"),
    "date" : ISODate("2017-09-25T03:00:00.000+03:00"),
    "start_date" : ISODate("2017-09-25T03:00:00.000+03:00"),
    "end_date" : ISODate("2017-10-01T03:00:00.000+03:00"),
    "dp" : "M-Su 12m-6a",
    "dsc" : "Daypart",
    "net" : "val1",
    "place" : "loc1",
    "code" : 12,
    "cost" : 16.8
},
{
    "_id" : ObjectId("5a578d5c57d33b197004beee"),
    "date" : ISODate("2017-09-25T03:00:00.000+03:00"),
    "start_date" : ISODate("2017-09-25T03:00:00.000+03:00"),
    "end_date" : ISODate("2017-10-01T03:00:00.000+03:00"),
    "dp" : "M-Su 12m-6a",
    "dsc" : "Daypart",
    "net" : "val1",
    "place" : "loc3",
    "code" : 24,
    "cost" : 55.6
},
{
    "_id" : ObjectId("5a578d5c57d33b197004beef"),
    "date" : ISODate("2017-09-25T03:00:00.000+03:00"),
    "start_date" : ISODate("2017-09-25T03:00:00.000+03:00"),
    "end_date" : ISODate("2017-10-01T03:00:00.000+03:00"),
    "dp" : "M-Su 12n-6p",
    "dsc" : "Daypart",
    "net" : "val2",
    "place" : "loc2",
    "code" : 23,
    "cost" : 65.5
},
{
    "_id" : ObjectId("5a578d5c57d33b197004bef0"),
    "date" : ISODate("2017-09-25T03:00:00.000+03:00"),
    "start_date" : ISODate("2017-09-25T03:00:00.000+03:00"),
    "end_date" : ISODate("2017-10-01T03:00:00.000+03:00"),
    "dp" : "M-Su 6p-12m",
    "dsc" : "Daypart",
    "net" : "val2",
    "place" : "loc2",
    "code" : 23,
    "cost" : 101
}
Dylan Brams
  • Your input document is not valid JSON. Did you intend `{ _id: {Date: "1/1/2018", Type: "Green", client_id: 1}, sub_id : 1, sub_value: 1 }` ? – s7vr Mar 19 '18 at 17:13
  • Okay. So, yes, if I had the answer for that I think I could use it to solve the more complicated problem for my particular case. However, since I'm putting reputation on it I'd love to have a more thorough answer. – Dylan Brams Mar 19 '18 at 17:35
  • Thank you. Does Sub_data always contain two fields, one as the key and the other as the value? If yes, the key really needs to be a string: {"sub_id" : "1"} – s7vr Mar 19 '18 at 17:49
  • Also, I just realized your _ids are not unique. Probably you meant `{ Date: "1/1/2018", Type: "Green", client_id: 1, "Sub_data": [{"sub_id" : "1"}, {"sub_value": 2}] }` ? – s7vr Mar 19 '18 at 17:49
  • Id's are not unique. Id's are arbitrary. – Dylan Brams Mar 19 '18 at 17:50
  • Does it have a different name? Because the `_id` field name is reserved in mongodb as a unique key and is indexed. Something like `{ "myid": {"Date": "1/1/2018", "Type": "Green", "client_id": 1}, "Sub_data": [{"sub_id" : "1"}, {"sub_value": 2}] }` will save correctly in the database. – s7vr Mar 19 '18 at 17:51
  • I guess I could project out the "sub_id" field to have a different name. I've added a few links to the equivalent functionality in SQL Server. I do most of my querying through NoSQL Booster, so my understanding of exactly what these JSON blocks look like is a little hazy. Sorry. – Dylan Brams Mar 19 '18 at 17:54
  • And sure, changing the integer fields to be strings is okay; can't that be done with a cast during the selection process? – Dylan Brams Mar 19 '18 at 17:56
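(On the cast question above: yes, this can be done inside the pipeline. Before $toString arrived in MongoDB 4.0, the usual trick on 3.x, and the one the question's own pipeline uses, was $substr with a length of -1, which coerces a number to its full string representation. A hypothetical stage over this question's code field:)

```
// Hypothetical $addFields stage (MongoDB 3.4+): coerce the numeric
// "code" field into a string so it can later serve as an object key.
{"$addFields": {"code_str": {"$substr": ["$code", 0, -1]}}}
```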

1 Answer


Okay, based on the information provided in the post and comments, I have created the data set below.

Note: I have made a couple of changes, both noted in the comments as well.

Changed _id to my_id in the database, because the _id field name is reserved and is uniquely indexed.

Changed "sub_id" to store its value as a string.

db.test.insert(
[
 { "my_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
    "Sub_data": [{"sub_id" : "1"}, {"sub_value": 2}]  },
 { "my_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
    "Sub_data": [{"sub_id" : "2"}, {"sub_value": 5}]  },
 { "my_id": {"Date": "1/2/2018", "Type": "Green", "client_id": 1},
    "Sub_data": [{"sub_id" : "2"}, {"sub_value": 4}]  },
 { "my_id": {"Date": "1/1/2018", "Type": "Orange", "client_id": 1},
    "Sub_data": [{"sub_id" : "6"}, {"sub_value": 7}]  }
])

You need to use $group and $arrayToObject to produce the expected format.

$group with $push collects the values from Sub_data, mapping each pair's first element to a key and its second element to a value; $arrayToObject then turns those k/v pairs into named fields.

$mergeObjects merges the _id with the rest of the values, and $replaceRoot promotes the merged document to the top level.

db.test.aggregate([
  {"$group":{
    "_id":"$my_id",
    "data":{
      "$push":{
        "k":{"$let":{"vars":{"sub_id_elem":{"$arrayElemAt":["$Sub_data",0]}},"in":"$$sub_id_elem.sub_id"}},
        "v":{"$let":{"vars":{"sub_value_elem":{"$arrayElemAt":["$Sub_data",1]}},"in":"$$sub_value_elem.sub_value"}}
      }
    }
  }},
  {"$replaceRoot":{"newRoot":{"$mergeObjects":["$_id",{"$arrayToObject":"$data"}]}}}
])

Output:

{"Date": "1/1/2018", "Type": "Orange", "client_id": 1, "6": 7}
{"Date": "1/2/2018", "Type": "Green", "client_id": 1, "2": 4}
{"Date": "1/1/2018", "Type": "Green", "client_id": 1, "1": 2, "2": 5}

Alternatively, you can use $mergeObjects as an accumulator to merge objects as you group.

db.test.aggregate([
  {"$group":{
    "_id":"$my_id","data":{
      "$mergeObjects":{
        "$arrayToObject":[[
          {
            "k":{"$let":{"vars":{"sub_id_elem":{"$arrayElemAt":["$Sub_data",0]}},"in":"$$sub_id_elem.sub_id"}},
            "v":{"$let":{"vars":{"sub_value_elem":{"$arrayElemAt":["$Sub_data",1]}},"in":"$$sub_value_elem.sub_value"}}
          }
        ]]
      }
    }
  }},
  {"$replaceRoot":{"newRoot":{"$mergeObjects":["$_id","$data"]}}}
])
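For intuition, here is a plain-JavaScript analogue of what the two key operators do on one group (this is not MongoDB code, just the same transformation written out by hand):

```javascript
// $arrayToObject turns an array of {k, v} pairs into a single object;
// $mergeObjects (here Object.assign) then folds the grouping keys and
// the pivoted pairs into one flat document, as $replaceRoot promotes it.
function arrayToObject(pairs) {
    var obj = {};
    pairs.forEach(function (p) { obj[p.k] = p.v; });
    return obj;
}

var merged = Object.assign(
    { Date: "1/1/2018", Type: "Green", client_id: 1 },   // the group _id
    arrayToObject([{ k: "1", v: 2 }, { k: "2", v: 5 }])  // the pushed pairs
);
// merged now carries Date, Type, client_id, "1": 2 and "2": 5
// as top-level fields, matching the desired pivoted output.
```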
s7vr
  • Thank you. I don't have time to test it immediately, but looking at it I think it will work and I'll accept the answer as soon as I can. – Dylan Brams Mar 19 '18 at 18:13
  • Np. There is no rush here. Take your time. I just had some time in hand so I added an answer. Please feel free to comment if you have any changes or concerns and I will be more than happy to accommodate them. – s7vr Mar 19 '18 at 18:14
  • Okay, I've worked with this a bit and I'm a little mystified with my results. When I run your queries against a test database it works, but when I replace the $my_id field with a $_id field (so that this can be the second and third step in a pipeline instead of first and second) I end up with errors using both methods. Is this expected behavior? – Dylan Brams Mar 21 '18 at 16:30
  • What errors do you get ? Can I see your query ? What happens in first step ? – s7vr Mar 21 '18 at 16:33
  • Question updated. I get "$arrayToObject requires an array input, found: object" as an error for the second of the provided answers with the edited text. – Dylan Brams Mar 21 '18 at 16:46
  • The first answer, while it runs, does not provide me with multiple columns in the results. I get the key-value pair and that's it, which I find a little mystifying. Running on the original provided dataset gets me two columns in the last answer like in your results. – Dylan Brams Mar 21 '18 at 16:49
  • For the first comment, I don't see why you would get that error. For the second comment, I'm afraid I may not see your problem without the data. Can you also add the new data set, which can reproduce both issues? It looks like a data issue to me. – s7vr Mar 21 '18 at 16:58
  • Yeah, of course. I apologize for making you wait but it's going to need to be tomorrow. I appreciate your help. – Dylan Brams Mar 21 '18 at 16:59
  • Okay, I edited the question text and added something closer to my real-world data. Thank you so much for your help; I'm extremely surprised that there aren't more people trying to do this kind of thing; to solve this problem perfectly I could probably apply an effective solution two or three times in this one situation. – Dylan Brams Mar 22 '18 at 09:50
  • You are welcome. You don't need the second $arrayToObject operator there in your query; it should be `{"$replaceRoot":{"newRoot":{"$mergeObjects":["$_id","$data"]}}}`. $mergeObjects and $arrayToObject are fairly new operators and I don't know how optimized they are in terms of performance. You could devise a solution using map-reduce and compare the performance. – s7vr Mar 22 '18 at 10:59
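(The map-reduce alternative suggested in the last comment could be sketched roughly as below. This is untested against a live server and assumes the my_id/Sub_data shape from the answer's sample data:)

```
db.test.mapReduce(
    function () {
        // Emit one { key: value } object per document, keyed by the group id.
        var pair = {};
        pair[String(this.Sub_data[0].sub_id)] = this.Sub_data[1].sub_value;
        emit(this.my_id, pair);
    },
    function (key, values) {
        // Merge all { key: value } objects emitted for the same group id.
        var merged = {};
        values.forEach(function (v) {
            for (var k in v) { merged[k] = v[k]; }
        });
        return merged;
    },
    { out: { inline: 1 } }
)
```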