I've been reading about how to use $lookup (aggregation) in MongoDB to do what I think is a simple procedure. I don't know if this is the right approach, because I am a beginner in MongoDB. I have two collections named five_million1_1 and five_million2_1. Some records share the same article_url across the two collections but carry different surface_keywords. I would like to combine those duplicate records (matched on article_url) into one, keep the records that appear only once, and store everything in a single collection. I tried this and this, but those approaches only work within the same collection.

Collection 1: five_million1_1.

{
    "_id" : ObjectId("5921aeadfe329210965ff3d2"),
    "article_url" : "a",
    "nyt_article_year" : 1994,
    "surface_keywords" : [
        {
            "surface_keyword" : "Greenwich",
            "entity_score" : 0.14455
        },
        {
            "surface_keyword" : "Frank Oz",
            "entity_score" : 0.60855
        }
    ]
}
{
    "_id" : ObjectId("5921aea4fe329210965ff3d1"),
    "article_url" : "b",
    "nyt_article_year" : 1995,
    "surface_keywords" : [
        {
            "surface_keyword" : "capital gain",
            "entity_score" : 0.43096
        },
        {
            "surface_keyword" : "pro forma",
            "entity_score" : 0.25205
        }
    ]
}

Collection 2: five_million2_1.

{
    "_id" : ObjectId("5921aeadfe329210965ff4d5"),
    "article_url" : "a",
    "nyt_article_year" : 1994,
    "surface_keywords" : [
        {
            "surface_keyword" : "dhaka",
            "entity_score" : 0.14359
        },
        {
            "surface_keyword" : "Frank",
            "entity_score" : 0.60807
        }
    ]
}


{
    "_id" : ObjectId("5921aea4fe329210965ff3c1"),
    "article_url" : "c",
    "nyt_article_year" : 1996,
    "surface_keywords" : [
        {
            "surface_keyword" : "capital gains",
            "entity_score" : 0.43096
        },
        {
            "surface_keyword" : "pro formas",
            "entity_score" : 0.25205
        }
    ]
}

Expected result

{
    "_id" : ObjectId("5921aeadfe329210965ff3d2"),
    "article_url" : "a",
    "nyt_article_year" : 1994,
    "surface_keywords" : [
        {
            "surface_keyword" : "Greenwich",
            "entity_score" : 0.14455
        },
        {
            "surface_keyword" : "Frank Oz",
            "entity_score" : 0.60855
        },
        {
            "surface_keyword" : "dhaka",
            "entity_score" : 0.14359
        },
        {
            "surface_keyword" : "Frank",
            "entity_score" : 0.60807
        }
    ]
}

{
    "_id" : ObjectId("5921aea4fe329210965ff3d1"),
    "article_url" : "b",
    "nyt_article_year" : 1995,
    "surface_keywords" : [
        {
            "surface_keyword" : "capital gain",
            "entity_score" : 0.43096
        },
        {
            "surface_keyword" : "pro forma",
            "entity_score" : 0.25205
        }
    ]
}
{
    "_id" : ObjectId("5921aea4fe329210965ff3c1"),
    "article_url" : "c",
    "nyt_article_year" : 1996,
    "surface_keywords" : [
        {
            "surface_keyword" : "capital gains",
            "entity_score" : 0.43096
        },
        {
            "surface_keyword" : "pro formas",
            "entity_score" : 0.25205
        }
    ]
}
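
To make the expected result concrete, here is a minimal sketch of the merge in plain JavaScript, run on in-memory arrays rather than on MongoDB itself. The helper name `mergeByArticleUrl` is hypothetical, the `_id` values are plain strings standing in for ObjectId, and the sketch assumes the first document seen for a given article_url supplies the `_id` and `nyt_article_year` of the merged record. It only illustrates the intended semantics and is not meant for collections of this size.

```javascript
// Hypothetical helper: merge two arrays of documents by article_url.
// Documents sharing an article_url have their surface_keywords concatenated;
// documents that appear only once are kept unchanged.
function mergeByArticleUrl(first, second) {
  const merged = new Map(); // article_url -> merged document
  for (const doc of [...first, ...second]) {
    const existing = merged.get(doc.article_url);
    if (existing === undefined) {
      // Assumption: the first document seen provides _id and nyt_article_year.
      merged.set(doc.article_url, {
        ...doc,
        surface_keywords: [...doc.surface_keywords],
      });
    } else {
      existing.surface_keywords.push(...doc.surface_keywords);
    }
  }
  return [...merged.values()];
}

// Sample data copied from the question (_id shown as a plain string).
const fiveMillion1 = [
  { _id: "5921aeadfe329210965ff3d2", article_url: "a", nyt_article_year: 1994,
    surface_keywords: [
      { surface_keyword: "Greenwich", entity_score: 0.14455 },
      { surface_keyword: "Frank Oz", entity_score: 0.60855 },
    ] },
  { _id: "5921aea4fe329210965ff3d1", article_url: "b", nyt_article_year: 1995,
    surface_keywords: [
      { surface_keyword: "capital gain", entity_score: 0.43096 },
      { surface_keyword: "pro forma", entity_score: 0.25205 },
    ] },
];
const fiveMillion2 = [
  { _id: "5921aeadfe329210965ff4d5", article_url: "a", nyt_article_year: 1994,
    surface_keywords: [
      { surface_keyword: "dhaka", entity_score: 0.14359 },
      { surface_keyword: "Frank", entity_score: 0.60807 },
    ] },
  { _id: "5921aea4fe329210965ff3c1", article_url: "c", nyt_article_year: 1996,
    surface_keywords: [
      { surface_keyword: "capital gains", entity_score: 0.43096 },
      { surface_keyword: "pro formas", entity_score: 0.25205 },
    ] },
];

const mergedDocs = mergeByArticleUrl(fiveMillion1, fiveMillion2);
console.log(JSON.stringify(mergedDocs, null, 2));
```

With the sample data, this yields three documents ("a", "b", "c"), and the "a" document carries all four keywords, which matches the expected result shown above.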
  • You're talking about a "UNION". A `$lookup` is not that sort of thing, and really is not even a "join" at all, even though it "looks like" a "left join". The only thing you can really do with `$lookup` is locate a document by a matching "key" ( maybe `"article_url"` ) and effectively "embed" that in the document from the parent collection you ran the aggregation from. You can do all sorts of things to "mangle" that output to appear "unionish", but none of them are really practical to use on real sized collections. – Neil Lunn May 31 '17 at 09:21
  • Thanks, I am trying to follow your suggestion. But if none of them are really practical to use on real-sized collections, then what could be the possible solutions? – Humaun Rashid Nayan May 31 '17 at 09:42
  • It's not a suggestion, it just simply does not work. For instance, the "c" value or the "b" value does not relate to a document in the other collection and therefore cannot "join". MongoDB is not made for this kind of stuff. You are meant to be doing things differently altogether rather than carrying over how you work with an RDBMS. You either accept working differently, or use an RDBMS instead. So if there is some real-world purpose here, i.e. removing duplicates from **one** of the collections and "writing" a single collection, then that is what you should be asking instead. – Neil Lunn May 31 '17 at 09:53
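
The point made in the comments, that a left-join-style lookup cannot produce a union, can be illustrated with a small in-memory simulation in plain JavaScript. This is only a sketch of the matching behaviour described above, not MongoDB's implementation; the function `simulateLookup` and the field name `matches` are hypothetical, and the documents are reduced to their article_url.

```javascript
// Hypothetical simulation of a left-join-style lookup: for each parent
// document, embed the foreign documents that share the same key value.
function simulateLookup(parentDocs, foreignDocs, key, as) {
  return parentDocs.map((doc) => ({
    ...doc,
    [as]: foreignDocs.filter((other) => other[key] === doc[key]),
  }));
}

// Reduced sample data: only article_url is kept for brevity.
const parentUrls = [{ article_url: "a" }, { article_url: "b" }];
const foreignUrls = [{ article_url: "a" }, { article_url: "c" }];

const joined = simulateLookup(parentUrls, foreignUrls, "article_url", "matches");
console.log(JSON.stringify(joined));
```

In this simulation "a" gets one embedded match and "b" gets an empty array, while "c" never appears in the output because it exists only in the foreign array. That is the gap between a lookup and the union the question asks for.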