0

I have a collection of 6-7 million event records. I have another collection of ~100,000 hourly weather records spanning the same timeframe as the event records. I am using an $aggregate pipeline with $lookup to merge in relevant weather data for each event in the event collection.

THE PROBLEM: I have been running this on the full EVENT dataset for more than 8 HOURS, with no result. I have a deadline and I'm wondering if I will get a result...ever.

PLEASE HELP

Here is a sample event record:

{
    "_id" : ObjectId("5dedae8111cd89b173b00910"),
    "EventType" : "P",
    "Jurisdiction" : "ABCD",
    "Year" : 2006,
    "JulianDay" : 91,
    "CallReceipt" : ISODate("2006-04-01T00:00:37Z"),
    "EventClosed" : ISODate("2006-04-01T00:05:25Z"),
    "FinalType" : "EFGHI",
    "EventWindowStart" : ISODate("2006-04-01T00:00:00Z"),
    "EventWindowEnd" : ISODate("2006-04-01T01:00:00Z")
}

Here is a weather record:

{
    "_id" : ObjectId("5dc3cd909fc78c0c78a336da"),
    "DATE" : ISODate("2012-01-01T00:02:00Z"),
    "REPORT_TYPE" : "FM-16",
    "SOURCE" : 7,
    "HourlyAltimeterSetting" : "30.06",
    "HourlyDewPointTemperature" : "36",
    "HourlyDryBulbTemperature" : "37",
    "HourlyPresentWeatherType" : "BR:1 ||",
    "HourlyRelativeHumidity" : 93,
    "HourlySkyConditions" : "SCT:04 7 BKN:07 15 OVC:08 33",
    "HourlyStationPressure" : "29.46",
    "HourlyVisibility" : "5.00",
    "HourlyWetBulbTemperature" : 37,
    "HourlyWindDirection" : "260",
    "HourlyWindSpeed" : 5,
    "REM" : "MET10101/01/12 00:02:02 SPECI KROC 010502Z 26004KT 5SM BR SCT00
    7 BKN015 OVC033 03/02 A3006 RMK AO2 RTX (MP)",
    "REPORT_MODE" : "hourly"
}

Here is my code, typed directly into the mongo shell:

db.EVENTS.aggregate([
{
    $lookup: 
            {
            from: "WEATHER",
            let : { start : "$EventWindowStart", end: "$EventWindowEnd"},
            pipeline : [
                { $match :
                    { $expr: 
                        { $and:
                            [
                                {$gte: ["$DATE", "$$start"]},
                                {$lte: ["$DATE", "$$end"]}
                            ]
                        }
                    }
                },
                {$project: {
                    _id : 0,
                    HourlyDryBulbTemperature : 1,
                    HourlyPrecipitation : 1,
                    HourlyVisibility : 1,
                    WindSpeed: 1
                    }
                }
             ],
             as: "HourlyWeatherData"
            }
},
{$out: "MERGED" }

])

On a small test subset I get the desired output. So the code works, as far as I can tell...

Sample output:

{
    "_id" : ObjectId("5dedae8111cd89b173b00910"),
    "EventType" : "P",
    "Jurisdiction" : "ABCD",
    "Year" : 2006,
    "JulianDay" : 91,
    "CallReceipt" : ISODate("2006-04-01T00:00:37Z"),
    "EventClosed" : ISODate("2006-04-01T00:05:25Z"),
    "FinalType" : "EFGHI",
    "EventWindowStart" : ISODate("2006-04-01T00:00:00Z"),
    "EventWindowEnd" : ISODate("2006-04-01T01:00:00Z"),
    "HourlyWeatherData" : [
            {
                    "HourlyDryBulbTemperature" : "59",
                    "HourlyPrecipitation" : "0.00",
                    "HourlyVisibility" : "10.00"
            },
            {
                    "HourlyDryBulbTemperature" : "59",
                    "HourlyVisibility" : "9.94"
            }
    ]

}

PS: I do have ascending indexes on the event window fields in EVENTS, and an ascending and descending index on the DATE in WEATHER.

sahase
  • 3
  • 2
  • You can use `cursor.explain()` to check whether your index is really utilised by the query `db.EVENTS.aggregate(.......).explain("executionStats")` – mike Dec 09 '19 at 03:34
  • So I ran an explain("exectutionStats") on the test dataset, but it doesn't give any info about the lookup stage. Poking around this site, (https://stackoverflow.com/questions/54383148/mongodb-lookup-pipeline-using-collscan-instead-of-index) it seems that "Unfortunately query explain output does not (as at MongoDB 4.0) indicate index usage for $lookup stages. A workaround for this would be running explain using your lookup's pipeline as a top level aggregation query." <-- I'm not sure how to do that. – sahase Dec 09 '19 at 04:06

0 Answers0