10

How can I remove all fields that are null from all documents of a given collection?


I have a collection of documents such as:

{
    'property1': 'value1',
    'property2': 'value2',
    ...
}

but each document may have a null entry instead of a value entry.

I would like to save disk space by removing all null entries. The null entries don't carry any information in my case because I know the format of the JSON document a priori.

Xavier Guihot
Till Hoffmann

6 Answers

15

Starting in Mongo 4.2, db.collection.update() can accept an aggregation pipeline, finally allowing the removal of a field based on its value:

// { _id: ObjectId("5d0e8...d2"), property1: "value1", property2: "value2" }
// { _id: ObjectId("5d0e8...d3"), property1: "value1", property2: null, property3: "value3" }
db.collection.update(
  {},
  [{ $replaceWith: {
    $arrayToObject: {
      $filter: {
        input: { $objectToArray: "$$ROOT" },
        as: "item",
        cond: { $ne: ["$$item.v", null] }
      }
    }
  }}],
  { multi: true }
)
// { _id: ObjectId("5d0e8...d2"), property1: "value1", property2: "value2" }
// { _id: ObjectId("5d0e8...d3"), property1: "value1", property3: "value3" }

In detail:

  • The first part {} is the match query, filtering which documents to update (in our case all documents).

  • The second part [{ $replaceWith: { ... } }] is the update aggregation pipeline (note the square brackets signifying the use of an aggregation pipeline):

    • With $objectToArray, we first transform the document to an array of key/values such as [{ k: "property1", v: "value1" }, { k: "property2", v: null }, ...].
    • With $filter, we filter this array of key/values by removing items for which v is null.
    • We then transform the filtered array of key/values back to an object using $arrayToObject.
    • Finally, we replace the whole document with the modified one using $replaceWith.
  • Don't forget { multi: true }, otherwise only the first matching document will be updated.
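
As a side note, the same pipeline can be passed to updateMany (also Mongo 4.2+), which always updates every matching document, so the multi flag isn't needed; a minimal equivalent sketch:

// same pipeline via updateMany: all matching documents are updated,
// no { multi: true } required
db.collection.updateMany(
  {},
  [{ $replaceWith: {
    $arrayToObject: {
      $filter: {
        input: { $objectToArray: "$$ROOT" },
        as: "item",
        cond: { $ne: ["$$item.v", null] }
      }
    }
  }}]
)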

Xavier Guihot
    This should be the accepted answer. It is structure-agnostic, and works by simply replacing the collection name. The only caveat is that it works only on the top-level keys, and not the nested ones, if you have them. – sandboxbohemian Jul 09 '21 at 17:08
12
// run in mongo shell

var coll = db.getCollection("collectionName");
var cursor = coll.find();
while (cursor.hasNext()) {
    var doc = cursor.next();
    // collect the names of all null fields of this document
    var keys = {};
    var hasNull = false;
    for (var x in doc) {
        // "== null" matches both null and undefined
        if (x != "_id" && doc[x] == null) {
            keys[x] = 1;
            hasNull = true;
        }
    }
    // remove all collected fields with a single $unset per document
    if (hasNull) {
        coll.update({_id: doc._id}, {$unset: keys});
    }
}
SethO
Wizard
3

You can use the mongo updateMany functionality, but you must do it by specifying the field you are going to update, such as the year field:

db.collection.updateMany({year: null}, { $unset : { year : 1 }})
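
If several known fields may contain nulls, a sketch of the same idea is to repeat the call once per field, so that $unset only touches documents where that particular field is currently null (the author field below is just an illustrative name):

// one pass per known field; the filter limits $unset to documents
// where that specific field is null
db.collection.updateMany({year: null},   { $unset : { year : 1 }})
db.collection.updateMany({author: null}, { $unset : { author : 1 }})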
J.C. Gras
3

This is an important question since MongoDB cannot index null values (i.e. do not query for nulls or you will be waiting for a long time), so it is best to avoid nulls entirely and to set default values using $setOnInsert.
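
For example, a minimal sketch of using $setOnInsert during an upsert so that a default value is written instead of null in the first place (the filter and field names here are hypothetical):

// if no matching document exists, one is inserted with a default year
// instead of null; documents that already exist are left untouched
db.collection.updateOne(
    { name: "some-document" },
    { $setOnInsert: { year: 1970 } },
    { upsert: true }
)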

Here is a recursive solution to removing nulls:

/**
 * RETRIEVES A LIST OF ALL THE KEYS IN A DOCUMENT, WHERE THE VALUE IS 'NULL' OR 'UNDEFINED'
 *
 * @param doc
 * @param keyName
 * @param nullKeys
 */
function getNullKeysRecursively(doc, keyName, nullKeys)
{
    for (var item_property in doc)
    {
        // SKIP BASE-CLASS STUFF
        if (!doc.hasOwnProperty(item_property))
            continue;
        // SKIP ID FIELD
        if (item_property === "_id")
            continue;

        // FULL KEY NAME (FOR SUB-DOCUMENTS)
        var fullKeyName;
        if (keyName)
            fullKeyName = keyName + "." + item_property;
        else
            fullKeyName = item_property;

        // DEBUGGING
        // print("fullKeyName: " + fullKeyName);

        // NULL FIELDS - MODIFY THIS BLOCK TO ADD CONSTRAINTS
        if (doc[item_property] === null || doc[item_property] === undefined)
            nullKeys[fullKeyName] = 1;

        // EMPTY ARRAYS AND EMPTY PLAIN OBJECTS (AS IN THE EXAMPLE BELOW);
        // THE constructor CHECK LEAVES DATES, OBJECTIDS, ETC. UNTOUCHED
        else if ((Array.isArray(doc[item_property]) && doc[item_property].length === 0)
                || (doc[item_property].constructor === Object && Object.keys(doc[item_property]).length === 0))
            nullKeys[fullKeyName] = 1;

        // RECURSE OBJECTS / ARRAYS
        else if (doc[item_property] instanceof Object || doc[item_property] instanceof Array)
            getNullKeysRecursively(doc[item_property], fullKeyName, nullKeys);
    }
}

/**
 * REMOVES ALL PROPERTIES WITH A VALUE OF 'NULL' OR 'UNDEFINED'.
 * TUNE THE 'LIMIT' VARIABLE TO YOUR MEMORY AVAILABILITY.
 * ONLY CLEANS DOCUMENTS THAT REQUIRE CLEANING, FOR EFFICIENCY.
 * USES bulkWrite FOR EFFICIENCY.
 *
 * @param collectionName
 */
function removeNulls(collectionName)
{
    var coll = db.getCollection(collectionName);
    var lastId = ObjectId("000000000000000000000000");
    var LIMIT = 10000;
    while (true)
    {
        // GET THE NEXT PAGE OF DOCUMENTS (SORTED BY _id SO THE lastId PAGING CANNOT SKIP DOCUMENTS)
        var page = coll.find({ _id: { $gt: lastId } }).sort({ _id: 1 }).limit(LIMIT);
        if (! page.hasNext())
            break;

        // BUILD BULK OPERATION
        var arrBulkOps = [];
        page.forEach(function(item_doc)
        {
            lastId = item_doc._id;

            var nullKeys = {};
            getNullKeysRecursively(item_doc, null, nullKeys);

            // ONLY UPDATE MODIFIED DOCUMENTS;
            // UNSET INDIVIDUAL FIELDS, RATHER THAN REWRITE THE ENTIRE DOC
            if (Object.keys(nullKeys).length > 0)
                arrBulkOps.push(
                    { updateOne: {
                            "filter": { _id: item_doc._id },
                            "update": { $unset: nullKeys }
                        } }
                );
        });

        // SKIP PAGES WITH NOTHING TO CLEAN (bulkWrite REJECTS AN EMPTY LIST OF OPERATIONS)
        if (arrBulkOps.length > 0)
            coll.bulkWrite(arrBulkOps, { ordered: false });
    }
}

// GO GO GO
removeNulls('my_collection');

document before:

{
    "_id": ObjectId("5a53ed8f6f7c4d95579cb87c"),
    "first_name": null,
    "last_name": "smith",
    "features": {
        "first": {
            "a": 1,
            "b": 2,
            "c": null
        },
        "second": null,
        "third" : {},
        "fourth" : []
    },
    "other": [ 
        null, 
        123, 
        {
            "a": 1,
            "b": "hey",
            "c": null
        }
    ]
}

document after:

{
    "_id" : ObjectId("5a53ed8f6f7c4d95579cb87c"),
    "last_name" : "smith",
    "features" : {
        "first" : {
            "a" : 1,
            "b" : 2
        }
    },
    "other" : [ 
        null, 
        123, 
        {
            "a" : 1,
            "b" : "hey"
        }
    ]
}

As you can see, it removes null, undefined, empty objects and empty arrays. If you need it to be more or less aggressive, it is a matter of modifying the block "NULL FIELDS - MODIFY THIS BLOCK TO ADD CONSTRAINTS", for instance as shown below.
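
For instance, a sketch (not part of the original function) of extending that block so empty strings are also flagged for removal:

        // NULL FIELDS - MODIFY THIS BLOCK TO ADD CONSTRAINTS
        // variation: also treat empty strings as removable
        if (doc[item_property] === null || doc[item_property] === undefined || doc[item_property] === "")
            nullKeys[fullKeyName] = 1;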

edits welcome, especially @stennie

mils
1

As this question mentions (mongodb query without field name):

Unfortunately, MongoDB does not support any method of querying all fields with a particular value.

So, you can either iterate over the documents (like Wizard's example) or do it in a non-MongoDB way.

If this is a JSON file, removing all the lines containing null with sed might work (note that sed deletes the whole matching line, so this only makes sense when each null field sits on its own line):

sed '/null/d' ./mydata.json
wilbeibi
0

Update for 2022: if you delete keys with values null, [], "", {} from the DB, that won't reduce its size on disk. You need to do that before you upload the data into the collection.

Tested it myself. I had 6,000,000 documents in the collection and ran Xavier Guihot's script. Before the script the collection was 7.8GB; after the script it became 7.9GB. I can confirm that the script does the job and removes the keys; it just doesn't reduce the DB's space allocation.

Then I deleted the collection completely and imported .json dumps that had already been cleaned (all keys with values null, [], "", {} removed). Afterwards the collection size was 6.1GB, which is 22% less than the original size.
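
For what it's worth, depending on the MongoDB version and storage engine, space freed inside a collection is not automatically returned to the operating system; running compact on the collection after the cleanup may reclaim some of it (locking and behaviour vary between versions, so check the documentation for your deployment):

// ask the storage engine to release unused space held by the collection
db.runCommand({ compact: "collectionName" })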


Here is the Python script I used to remove all empty keys from the json dumps:

import fileinput
import json

# rewrite each line (one JSON document per line) in place, dropping every
# key whose value is falsy: None, [], "", {} (note this also drops 0 and False)
for line in fileinput.input(inplace=1):
    j = {k: v for k, v in json.loads(line).items() if v}
    print(json.dumps(j))

Just run the script with the file name as an argument, for example: python3 main.py dump-00001


PS: take into account that you may need to wait ~200 seconds after changing the DB, because WiredTiger keeps a history of the data for consistency after you make changes. That means only after about 200 seconds will you see the real storage allocation of the DB. 200 seconds is the default value for that behaviour.