1

I am cleaning a dataset, and have a field gender. In this field, there are entries such as Male, male, and MALE. To resolve this, I am trying to update my MongoDB database using pymongo.

In the database, the Gender attribute is Gender (which a capital G at the front)

My code currently looks like this:

import pymongo
from pymongo import MongoClient


db_info = {
    'db_name': 'MentalHealth',
    'collection_name': 'MentalHealth',
}

if __name__ == "__main__":

    mongo_client = MongoClient()
    mongo_db = mongo_client[db_info['db_name']]
    mongo_collection = mongo_db[db_info['collection_name']]

    #normalize to lowercase
    mongo_collection.aggregate([{ '$project': { 'Gender':{ '$toLower':"$Gender"}}}])

The code runs without issue, but the database is not updating, and I am unsure what is the error with the code. Any help would be greatly appreciated. Thank you!!!

dawg
  • 98,345
  • 23
  • 131
  • 206
Cassie H.
  • 373
  • 2
  • 7
  • 24
  • You are doing an `aggregate` which will return you all `Gender` fields cast to lower case. To update record use [update](http://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.update_one) – Sohaib Farooqi Dec 30 '17 at 04:54
  • You are almost there. You have many options. See my answer on the *possible duplicate* – styvane Dec 30 '17 at 21:39
  • @sstyvane this is wrong duplicate. OP is not updating `Gender` field using value of another field but the same field. – Sohaib Farooqi Dec 31 '17 at 05:03
  • Another field or same field, the process is still the same that is why we say "possible duplicate" I would have answered if it was not the case because none of the answers here is useful except this [answer](https://stackoverflow.com/a/48031117/3100115) which mentioned the `$out` pipeline stage operator. @GarbageCollector – styvane Dec 31 '17 at 14:29
  • _Possible Duplicate_ is when you flag a question as duplicate, marking a question as duplicate means both are exact same. Yes I do agree using `$out` pipeline operator is one way of solving this problem, however whether an answer is useful or not is to be decided by OP. @sstyvane – Sohaib Farooqi Dec 31 '17 at 14:34
  • 1
    You are missing the point here. The answers are not only for the OP. I raised that [issue once on meta](https://meta.stackoverflow.com/questions/313580/what-to-do-with-answers-that-teach-bad-practice) with answers that teach bad practice see this [comment](https://meta.stackoverflow.com/questions/313580/what-to-do-with-answers-that-teach-bad-practice#comment288104_313581) but I guess you don't want to see what I am pointing out and that is your choice @GarbageCollector – styvane Dec 31 '17 at 14:48

3 Answers3

5

Mongodb aggregation operations process data records and return computed results. It can't update any collection. you can update the same like this

db.mongo_collection.find({}).forEach(function(doc) {
    db.mongo_collection.update(
        { "_id": doc._id },
        { "$set": { "Gender": doc.Gender.toUpperCase() } }
    );
});
Murugan Perumal
  • 975
  • 5
  • 15
2

You are using aggregate query which will return you the result with all Gender fields cast to lower case. If you wish to update the value for a field you have to use update query.

Since you are using pymongo to query your documents your code should be like this

import pymongo
from pymongo import MongoClient
from bson.objectid import ObjectId

db_info = {
    'db_name': 'MentalHealth',
    'collection_name': 'MentalHealth'
}

if __name__ == "__main__":

     mongo_client = MongoClient()
     mongo_db = mongo_client[db_info['db_name']]
     mongo_collection = mongo_db[db_info['collection_name']]

     for doc in mongo_collection.find(no_cursor_timeout=True):
            pk = ObjectId(str(doc.get("_id")))
            g = doc.get('Gender')
            if g:
               g = g.lower()
               mongo_collection.update({"_id": pk}, {"$set":{"Gender":g}}) 
Murugan Perumal
  • 975
  • 5
  • 15
Sohaib Farooqi
  • 5,457
  • 4
  • 31
  • 43
  • My [comment](https://stackoverflow.com/questions/48031026/pymongo-setting-all-values-in-an-attribute-to-lowercase?noredirect=1#comment83046508_48031100) on the answer above applies to your answer as well – styvane Dec 30 '17 at 21:43
1

The aggregation framework you’re using only performs queries. To actually perform writes, you need to use a $out stage to dump the results into the collection. If you select an existing collection, that collection is replaced atomically as described in https://docs.mongodb.com/manual/reference/operator/aggregation/out/#pipe._S_out

Another option is to use an update operation to update just the documents with incorrect case.

kewne
  • 2,208
  • 15
  • 11