1

I want to update a Mongo db collection with rows from a CSV file and I'm getting an error relating to my use of distinct. What I'm trying to do is delete a document if its id is found in the CSV file. Here's the code:

# Read in CSV file
df = pd.read_csv(csv_path)
# Create list of ID values in CSV
pdToList = list(df['id'])
# Get ids from CSV
for counter, value in enumerate(pdToList):
    # find distinct ids in collection
    result = db[coll_name].distinct("id")
    # Loop through IDs in collection
    for id_val in result:
         # Check if ID in collection equals ID in CSV
         if id_val == value:
             # Delete document if it exists in CSV                         
             db[coll_name].delete_one({'id':id_val})
         else:
             pass

This script runs fine for about 100 tables, but then I get an error message on distinct being too big: {'ok': 0.0, 'errmsg': 'distinct too big, 16mb cap', 'code': 17217, 'codeName': 'Location17217'}. Do you know how I can get a field value from a document without getting this 16mb cap error? I've read similar questions, but haven't come up with a solution.

Patty Jula
  • 255
  • 1
  • 12
  • 1
    When you use `distinct`, mongo essentially generates a document containing an array of unique values, ids in your case. 16MB is the maximum allowable size of a document. For some of your tables/collections, you might be having large no. of values, subsequently large no. of distinct values that would breach the 16MB limit. You can try the aggregation solution mentioned in https://stackoverflow.com/questions/11782566/mongodb-select-countdistinct-x-on-an-indexed-column-count-unique-results-for. Or, how about fetching all the ids and applying set operation in Python to get the distinct values? – Shiva Feb 07 '19 at 19:16
  • Instead of `distinct` I tried `result= db[coll_name].aggregate([{"$group":{"_id":"","id":{"$addToSet":"$id"}}}])`, but am getting a `{'ok': 0.0, 'errmsg': 'BSONObj size: 22416634 (0x1560CFA) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: ""', 'code': 10334, 'codeName': 'BSONObjectTooLarge'}` similar error. – Patty Jula Feb 07 '19 at 21:37

1 Answers1

1

Here is what I did. It was better to find values that were equal to the values in the dataframe list, than try to do another for loop, probably better for my computer too.

# Read in CSV file
df = pd.read_csv(csv_path)
# Create list of ID values in CSV
pdToList = list(df['id'])
# Get ids from CSV
for counter, value in enumerate(pdToList):
    # check if value is in document
    if db[coll_name].find({},{"id":{"$eq": value}}):

        # Delete document if it exists in CSV
        db[coll_name].delete_many({'id':value})

    else:
        pass
Patty Jula
  • 255
  • 1
  • 12