
I have a mongodb collection against which I need to run many count operations (each with a different query) every hour. When I first set this up, the collection was small, and these count operations ran in approx one minute, which was acceptable. Now they take approx 55 minutes, so they're running nearly continuously.

The query associated with each count operation is rather involved, and I don't think there's a way to get them all to run with indices (i.e. as COUNT_SCAN operations).

The only feasible solution I've come up with is to:

  • Run a full collection scan every hour, pulling every document out of the db
  • Once each document is in memory, run all of the count operations against it myself

Without my solution the server is running dozens and dozens of full collection scans each hour. With my solution the server is only running one. This has led me to a strange place where I need to take my complex queries and re-implement them myself so I can come up with my own counts every hour.

So my question is whether there's any support from mongo drivers (pymongo in my case, but I'm curious in general) in interpreting query documents but running them locally against data in memory, not against data on the mongodb server.

Initially this felt like an odd request, but there are actually quite a few places where this approach would probably greatly lessen the load on the database in my particular use case. So I wonder if it comes up from time to time in other production deployments.

wowkin2
nonagon
    I don't believe so. MongoDB treats everything as "data" equally, whether it's in memory or otherwise. As your data grows, doing a regular collection scan will only get slower and slower. Can you give some example of the query involved? How about using a pre-aggregated report pattern (https://docs.mongodb.com/ecosystem/use-cases/pre-aggregated-reports-mmapv1/)? Note that although the linked implementation mentions MMAPv1 only, the pattern will apply equally well to WiredTiger. – kevinadi Jan 30 '18 at 03:45
  • There are a lot of different queries, but one example is a field that can be missing, null, or have a value. I want to count instances where it's missing, null, or has one specific value (but omit from the count all other values). I don't think I can cover that count query with an index. – nonagon Jan 30 '18 at 14:47
  • The pre-aggregated report pattern might be helpful in deciding how to store the counts I'm generating, but that's a pretty small amount of data anyway. I really just need to figure out how to perform the count operations in an efficient way. – nonagon Jan 30 '18 at 14:48
  • 1) Do you use indexes? 2) Can you provide simplified examples of data and queries that reproduce this issue? I had a similar problem previously, but the query was not really efficient. After I simplified the query and added indexes, everything became much faster. – wowkin2 Aug 28 '18 at 15:43
  • @wowkin2 I do use indexing but my question isn't about how to get the queries to run faster. I'm saying that running the queries efficiently is out of the question, and I'm asking about a way to run the queries locally (outside the db). – nonagon Aug 28 '18 at 19:03

4 Answers


MongoDB In-Memory storage engine

If you want to process data with complex queries in RAM only, while keeping MongoDB's query syntax, you can configure MongoDB to use the In-Memory storage engine (available in MongoDB Enterprise, by setting storage.engine to inMemory), which avoids disk I/O entirely.
For me, this is the best option: you keep the ability to run complex queries and get the best performance.

Python in-memory databases:

You can use one of the following:

  • PyDbLite - a fast, pure-Python, untyped, in-memory database engine, using Python syntax to manage data, instead of SQL
  • TinyDB - if you need a simple database with a clean API that just works without lots of configuration, TinyDB might be the right choice for you. However, it is not a fast solution and it has a few other disadvantages.

They should allow working with data directly in RAM, but I'm not sure if this is better than the previous option.
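
For illustration, here is a rough sketch of counting with TinyDB's in-memory storage; the documents and field names are invented, not taken from the question:

from tinydb import TinyDB, Query
from tinydb.storages import MemoryStorage

# Keep everything in RAM instead of a JSON file on disk
db = TinyDB(storage=MemoryStorage)
db.insert_multiple([
    {"userId": 13, "status": "ok"},
    {"userId": 7, "status": "error"},
])

Doc = Query()
# Count documents matching a condition, entirely in memory
print(db.count(Doc.userId == 13))  # -> 1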

Own custom solution (e.g. written in Python)

Some services handle data in RAM at the application level only. If your solution is not complicated and the queries are simple, this is fine. But over time the queries become more complicated and the code requires some abstraction layer (for advanced CRUD), like the databases above.

The last solution can have the best performance, but it takes more time to develop and support it.
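
As a minimal sketch of that idea, assuming you fetch the whole collection once with pymongo and express each count as a plain Python predicate (the client, collection, and field names below are placeholders):

from pymongo import MongoClient

client = MongoClient()
docs = list(client.mydb.mycollection.find())  # one full scan into memory

# One plain-Python predicate per count you need
predicates = {
    "null_or_missing": lambda d: d.get("field") is None,
    "specific_value": lambda d: d.get("field") == 42,
}

counts = {name: sum(1 for d in docs if pred(d))
          for name, pred in predicates.items()}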

wowkin2
  • My question isn't about having mongo interpret the queries - I need to interpret the queries locally using code. So instead of passing the query document to pymongo which passes it to the server, I need to match the query document against the document in Python directly. – nonagon Aug 29 '18 at 02:30
  • You can do that using Python directly, but it is not related to pymongo or MongoDB in that case. – wowkin2 Aug 29 '18 at 10:59
  • @nonagon, added examples of Python in-memory database that you can investigate and use in your project or even develop your own. But all these options have different pros and cons :) – wowkin2 Aug 29 '18 at 11:57
  • Sorry for the confusion. I have a large mongodb production deployment, and 99% of the time I issue queries just like everyone else. I have a special use case where it really does make sense to fetch all of the documents from a large collection and process them locally. So I am married to the mongo query document syntax, but I need to interpret those query documents manually in Python code for this one special use case. So my question isn't about switching databases, it's about interpreting mongo query docs manually in Python code. – nonagon Aug 29 '18 at 12:55
  • I added an answer with the code I'm currently using to solve this problem. I thought this might help clarify what I'm looking for. It felt strange to write that code from scratch since I think my use case isn't actually that strange, and there might be a library which can help with this sort of thing. – nonagon Aug 29 '18 at 12:59
  • From your example, I don't see any complex features used in queries. And I think that in-memory storage with good indexes will have almost the same performance as in Python. If you can provide MongoDB query that you use - I can investigate it. But if the current solution works fine for you - it is ok to use it. You can even share it somewhere on GitHub to port all query-types. – wowkin2 Aug 29 '18 at 13:41

As you are using Python, have you considered Pandas? You could transform your JSON data into a pandas data frame and query it as you like, and you get a whole bunch of operations like count, group by, aggregate, etc. Please take a look at the docs. I'm adding a small example below to help you relate. Hope this helps.

For example:

import pandas as pd
# Note: in recent pandas versions json_normalize is available as pd.json_normalize
from pandas.io.json import json_normalize

data = {
    "data_points": [
        {"a": 1, "b": 3, "c": 2},
        {"a": 3, "b": 2, "c": 1},
        {"a": 5, "b": 4, "d": 3},
    ]
}

# convert json to data frame
df = json_normalize(data["data_points"])

The resulting data frame looks like this:

   a  b    c    d
0  1  3  2.0  NaN
1  3  2  1.0  NaN
2  5  4  NaN  3.0

Now you can just perform operations on them like sum, count, etc.

Example:

# sum of column `a`
df['a'].sum()

output: 9

# sum of column `c` that has null values.
df['c'].sum()

output: 3.0

# count of column `c` that has null values.
df['c'].count()

output: 2
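
And as a sketch of the kind of count described in the question's comments (a field that is missing, null, or equal to one specific value), assuming the column is `c` and the value of interest is 1:

# count rows where column `c` is missing/NaN or equals 1
mask = df['c'].isna() | (df['c'] == 1)
print(int(mask.sum()))

output: 2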
AJS
  • Thanks for the suggestion, but my question is really about how to interpret mongo queries against documents stored locally in memory – nonagon Sep 04 '18 at 15:17
  • agreed, but just a thought: why do you want to query this data using Mongo-type queries? In the end it's just JSON data, so you can use other ways to do the same work, something like this or maybe something else out there. I think if you write your custom solution the performance may not be the same as other solutions out there. – AJS Sep 05 '18 at 08:53
  • In my case I do lots of things with the mongo queries - query the server, modify the queries, query the server again, etc etc. so I'd like to keep everything in terms of mongo queries for consistency – nonagon Sep 05 '18 at 13:55

Here's the code I have currently to solve this problem. I have enough tests running against it to qualify it for my use case, but it's probably not 100% correct. I certainly don't handle all possible query documents.

import operator
from collections.abc import Mapping

# Note: `Mongo` (constants for the "$and", "$or", "$in", ... operator strings) and
# `nested_dict_get` (a dotted-path lookup into a nested document) are small helpers
# defined elsewhere in my code base.


def check_doc_against_mongo_query(doc, query):
    """Return whether the given doc would be returned by the given query.

    Initially this might seem like work the db should be doing, but consider a use case where we
    need to run many complex queries regularly to count matches. If each query results in a full-
    collection scan, it is often faster to run a single scan fetching the entire collection into
    memory, then run all of the matches locally.

    We don't support mongo's full query syntax here, so we'll need to add support as the need
    arises."""

    # Run our check recursively
    return _match_query(doc, query)


def _match_query(doc, query):
    """Return whether the given doc matches the given query."""

    # We don't expect a null query
    assert query is not None

    # Check each top-level field for a match, we AND them together, so return on mismatch
    for k, v in query.items():
        # Check for AND/OR operators
        if k == Mongo.AND:
            if not all(_match_query(doc, x) for x in v):
                return False
        elif k == Mongo.OR:
            if not any(_match_query(doc, x) for x in v):
                return False
        elif k == Mongo.COMMENT:
            # Ignore comments
            pass
        else:
            # Now grab the doc's value and match it against the given query value
            doc_v = nested_dict_get(doc, k)
            if not _match_doc_and_query_value(doc_v, v):
                return False

    # All top-level fields matched so return match
    return True


def _match_doc_and_query_value(doc_v, query_v):
    """Return whether the given doc and query values match."""

    cmps = []  # we AND these together below, trailing bool for negation

    # Check for operators
    if isinstance(query_v, Mapping):
        # To handle 'in' we use a tuple, otherwise we use an operator and a value
        for k, v in query_v.items():
            if k == Mongo.IN:
                cmps.append((operator.eq, tuple(v), False))
            elif k == Mongo.NIN:
                cmps.append((operator.eq, tuple(v), True))
            else:
                op = {Mongo.EQ: operator.eq, Mongo.GT: operator.gt, Mongo.GTE: operator.ge,
                      Mongo.LT: operator.lt, Mongo.LTE: operator.le, Mongo.NE: operator.ne}[k]
                cmps.append((op, v, False))
    else:
        # We expect a simple value here, perform an equality check
        cmps.append((operator.eq, query_v, False))

    # Now perform each comparison
    return all(_invert(_match_cmp(op, doc_v, v), invert) for op, v, invert in cmps)


def _invert(result, invert):
    """Invert the given result if necessary."""

    return not result if invert else result


def _match_cmp(op, doc_v, v):
    """Return whether the given values match with the given comparison operator.

    If v is a tuple then we require op to match with any element.

    We take care to handle comparisons with null the same way mongo does, i.e. only null ==/<=/>=
    null returns true, all other comps with null return false. See:
    https://stackoverflow.com/questions/29835829/mongodb-comparison-operators-with-null
    for details.

    As an important special case of null comparisons, ne null matches any non-null value.
    """

    if doc_v is None and v is None:
        return op in (operator.eq, operator.ge, operator.le)
    elif op is operator.ne and v is None:
        return doc_v is not None
    elif v is None:
        return False
    elif isinstance(v, tuple):
        return any(op(doc_v, x) for x in v)
    else:
        return op(doc_v, v)
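
To show how I use it, here is a hypothetical example (the query documents are made up, collection is a pymongo collection handle, and the Mongo constants are assumed to map to the usual "$gt"-style operator strings):

# Count matches for several queries with a single pass over documents
# that were fetched from MongoDB once.
queries = {
    "null_or_missing": {"field": None},
    "over_threshold": {"value": {"$gt": 10}},
}
docs = list(collection.find())  # one full collection scan
counts = {name: sum(1 for d in docs if check_doc_against_mongo_query(d, q))
          for name, q in queries.items()}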
nonagon
  • I don't want to accept this answer because I think there must be a better way. I thought this non-optimal answer would help clarify my question. – nonagon Aug 29 '18 at 12:58

Maybe you could try another approach? I mean, MongoDB performs really badly at counting, especially with big collections.

I had a pretty similar problem at my last company, and what we did was create some "counter" documents and update them on every update you perform on your data. This way, you avoid counting at all.

The document would be something like:

{
    query1count: 12,
    query2count: 512312,
    query3count: 6
}

If query1count is related to the query "all documents where userId = 13", then in your Python layer you can check, before creating/updating a document, whether userId = 13, and if so increase the desired counter.

It will add a lot of extra complexity to your code, but reads of the counters will be performed in O(1).

Of course, not all queries may be that easy, but you can reduce the execution time a lot with this approach.
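
For example, a hypothetical sketch of the counter update in the Python layer with pymongo ($inc keeps the increment atomic); the collection and counter names are invented:

from pymongo import MongoClient

db = MongoClient().mydb

def insert_with_counters(doc):
    """Insert a document and bump any counters whose condition it matches."""
    db.events.insert_one(doc)
    if doc.get("userId") == 13:  # the "query1" condition from the example above
        db.counters.update_one({"_id": "query1count"},
                               {"$inc": {"value": 1}},
                               upsert=True)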

Juan Bermudez
  • Agreed that in theory I could perform the counting incrementally as you describe. In my case I think interpreting the queries locally will be much easier to maintain. – nonagon Sep 04 '18 at 15:18