
I have a JSON "database" - it's a Python list of dicts parsed from JSON objects:

[{'_id': 'TRANSACTION0', 'Offer': {'From': 'merchant1', 'To': 'customer1', 'Item': 'Car', 'Price': 1000, 'Timestamp': 2}, 'Accept': {'Quantity': 1, 'Address': '123 Fake Street', 'Timestamp': 5}},
{'_id': 'TRANSACTION1', 'Offer': {'From': 'merchant1', 'To': 'customer2', 'Item': 'Computer', 'Price': 500, 'Timestamp': 5}},
{'_id': 'TRANSACTION2', 'Offer': {'From': 'merchant3', 'To': 'customer3', 'Item': 'Garbage bin', 'Price': 10, 'Timestamp': 0}, 'Accept': {'Quantity': 2, 'Address': '456 MadeUp Road', 'Timestamp': 1}},
{'_id': 'TRANSACTION3', 'Offer': {'From': 'merchant2', 'To': 'customer1', 'Item': 'Car', 'Price': 2000, 'Timestamp': 3}, 'Accept': {'Quantity': 2, 'Address': 'The White House', 'Timestamp': 3}},
{'_id': 'TRANSACTION4', 'Offer': {'From': 'merchant3', 'To': 'customer3', 'Item': 'Pens', 'Price': 2, 'Timestamp': 0}, 'Accept': {'Quantity': 4, 'Address': 'Houses of Parliment', 'Timestamp': 1}},
{'_id': 'TRANSACTION5', 'Offer': {'From': 'merchant4', 'To': 'customer1', 'Item': 'Headphones', 'Price': 200, 'Timestamp': 4}},
{'_id': 'TRANSACTION6', 'Offer': {'From': 'merchant1', 'To': 'customer2', 'Item': 'Water Bottle', 'Price': 1, 'Timestamp': 1}, 'Accept': {'Quantity': 3, 'Address': 'Timbuktu', 'Timestamp': 14}},
{'_id': 'TRANSACTION7', 'Offer': {'From': 'merchant2', 'To': 'customer3', 'Item': 'Laptop', 'Price': 900, 'Timestamp': 0}},
{'_id': 'TRANSACTION8', 'Offer': {'From': 'merchant4', 'To': 'customer1', 'Item': 'Chair', 'Price': 80, 'Timestamp': 3}, 'Accept': {'Quantity': 1, 'Address': 'Mordor', 'Timestamp': 3}},
{'_id': 'TRANSACTION9', 'Offer': {'From': 'merchant3', 'To': 'customer3', 'Item': 'Garbage bin', 'Price': 5, 'Timestamp': 2}, 'Accept': {'Quantity': 2, 'Address': 'The wall', 'Timestamp': 2}}]

My intention is to use queries, which will be stored in dictionaries, against this database. In this example, the dictionary contains:

a_dict = {"query1": "'Offer' and 'Accept'"}

Note that the dictionary will contain more queries, and also more complicated queries (e.g. (cond1 and cond2) or (cond2 and cond3)), but I need to understand why Python is doing what it's doing (and how to overcome it) as opposed to solely what the solution is.

I need to have something which evaluates and runs query1 correctly. My wrongful implementation is currently:

if eval(a_dict["query1"] + " in i"):

This is the same as:

if 'Offer' and 'Accept' in i:

Because of operator precedence, this is parsed as 'Offer' and ('Accept' in i); since 'Offer' is a non-empty (truthy) string, the whole expression reduces to checking only whether 'Accept' is in i. In this example, every time there is an Accept there is an Offer, but this may not always be the case.
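To see this concretely, here is a minimal demonstration with a hypothetical record that has an 'Accept' key but no 'Offer' key:

```python
# A hypothetical record with an 'Accept' key but no 'Offer' key
record = {'_id': 'TRANSACTION_X', 'Accept': {'Quantity': 1}}

# Parsed as: 'Offer' and ('Accept' in record).
# 'Offer' is a non-empty string (truthy), so the expression
# reduces to just ('Accept' in record).
wrong = 'Offer' and 'Accept' in record          # True -- 'Offer' never checked

# The correct form tests membership for each key separately.
right = 'Offer' in record and 'Accept' in record  # False

print(wrong, right)
```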

A rightful if statement would be:

if 'Offer' in i and 'Accept' in i:

However, this isn't easily composable from the type of potential queries I would have. Ideally, I'd like an elegant solution that is "plug and play", similar to my eval if statement given above.

Is there any way to take a particular query from a dictionary, plug it into an if statement, and then run that if statement as I'm intending (under the assumption that all the queries make logical sense)?

https://www.python.org/dev/peps/pep-0308/ This PEP says FAQ 4.16 gives alternatives, but I can't seem to find that FAQ anywhere.

Nihir

3 Answers


Please don't use eval to do queries. This is guaranteed to blow up in your face when you don't expect it. Maybe you've heard of SQL injections; the security implications of using eval to build queries are huge.

A filter-based query system

Instead, begin by writing filter functions for common queries. This will also solve your problem and provide a "plug-and-play" way to compose queries.

Here's a pointer as to how to implement it:

Think of a query as a function which takes as arguments a few literal values (and, implicitly, a set of records), and returns a resultset of records. Ditching the list and using the set datatype for the resultset, keyed by your record id, will increase performance a lot.

Then an "AND" becomes a function which takes two (or more) sets of records and builds the set intersection of them, and an "OR" becomes a function which takes two (or more) sets of records and builds the union of them. (NOT would be the set difference between the whole set of records and one or more subsets).
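With hypothetical resultsets of record ids, those set operations look like this:

```python
# Hypothetical resultsets of record ids returned by two sub-queries
has_offer = {'TRANSACTION0', 'TRANSACTION1', 'TRANSACTION2'}
has_accept = {'TRANSACTION0', 'TRANSACTION2'}
all_ids = {'TRANSACTION0', 'TRANSACTION1', 'TRANSACTION2'}

q_and_result = has_offer & has_accept   # AND -> intersection
q_or_result = has_offer | has_accept    # OR  -> union
q_not_result = all_ids - has_accept     # NOT -> difference from the full set

print(q_and_result)   # {'TRANSACTION0', 'TRANSACTION2'}
print(q_not_result)   # {'TRANSACTION1'}
```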

If you build your functions this way, a query will become a simple tree of function calls, such as:

result = q_and(q_or(q_merchant_is('merchant2'), 
                    q_address_is('123 FakeStreet')), 
               q_quantity_above(3))

(Formatted for better legibility)

It's not that hard to write a parser for a simple query language that will build such a query. But if you don't need to provide a frontend for end users, you might not need your own query language at all: the Python representation of the query as seen above is simple and clear enough.

And if you do need to represent your queries as dictionaries, choose a structure that closely mimics the final structure of the query call tree; then it's trivial to write a query_builder function that turns one of your dict queries into a function that runs the tree of query function calls when called.

Note: As you can see, q_merchant_is, q_quantity_above etc. don't take a set of records to filter. You can fix this by making a Query class and setting the full set as an instance attribute, so that each query method has access to the full recordset if it needs it:

class Query(object):
    def __init__(self, all_records):
        self.records = all_records

    def merchant_is(self, name):
        result = set()
        for record in self.records:
            if record['Offer']['From'] == name:
                result.add(record['_id'])
        return result

    def q_and(self, *args):
        # intersect all the resultsets passed in
        result = args[0]
        for other in args[1:]:
            result = result.intersection(other)
        return result
    ...

q = Query(my_full_record_set)
result = q.q_and(q.q_or(q.merchant_is('merchant2').........))

Performance and indices

You see that each query function that queries for a literal value basically scans over the whole dataset to filter it. If your query contains many such searches for literal parts, you'll scan your dataset multiple times. For large datasets, this can become prohibitive.

A simple solution would be to index the fields you want to query against in one dict per field. This would speed up the query by orders of magnitude, but if your data changed, you'd need to make sure to keep the indices up to date.
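As a sketch of such an index (assuming the record list is called records, and indexing the merchant field):

```python
from collections import defaultdict

# A small stand-in for the full record list
records = [
    {'_id': 'TRANSACTION0', 'Offer': {'From': 'merchant1'}},
    {'_id': 'TRANSACTION2', 'Offer': {'From': 'merchant3'}},
    {'_id': 'TRANSACTION4', 'Offer': {'From': 'merchant3'}},
]

# Build a value -> set-of-ids index once, up front.
merchant_index = defaultdict(set)
for record in records:
    merchant_index[record['Offer']['From']].add(record['_id'])

# A query for a merchant is now a single dict lookup instead of a scan.
print(merchant_index['merchant3'])  # {'TRANSACTION2', 'TRANSACTION4'}
```

Remember that the index must be rebuilt (or updated in place) whenever a record is added, removed or modified.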

Classifier query system

Another solution would be to build your query functions as classifiers instead of filters, meaning that merchant_is would take a literal value and a record and answer True or False, depending on whether the record contained that literal value in the right field. We could make this all work efficiently by having factory functions which build a composite query.

The example query from the filter section would then become:

query = q_and(q_or(q_merchant_is('merchant2'),
                   q_address_is('123 FakeStreet')),
              q_quantity_above(3))
result = perform_query(query, all_my_records)

q_merchant_is would turn into the following:

def q_merchant_is(literal):
    # the field names match the sample records: 'Offer' -> 'From'
    return lambda record: record['Offer']['From'] == literal

Note how you're returning a function that, when called with a record, will classify it.
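For instance (restating the factory here so the snippet is self-contained; the field names 'Offer' and 'From' match the sample records):

```python
def q_merchant_is(literal):
    # factory: returns a classifier closed over `literal`
    return lambda record: record['Offer']['From'] == literal

is_merchant2 = q_merchant_is('merchant2')

print(is_merchant2({'Offer': {'From': 'merchant2'}}))  # True
print(is_merchant2({'Offer': {'From': 'merchant1'}}))  # False
```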

q_or might look like this:

def q_or(*args):
    def or_template(record):
        for classifier in args:
            if classifier(record):
                return True
        return False
    return or_template

or a bit terser (using a generator expression so that any() still short-circuits; with a list comprehension, every classifier would run before any() sees the results):

def q_or(*args):
    return lambda record: any(classifier(record) for classifier in args)

q_or now returns a function that runs a number of classifiers against the record passed as an argument and returns True if at least one of the classifiers returns True. q_and works just like q_or except that it only returns True if every classifier returns True. And q_not would simply return True if its classifier returned False, and vice versa.
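Following the same pattern, q_and and q_not might look like this (a sketch mirroring the q_or above):

```python
def q_and(*args):
    # True only if every classifier accepts the record
    return lambda record: all(classifier(record) for classifier in args)

def q_not(classifier):
    # inverts a single classifier
    return lambda record: not classifier(record)
```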

Now all you need is:

def perform_query(query, all_records):
    # filter returns a lazy iterator in Python 3, so materialize it
    return list(filter(query, all_records))

This will only iterate over your dataset a single time and is pretty much as efficient as it can get in python without involving eval, compile and exec, but it's somewhat harder to understand than the filter approach.

However, this isn't easily composable from the type of potential queries I would have. Ideally, I'd like to have an elegant solution which was "plug and play"

With both the filter and the classifier systems, it is easy to extend the system with new query elements. In the filter example, you add a method to your Query class. In the classifier example, you add a query function builder like the one I wrote for q_merchant_is. Usually that involves two lines of python code.
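Putting the classifier pieces together against two of the sample records (a sketch; q_quantity_above is a hypothetical helper built the same way as q_merchant_is):

```python
def q_merchant_is(literal):
    return lambda record: record['Offer']['From'] == literal

def q_quantity_above(n):
    # records without an 'Accept' part have no quantity at all
    return lambda record: 'Accept' in record and record['Accept']['Quantity'] > n

def q_or(*args):
    return lambda record: any(classifier(record) for classifier in args)

def perform_query(query, all_records):
    return list(filter(query, all_records))

records = [
    {'_id': 'TRANSACTION3', 'Offer': {'From': 'merchant2', 'Price': 2000},
     'Accept': {'Quantity': 2}},
    {'_id': 'TRANSACTION5', 'Offer': {'From': 'merchant4', 'Price': 200}},
]

query = q_or(q_merchant_is('merchant2'), q_quantity_above(3))
print([r['_id'] for r in perform_query(query, records)])  # ['TRANSACTION3']
```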

Pascal

There is no function or module that will automagically parse your queries the way you want. You'll have to write your own parser and implement the evaluation logic yourself.

There are modules that can help you parse the query string; for example, pyparsing. If the query syntax isn't overly complex, you can probably also implement your own parser with simple string operations or with regular expressions (the re module).
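For instance, if queries were restricted to plain conjunctions of quoted key names, a few lines of string handling would do (a minimal sketch, not a general parser; anything with 'or', parentheses or nesting needs a real one):

```python
def run_conjunction_query(query, record):
    # Handles only queries of the form "'Offer' and 'Accept'"
    keys = [part.strip().strip("'") for part in query.split(" and ")]
    return all(key in record for key in keys)

record = {'_id': 'TRANSACTION1', 'Offer': {'From': 'merchant1'}}

print(run_conjunction_query("'Offer' and 'Accept'", record))  # False
print(run_conjunction_query("'Offer'", record))               # True
```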

Whatever you end up using: Do not use eval.

Aran-Fey

I really doubt you will find a simple "plug and play" solution here. The best you could do would be to implement a proper minilanguage (parser & interpreter) for your queries.

The good news is that it might not be that difficult. If you already have working experience writing parsers and interpreters/compilers, then Python has no shortage of builtin and third-party tools; pick one and go.

Otherwise, there's an excellent Python tutorial series on Ruslan Spivak's blog named "Let's Build a Simple Interpreter" that will guide you through the whole process of creating a simple Pascal parser and interpreter in Python, explaining the terminology etc.

bruno desthuilliers