
I'm currently working on a project that involves visualizing an ever-growing database of over 100 million entries (tweets), and I'm running into some bottlenecks in Python that I'm not sure how to address.

Some details:

  1. The database is indexed on all of the fields I query on, including time and text.

  2. Each entry in the collection contains a very large and complex structure, with nearly 100 nested fields.

  3. I am projecting only a small number of fields, as the visualization requires only a fraction of the stored data (see the sketch after this list).

  4. Fields queried for are of types String, float32/64, date, and id.
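
For context, the projection in point 3 looks roughly like this in PyMongo; a minimal sketch that assumes a local connection, with 'text' standing in for whichever extra fields the visualization uses:

    from pymongo import MongoClient

    # Assumed local connection; the real host/port are not shown here.
    client = MongoClient()
    collection = client['streamingTwitterDB']['streamingTwitterDB']

    # Pull only the handful of fields the visualization needs, not the ~100 nested ones.
    cursor = collection.find(
        {},  # the date-range filter goes here per query
        {'_id': 0, 'created_at': 1, 'text': 1},  # projection: include only these fields
    )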

When I query this data for a given date range in the Mongo shell, the processing times are more than acceptable; however, the same queries made from Python take dramatically longer. While I think I have a decent understanding of WHY this happens, I don't know enough about the internals to find a solution.

I have tried both PyMongo and Monary, with disappointing results in each case.

Are there any obvious solutions for getting the processing time in Python closer to the time in the Mongo shell? Ideas I have considered include having Mongo save the query results into a separate collection before transferring them to Python (sketched below), and trying a JavaScript-based solution instead of Python/pandas.
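
For the first idea, a rough sketch of what I mean: an aggregation that filters and projects on the server and writes the result to a small side collection before Python reads it. It reuses the database/collection names and the start_time/end_time from the query below; the output collection name is made up:

    from pymongo import MongoClient

    client = MongoClient()  # assumed local connection
    db = client['streamingTwitterDB']

    # Filter and project server-side, writing the slimmed-down result to a side
    # collection that Python can read without touching the huge nested documents.
    db['streamingTwitterDB'].aggregate([
        {'$match': {'created_at': {'$gt': start_time, '$lte': end_time}}},
        {'$project': {'_id': 0, 'created_at': 1}},
        {'$out': 'tweets_slice'},  # hypothetical output collection name
    ])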

This query (over a 10-minute window) using Monary returns 2878 rows and takes 76 seconds.

    import datetime

    import numpy
    import pandas as pd
    from dateutil import tz
    from monary import Monary

    mon = Monary()           # assumes a local mongod; connection details were not shown
    timeZone = tz.tzlocal()  # 'timeZone' was defined elsewhere; local timezone assumed

    # Parse the window boundaries in local time, then convert them to UTC for the query.
    start_time = datetime.datetime.strptime(
        '2017-09-09 00:00:00', '%Y-%m-%d %H:%M:%S'
    ).replace(tzinfo=timeZone).astimezone(tz.tzutc())
    end_time = datetime.datetime.strptime(
        '2017-09-09 00:10:00', '%Y-%m-%d %H:%M:%S'
    ).replace(tzinfo=timeZone).astimezone(tz.tzutc())

    columns = ['created_at']
    types = ['date']  # renamed from 'type' to avoid shadowing the builtin

    # Monary returns one masked array per requested column.
    arrays = mon.query(
        'streamingTwitterDB',  # database
        'streamingTwitterDB',  # collection
        {'created_at': {'$gt': start_time, '$lte': end_time}},
        columns,
        types,
    )

    df = numpy.matrix(arrays).transpose()
    df = pd.DataFrame(df, columns=columns)

In the Mongo shell I can query over an hour of data almost instantaneously.

Kyle Robinson

1 Answer


Try our prototype BSON-NumPy library. It avoids the overhead of PyMongo (which must translate documents to dicts before you translate them to rows in a NumPy array), and also the overhead of Monary (which slows down on large documents due to an n-squared algorithm that matches field names to NumPy columns). If you have issues please let us know in the BSON-NumPy issue tracker.
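
Roughly, usage looks like this; a minimal sketch based on the quickstart in the BSON-NumPy docs, adapted to the collection names in the question. The int64 dtype for the BSON date field is an assumption, so check the docs for the exact type mapping:

    import bsonnumpy
    import numpy as np
    from bson import CodecOptions
    from bson.raw_bson import RawBSONDocument
    from pymongo import MongoClient

    # Fetch raw BSON so BSON-NumPy can decode it straight into a NumPy array,
    # skipping PyMongo's per-document dict construction.
    client = MongoClient()
    collection = client.get_database(
        'streamingTwitterDB',
        codec_options=CodecOptions(document_class=RawBSONDocument),
    )['streamingTwitterDB']

    # Describe only the fields you need; names must match the document fields.
    dtype = np.dtype([('created_at', np.int64)])  # assumed: BSON dates map to int64

    ndarray = bsonnumpy.sequence_to_ndarray(
        (doc.raw for doc in collection.find()), dtype, collection.count_documents({}))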

A. Jesse Jiryu Davis
    This module looks pretty promising! Did you do any timing comparisons? – MaxU - stand with Ukraine Sep 11 '17 at 22:15
  • Yes, it's faster, for the reasons given in the answer above. Please give it a try and let us know how it works for you! – A. Jesse Jiryu Davis Sep 12 '17 at 03:30
  • Does this library allow for querying on fields within a collection before converting? I can only find examples where data is projected or converted in its entirety. – Kyle Robinson Sep 12 '17 at 18:56
  • Yes. I've added an example that uses projection and filter to the BSON-NumPy documentation: http://bson-numpy.readthedocs.io/ – A. Jesse Jiryu Davis Sep 13 '17 at 13:20
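
Following up on that last comment, the projection-and-filter combination might look roughly like this for the date-range query in the question (same caveats as the sketch in the answer; the filter and projection arguments are standard PyMongo, and BSON-NumPy only sees the resulting raw documents):

    # Reuses 'collection' and 'dtype' from the earlier sketch, plus the
    # start_time/end_time boundaries from the question.
    query = {'created_at': {'$gt': start_time, '$lte': end_time}}
    projection = {'_id': 0, 'created_at': 1}

    ndarray = bsonnumpy.sequence_to_ndarray(
        (doc.raw for doc in collection.find(query, projection)),
        dtype,
        collection.count_documents(query),
    )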