I'm currently working on a project that involves visualizing an ever-growing database of over 100 million entries (tweets), and I'm running into some bottlenecks in Python that I'm not sure how to tackle.
Some details:
The database is indexed on all fields I am querying on, including time and text.
Each entry in the collection contains a very large and complex structure, with nearly 100 nested fields.
I am only projecting a small number of columns, as the visualization only requires a fraction of the stored data (sketched below).
Fields queried for are of types String, float32/64, date, and id.
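To make the projection concrete, it looks something like this minimal PyMongo sketch (the connection details, and the field names other than created_at and text, are placeholders rather than my real schema):

from datetime import datetime

from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # assumed local connection
coll = client['streamingTwitterDB']['streamingTwitterDB']

# Pull only the handful of fields the visualization needs; nested fields are
# addressed with dot notation. user.screen_name is an example placeholder.
cursor = coll.find(
    {'created_at': {'$gte': datetime(2017, 9, 9), '$lt': datetime(2017, 9, 10)}},
    {'_id': 0, 'created_at': 1, 'text': 1, 'user.screen_name': 1}
)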
When I query this data for a given date range in the Mongo shell, the processing times are more than acceptable; however, the same queries made from Python take orders of magnitude longer. While I think I have a decent understanding of WHY this happens, I don't have enough knowledge on the matter to find a solution.
I have tried both PyMongo and Monary, with disappointing results.
Are there any obvious solutions to bring the processing time in Python closer to the time in the Mongo shell? Ideas I have thought of include having Mongo save the query results into a separate collection before transferring them to Python, and trying a JavaScript-based solution instead of Python/Pandas.
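For the first of those ideas, I picture something like the sketch below: let Mongo $match/$project once on the server and $out the slimmed-down result into a helper collection that Python then reads. The collection name viz_slice and the local connection are assumptions, and the real filter would use the UTC-converted start/end times shown further down:

from datetime import datetime

from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # assumed local connection
coll = client['streamingTwitterDB']['streamingTwitterDB']

# Filter and project server-side, then materialise the result in a small,
# flat helper collection ('viz_slice' is a hypothetical name).
coll.aggregate([
    {'$match': {'created_at': {'$gt': datetime(2017, 9, 9, 0, 0),
                               '$lte': datetime(2017, 9, 9, 0, 10)}}},
    {'$project': {'_id': 0, 'created_at': 1}},
    {'$out': 'viz_slice'}
])

# Python would then read the much smaller collection instead of the raw tweets.
rows = list(client['streamingTwitterDB']['viz_slice'].find())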
This query (over a 10-minute window) using Monary returns 2878 rows and takes 76 seconds:
import datetime
import numpy
import pandas as pd
from dateutil import tz
from monary import Monary

mon = Monary()           # connection to the local mongod (host/port assumed)
timeZone = tz.tzlocal()  # local timezone; set up earlier in the full script

# Build a timezone-aware 10-minute window and convert it to UTC for the query.
start_time = datetime.datetime.strptime(
    '2017-09-09 00:00:00', '%Y-%m-%d %H:%M:%S'
).replace(tzinfo=timeZone).astimezone(tz.tzutc())
end_time = datetime.datetime.strptime(
    '2017-09-09 00:10:00', '%Y-%m-%d %H:%M:%S'
).replace(tzinfo=timeZone).astimezone(tz.tzutc())

# Only one field is projected; Monary needs a matching type for each column.
columns = ['created_at']
types = ['date']

arrays = mon.query(
    'streamingTwitterDB',    # database
    'streamingTwitterDB',    # collection
    {'created_at': {'$gt': start_time, '$lte': end_time}},
    columns,
    types
)

# Stack the returned column arrays and load them into a pandas DataFrame.
df = numpy.matrix(arrays).transpose()
df = pd.DataFrame(df, columns=columns)
In the Mongo shell I can query over an hour of data almost instantaneously.
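For comparison, the PyMongo equivalent of the Monary query above would look roughly like this (the connection setup is an assumption; start_time and end_time are the UTC datetimes built above):

import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # assumed local connection
coll = client['streamingTwitterDB']['streamingTwitterDB']

# Same filter and single-field projection as the Monary query above.
cursor = coll.find(
    {'created_at': {'$gt': start_time, '$lte': end_time}},
    {'_id': 0, 'created_at': 1}
)
df = pd.DataFrame(list(cursor))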