I have a very large dataset in MongoDB which I query and append, document by document, to a resulting pandas DataFrame:
for tree in db.im_tree_active.find(
        {"date": {'$gte': startdate, '$lte': enddate},
         "depth": {'$gte': 1, '$lte': 4}},
        no_cursor_timeout=True).batch_size(1500):
    if count % 1000 == 0:
        print(count, tot)
    # keyFill(keylist, tree)   <-- added to compensate for mismatched columns
    # im = im.append(tree)     <-- ran too slowly
    im.loc[count, :] = tree    # <-- runs much faster, but keyFill() slows it down
    count += 1
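For context, the setup around this loop looks roughly like the sketch below. The connection details, date range, and column names are placeholders rather than the real values:

from datetime import datetime
from pymongo import MongoClient
import pandas as pd

# Placeholder connection and date range -- not the real values.
client = MongoClient("mongodb://localhost:27017")
db = client["im_db"]
startdate = datetime(2020, 1, 1)
enddate = datetime(2020, 12, 31)

# keylist holds every column the resulting DataFrame should have
# (field names here are placeholders).
keylist = ["date", "depth", "field_a", "field_b"]

# Empty DataFrame to fill row by row, plus the counters used in the loop.
im = pd.DataFrame(columns=keylist)
count = 0
tot = db.im_tree_active.count_documents(
    {"date": {'$gte': startdate, '$lte': enddate},
     "depth": {'$gte': 1, '$lte': 4}})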
Using the pandas .append() function created a copy of the DataFrame on every iteration, which took far too long once the DataFrame grew large. I replaced the append statement with a .loc[] assignment, which I read should speed up the query a bit; however, I now receive a mismatched-column error. This happens because some of the trees iterated through in MongoDB are missing fields that other trees have. I fixed this by adding a function keyFill(), given by the following simple code:
def keyFill(keylist, tree):
    for key in keylist:
        if key not in tree.keys():
            tree[key] = ""
    return tree
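For instance, with a hypothetical keylist, a document that is missing one of the expected fields just gets it filled with an empty string:

keylist = ["date", "depth", "label"]       # hypothetical column list
tree = {"date": "2020-01-01", "depth": 2}  # document missing "label"

keyFill(keylist, tree)
# tree is now {"date": "2020-01-01", "depth": 2, "label": ""}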
However, running this before every single .loc[] assignment slows the whole query down by roughly 1000% (estimated).
Is there a way to speed this whole process up? The loop runs a lot quicker for roughly the first half of the dataset, then keeps slowing down until the last 1000 trees take nearly 10x as long to append as the first 1000.
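Those numbers are rough estimates; a sketch of how the per-1000-row time could be measured is below (same loop as above, just with a timer wrapped around the existing counter print):

import time

count = 0
last = time.time()
for tree in db.im_tree_active.find(
        {"date": {'$gte': startdate, '$lte': enddate},
         "depth": {'$gte': 1, '$lte': 4}},
        no_cursor_timeout=True).batch_size(1500):
    if count % 1000 == 0 and count > 0:
        now = time.time()
        print(count, tot, "-- last 1000 rows took %.1f s" % (now - last))
        last = now
    im.loc[count, :] = tree  # same row assignment as in the loop above
    count += 1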