3

I have a very large dataset in a mongoDB which is queried and appended to a resulting DataFrame.

for tree in db.im_tree_active.find({"date" : { '$gte' : startdate , 
'$lte' : enddate },"depth" : {'$gte' : 1, '$lte' : 4}, no_cursor_timeout = True).batch_size(1500):
    if count % 1000 == 0:
        print(count, tot)
    #keyFill(keylist, tree)  <-- added to compensate for mismatched columns
    #im = im.append(tree)  <-- ran too slowly
    im.loc[count, :] = tree  <-- runs much faster but keyFill() slows down
    count+=1

Using pandas .append() function created a copy of the dataframe, which took far too long when the DataFrame became much larger.

I replaced the append statement with a .loc[] statement which I read should speed up the query a bit, however I receive a mismatched column error. This is because some of the trees that are iterated through in the MongoDB don't have some of the fields which other ones do have. I fixed this by adding a function keyFill() given by the following simple code:

def keyFill(keylist, tree):
    for key in keylist:
        if key not in tree.keys():
            tree[key] = ""
    return tree

However running this before every single .loc[] call causes the query to slow down nearly 1000% (estimated).

Is there a way to speed this whole process up? The query runs a lot quicker before it reaches about 50% through the dataset, and then continues to slow down to the point that the last 1000 trees that it appends takes nearly 10x as long to run as the first 1000.

Jack Walsh
  • 562
  • 4
  • 14
  • 1
    Why do you fill the DataFrame in a loop? Can't you just directly construct it from find-result? In general it is very slow to fill a DataFrame row by row – CodeZero Jun 28 '18 at 16:10
  • 3
    Constantly appending to a DataFrame within a loop is inefficient. Instead, you should append the DataFrames to a list within the loop and use a single `pd.concat(list_of_dfs)` after the loop – ALollz Jun 28 '18 at 16:11
  • 2
    @ALollz your method works very quickly. However, `pd.concat()` would not work due to the trees actually being dictionaries, but simply putting `im = pd.DataFrame(list_of_dicts)` worked perfectly. Thanks! (I would mark your comment as answer but its a comment lol) – Jack Walsh Jun 28 '18 at 16:29
  • @JackWalsh Glad it worked! I went for the comment because I'd just be providing a less detailed answer than one that already exists like: https://stackoverflow.com/questions/31674557/how-to-append-rows-in-a-pandas-dataframe-in-a-for-loop – ALollz Jun 28 '18 at 16:32

0 Answers0