
I have created a Pandas dataframe:

    import pandas as pd

    scores = pd.DataFrame(
        {"batch_size": list(range(64)),
         "learning_rate": list(range(64)),
         "dropout_rate": list(range(64)),
         # independent lists per row ([[0]] * 64 would create 64 references to one list)
         "accuracies": [[0] for _ in range(64)],
         "loss": [[0] for _ in range(64)],
         "training_time": list(range(64)),
         }, index=list(range(64)))

Then, in a loop, I run 64 models and append the results to the lists.
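The loop looks roughly like this (a sketch; `train_model` is a placeholder for my actual training code):

    def train_model(params):
        # Placeholder for the real training run; returns accuracy, loss, seconds.
        return 0.9, 0.1, 120

    for i in range(64):
        acc, loss, seconds = train_model(scores.loc[i])
        scores.at[i, "accuracies"].append(acc)
        scores.at[i, "loss"].append(loss)
        scores.at[i, "training_time"] = seconds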

The loop is still running and I don't expect it to finish before my deadline. Therefore, I would like to stop the loop and continue with the information that has been stored in `scores` so far. However, I only want to do this if I can still access the dataframe after terminating the loop.

Can I use the dataframe with intermediate results if I terminate the loop while it's still running?

Emil
  • How are you planning on terminating the loop? Are you saving the DF to a temp file or something while the loop is running, or is it just in memory? How do you plan on accessing the DF later? – MattDMo Jul 07 '20 at 17:57
  • It's just in memory now. I would like to export it to CSV afterwards. – Emil Jul 07 '20 at 18:01
  • You should have a look at the `pandas.DataFrame.to_csv` method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html – pyOliv Jul 07 '20 at 20:24
  • Store a pointer to which models have been executed (1, 2, 3, ...), save the results one by one, and assemble the dataframe after you have all the results? (See the sketch below these comments.) – Evgeny Jul 07 '20 at 21:37
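
For reference, the comments above amount to something like this: interrupt the loop with Ctrl+C (a `KeyboardInterrupt`) instead of killing the console, so the partially filled dataframe stays in memory, then export it (a sketch; `run_model` is a hypothetical stand-in for one training run):

    def run_model(i):
        # Hypothetical stand-in for one full training run; returns a training time.
        return i * 2

    try:
        for i in range(64):
            scores.at[i, "training_time"] = run_model(i)
    except KeyboardInterrupt:
        pass  # scores keeps every row filled in before the interrupt

    scores.to_csv("intermediate_scores.csv")  # export the intermediate results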

1 Answer

  1. If possible, I would prioritize pandas methods over for loops, as that addresses the core problem. Even better, if you can replace the for loops with pandas methods and want faster execution still, many pandas operations are also available in Dask, a Python library for big data. It is a little more advanced, but I was in a similar position on a large project and Dask was a great solution; it took a day or so to get used to the library and port my code from pandas to Dask.
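
     For illustration, a minimal sketch with `dask.dataframe` (assuming `dask` is installed; the column operation is only a placeholder):

    import dask.dataframe as dd

    # Split the pandas frame into partitions that Dask can process in parallel.
    ddf = dd.from_pandas(scores, npartitions=4)
    # Operations are lazy; doubling a column here is just a placeholder.
    ddf["training_time"] = ddf["training_time"] * 2
    scores_fast = ddf.compute()  # materialize the result back into pandas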

  2. If you want to keep your code as is and stay in pandas, and it is still taking too long to process, then I would look into splitting the dataframe into chunks:

    n = 100000  # rows per chunk
    scores_df_list = [scores[i:i + n] for i in range(0, scores.shape[0], n)]
    for i, df in enumerate(scores_df_list, start=1):
        # ...run the expensive per-chunk work here...
        df.to_csv(f'file{i}.csv')  # save each chunk as soon as it is done
    
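To reassemble the pieces afterwards, the chunk files can be read back and concatenated (a sketch, assuming the `file{i}.csv` naming used above):

    import pandas as pd

    # Read the chunks back in index order and stitch them into one frame.
    parts = [pd.read_csv(f'file{i}.csv', index_col=0)
             for i in range(1, len(scores_df_list) + 1)]
    combined = pd.concat(parts)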

See the answer by @ScottBoston here, and kindly upvote his solution if it helps: Pandas - Slice Large Dataframe in Chunks.

David Erickson
  • This is not exactly an answer to my question, but thanks for the suggestion, as I was not aware of the `dask` library. Also, I was doing a grid search and stored the intermediate results in my dataframe, so I think a for loop was the best approach here. – Emil Jul 09 '20 at 12:11