
I have a fairly large Python package that interacts synchronously with a third-party API server and carries out various operations with the server. Additionally, I am now starting to collect some of the data for future analysis by pickling the JSON responses. After profiling several serialisation/database methods, using pickle was the fastest in my case. My basic pseudo-code is:

while True:
    do_existing_api_stuff()

    # additional data pickling
    data = {'info': []}  # there are multiple keys in the real version!
    if pickle_file_exists:
        data = unpickle_file()
    data['info'].append(new_data)
    pickle_data(data)
    if len(data['info']) >= 100:  # file size limited for read/write speed
        create_new_pickle_file()

    # intensive section...
    # move files from "wip" (Work In Progress) dir to "complete"
    if number_of_pickle_files >= 100:
        compress_pickle_files()  # with lzma
        move_compressed_files_to_another_dir()

My main issue is that compressing and moving the files takes several seconds to complete and is therefore slowing my main loop. What is the easiest way to call these functions in a non-blocking way, without any major modifications to my existing code? I do not need any return value from the function, but I do need to know if it raises an error. Another "nice to have" would be for the pickle.dump() call to also be non-blocking; again, I am not interested in the return beyond "did it raise an error?". I am aware that unpickle/append/re-pickle on every loop is not particularly efficient, but it avoids data loss when the API drops out due to connection issues, server errors, etc.

I have zero knowledge of threading, multiprocessing, asyncio, etc., and after much searching I am currently more confused than I was two days ago!

FYI, all of the file related functions are in a separate module/class, so that could be made asynchronous if necessary.

EDIT: There may be multiple calls to the above functions, so I guess some sort of queuing will be required?

Birchy
  • Easiest solution is probably the `threading` standard library package. This will allow you to spawn threads to do the pickling while your main loop continues. However, if you are already maxing out your disk I/O, this won't actually make the program as a whole any faster; you'll just end up with the main loop finishing earlier and the data buffered in memory until the disk operations complete. Having said that, there is almost certainly quite a bit of 'dead time' in your existing loop waiting for the API to respond, so you could gain a fair bit. – Simon Notley Jun 28 '20 at 21:52
  • @SimonN Is pickle thread-safe? The API often sits waiting for a server response, but the intensive stuff then blocks the loop and reduces the API requests unnecessarily. – Birchy Jun 28 '20 at 21:58
  • This is a bit broad; are there not already plenty of resources on threading/asynchronous programming in Python? – AMC Jun 28 '20 at 22:10
  • @Birchy: Don't put the pickling in a thread (it's likely a trivial part of your runtime anyway). LZMA compression takes *forever* and can run in a background thread just fine. – ShadowRanger Jun 28 '20 at 22:11
  • I think you should check this; it might help you: https://stackoverflow.com/questions/52786328/run-two-async-functions-without-blocking-each-other – MrRedbloX Jun 28 '20 at 22:18

1 Answer


The easiest solution is probably the `threading` standard library package. This will allow you to spawn a thread to do the compression while your main loop continues.

There is almost certainly quite a bit of 'dead time' in your existing loop waiting for the API to respond, and conversely there is quite a bit of time spent doing the compression when you could usefully be making another API call. For this reason, I'd suggest separating these two aspects. There are lots of good tutorials on threading, so I'll just describe a pattern you could aim for (a sketch follows the list):

  • Keep the API call and the pickling in the main loop, but add a step which passes the path of each pickle file to a queue after it is written

  • Write a function which takes the queue as its input and works through the file paths, performing the compression

  • Before starting the main loop, start a thread with the new function as its target
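Here is a minimal sketch of that pattern. It assumes the "complete" directory from your question; the directory path, the queue name and the None shutdown sentinel are illustrative, and the worker logs any exception rather than returning a value, which matches your "did it raise an error?" requirement:

import logging
import lzma
import queue
import shutil
import threading
from pathlib import Path

COMPLETE_DIR = Path("complete")  # illustrative; match your real layout

file_queue = queue.Queue()

def compression_worker(q):
    # Runs in a background thread: take each pickle file path off the
    # queue, compress it with lzma, then move it to the "complete" dir.
    while True:
        path = q.get()
        if path is None:  # sentinel value tells the worker to exit
            q.task_done()
            break
        try:
            compressed = Path(str(path) + ".xz")
            with open(path, "rb") as src, lzma.open(compressed, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()  # delete the uncompressed original
            shutil.move(str(compressed), str(COMPLETE_DIR / compressed.name))
        except Exception:
            # No return value is needed; just make failures visible.
            logging.exception("compression failed for %s", path)
        finally:
            q.task_done()

worker = threading.Thread(target=compression_worker, args=(file_queue,), daemon=True)
worker.start()

# In the main loop, after each pickle file is finished
# (paths should be pathlib.Path objects):
#     file_queue.put(path_to_finished_pickle)
# On shutdown:
#     file_queue.put(None)   # send the sentinel
#     file_queue.join()      # wait for outstanding compression work

Because every producer pushes onto the same queue and a single worker consumes it, multiple calls from the main loop are serialised automatically, which covers the queuing concern in your edit.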

Simon Notley