I have a large BigQuery table where the data at each timestamp is a JSON dictionary. When a user clicks "upload data" in the browser, an Ajax call tells Python to download from BigQuery, crack the JSON, apply various user-determined FFTs, etc., and then return the results for display. This all runs on Cloud Run with Flask.
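For context, this is roughly the shape of the endpoint. The table, column names (ts, payload, samples), and the single-FFT step are placeholders, not my real schema; it's just a sketch of the flow:

```python
import json
import numpy as np
from flask import Flask, request, jsonify
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()

@app.route("/process", methods=["POST"])
def process():
    params = request.get_json()  # user-determined FFT settings, time range, ...
    sql = """
        SELECT ts, payload            -- payload is the JSON dictionary
        FROM `my_project.my_dataset.raw_table`
        WHERE ts BETWEEN @start AND @end
    """
    job = bq.query(sql, job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start", "TIMESTAMP", params["start"]),
            bigquery.ScalarQueryParameter("end", "TIMESTAMP", params["end"]),
        ]
    ))
    df = job.to_dataframe()

    # "Crack" the JSON and apply a user-determined FFT per timestamp.
    cracked = df["payload"].apply(json.loads)
    spectra = [np.abs(np.fft.rfft(row["samples"])).tolist() for row in cracked]

    return jsonify({"ts": df["ts"].astype(str).tolist(), "spectra": spectra})
```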
My thought had been to cache the bulk of this work, so that only "recent" timestamps would need the cracking and FFTs while all of the old timestamps would already be prepped in the cache. But my attempts at caching are proving slower than just using BigQuery, so I'm wondering whether I can improve some design decisions.
I deploy the application via Docker and have ephemeral storage on /data_dir, which I assume is SSD storage. Given the transformed data, I write a Parquet file with >100,000 rows to that SSD and then upload it from the SSD to a bucket on Google Cloud Storage. When it's time to read the cache, I download from GCS to the SSD with blob.download_to_filename and load it with pq.read_table(...).to_pandas(). When I'm done and need to update the cache, I use pq.write_table to the SSD and blob.upload_from_filename to GCS. Access to/from the SSD seems quite fast, but GCS is relatively slow: about 10 seconds for ~20 MB.
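Concretely, the round trip looks like this (bucket and file names are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import storage

BUCKET = "my-cache-bucket"
LOCAL = "/data_dir/cache.parquet"

gcs = storage.Client()
blob = gcs.bucket(BUCKET).blob("cache.parquet")

# Read path: GCS -> ephemeral SSD -> pandas (~10 s for ~20 MB, dominated by GCS)
blob.download_to_filename(LOCAL)
df = pq.read_table(LOCAL).to_pandas()

# ... append the newly cracked/FFT'd rows to df ...

# Write path: pandas -> ephemeral SSD -> GCS
pq.write_table(pa.Table.from_pandas(df), LOCAL)
blob.upload_from_filename(LOCAL)
```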
I'm confused about what the rules are for Docker container instances. I read that ephemeral files will not survive across multiple instances. But could I simply abandon GCS and just use the ephemeral volume? I.e., if one Ajax call creates the thread and saves a cache file on /data_dir, can I assume that any future Ajax call will be able to access that same file? All of the Ajax calls go to the same master Flask Python application, if that helps.
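In other words, the pattern I have in mind is roughly this (same placeholder path as above), where a later request simply trusts the local file and only rebuilds on a miss:

```python
import os
import pyarrow.parquet as pq

LOCAL = "/data_dir/cache.parquet"

def load_cache():
    # Would only survive within the same container instance, which is what I'm unsure about.
    if os.path.exists(LOCAL):
        return pq.read_table(LOCAL).to_pandas()
    return None  # cache miss: rebuild from BigQuery (or pull from GCS)
```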
What would be other alternatives for making a cache available to future JavaScript calls?
Thanks, T.