
I have a large BigQuery table where the data is a JSON dictionary at each timestamp. When a user clicks "upload data" in the browser, an Ajax request tells Python to download from BigQuery, crack open the JSON, apply various user-determined FFTs, etc., and then present the data back to the user for display. This is all running on Cloud Run and Flask.
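
Roughly, the flow looks something like this (a minimal sketch; the table, column, and field names are placeholders rather than my real schema):

import json

import numpy as np
from flask import Flask, jsonify, request
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()

@app.route("/process", methods=["POST"])
def process():
    well = request.json["well"]          # user-selected data set
    sql = """
        SELECT ts, payload
        FROM `my-project.my_dataset.readings`   -- placeholder table
        WHERE well = @well
        ORDER BY ts
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("well", "STRING", well)])
    rows = list(bq.query(sql, job_config=job_config))
    signal = np.array([json.loads(r.payload)["value"] for r in rows])  # crack the JSON
    spectrum = np.abs(np.fft.rfft(signal))                             # user-determined FFT step
    return jsonify(spectrum.tolist())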

My thought had been to cache the bulk of this activity, so that only "recent" timestamps would need the cracking and FFTs, while all of the old timestamps would already be prepped in the cache.

But my attempts at caching are proving slower than just using BigQuery. I'm wondering if I can improve some design decisions.

I deploy the application via Docker and have ephemeral storage on /data_dir, which I believe is SSD storage. Given the transformed data, I write a Parquet file with >100,000 rows to that SSD and then upload it from the SSD to a bucket on Google Cloud Storage.

When it comes time to access the cache, I download from GCS to the SSD with blob.download_to_filename and read it with pq.read_table(...).to_pandas(). When I'm done, I update the cache with pq.write_table to the SSD and blob.upload_from_filename back to GCS. Access to/from the SSD seems quite fast, but GCS is relatively slow: about 10 seconds for ~20 MB.
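
For reference, the round trip is essentially this (a minimal sketch using google-cloud-storage and pyarrow; the bucket and file names are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-cache-bucket").blob("cache_well1.parquet")  # placeholder names
local_path = "/data_dir/cache_well1.parquet"                         # ephemeral SSD path

# Read the cache: GCS -> local SSD -> pandas (this is the ~10 s step)
blob.download_to_filename(local_path)
df = pq.read_table(local_path).to_pandas()

# ... append the newly cracked/FFT'd rows to df ...

# Write the cache back: pandas -> local SSD -> GCS
pq.write_table(pa.Table.from_pandas(df), local_path)
blob.upload_from_filename(local_path)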

I'm confused about the rules for Docker container instances. I read that ephemeral files will not survive across multiple instances. But could I simply abandon GCS and just use the ephemeral volume? I.e., if one Ajax call creates a thread and saves a cache on /data_dir, can I assume that any future Ajax call will be able to access that same file? All of the Ajax calls go to the same master Flask Python application, if that helps.

What other alternatives are there for making a cache available to future JavaScript calls?

Thanks, T.

Tunneller
  • The file system in Cloud Run is in `memory`. That means when the instance is terminated, the cache is lost. If your instance is not servicing requests it is eligible to be terminated. While the instance is running, each request to the same instance can access the shared cache. The memory is not shared across multiple instances. I would not rely upon caching data within an instance. – John Hanley Apr 25 '23 at 23:39
  • That is indeed my concern. So what exactly is an "instance"... There is a Flask program sitting in the background (I guess), and then when a request comes in from one browser, a thread is fired up. I think that thread is within the same instance. Requests from Ajax, different browsers, etc. (I guess) are really just PUT and GET as far as Flask is concerned, so I think all of these requests hit the same instance. When/how does that instance die? – Tunneller Apr 26 '23 at 00:08
  • I recommend reading about the lifecycle of a Cloud Run instance. Basically, an instance exists while it services HTTP requests; thereafter it is terminated at Google's discretion. There are options to control that behavior. The busier your service, the more instances are launched. That is controlled by requests per second, CPU utilization, etc. – John Hanley Apr 26 '23 at 01:38
  • Oh, I see. Yes, you are correct. This lifecycle is quite different to what I had imagined. – Tunneller Apr 26 '23 at 03:41
  • And multiple requests from the same browser are not guaranteed to be served by the same instance. So multiple requests could quite realistically see different /tmp files. Ouch. – Tunneller Apr 26 '23 at 03:50
  • From above conversation, I can see that you got required inputs. Do you need any further help? – Roopa M Apr 26 '23 at 09:43
  • You can use session affinity to route the client to the same instance, as long as it exists – guillaume blaquiere Apr 26 '23 at 09:58
  • I looked into load balancing, sticky sessions, etc., and got completely confused... But what I'm leaning toward doing is: (a) on a browser request, check the GCS filenames to find the cache for that user; if that filename exists on /tmp then I think I can assume it is, in fact, a valid cache; (b) if it is not on /tmp, then asynchronously launch both the download from the cache to /tmp and a BigQuery/processing command to create the missing data, which should finish at around the same time; (c) glue the results back together and return them to the browser; and then (d) asynchronously upload the new cache. Does this sound reasonable? (Rough sketch after these comments.) – Tunneller Apr 26 '23 at 14:07
  • What I'm researching now is ways to rapidly scan the metadata of Parquet files on GCS without downloading them. Might not be possible, but I could always put the first and last timestamps into the title of the Parquet file. – Tunneller Apr 26 '23 at 14:08
  • Does this [thread](https://stackoverflow.com/a/59142701/18265570) help you? – Roopa M Apr 27 '23 at 07:03
  • Hi @RoopaM, thanks, yes, I saw that post and have indeed installed it. It's pretty much exactly what I need to rapidly scan GCS to find the specific file I need; then I check whether it is in /tmp, and if not I launch the download from GCS. I was waiting until I got it all working to post results. Today I'm going to sort out how to asynchronously do the "put" so that GCS gets its updated file without the user waiting. Starting to look good! – Tunneller Apr 28 '23 at 12:14
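
A rough sketch of the (a)-(d) plan from the comments above. The helper process_recent_from_bigquery() is hypothetical (it would run the BigQuery download, JSON cracking and FFTs for the timestamps not yet cached), and the bucket and file names are placeholders:

import os
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import storage

executor = ThreadPoolExecutor(max_workers=4)

def handle_request(user_id):
    client = storage.Client()
    blob = client.bucket("my-cache-bucket").blob(f"cache_{user_id}.parquet")  # placeholders
    local_path = f"/tmp/cache_{user_id}.parquet"

    # (a)/(b): reuse the /tmp copy if this instance already has it, otherwise pull it
    # from GCS while BigQuery processes the missing recent timestamps in parallel.
    download = None
    if not os.path.exists(local_path) and blob.exists():
        download = executor.submit(blob.download_to_filename, local_path)
    recent = executor.submit(process_recent_from_bigquery, user_id)  # hypothetical helper

    if download is not None:
        download.result()
    cached = (pq.read_table(local_path).to_pandas()
              if os.path.exists(local_path) else pd.DataFrame())

    # (c): glue cached and freshly computed rows together for the response
    combined = pd.concat([cached, recent.result()], ignore_index=True)

    # (d): refresh the cache and upload it in the background so the user does not wait
    pq.write_table(pa.Table.from_pandas(combined), local_path)
    executor.submit(blob.upload_from_filename, local_path)
    return combined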

1 Answer


This is only part of an answer, but here are two ways I looked into extracting the metadata from a bunch of files on Google Cloud Storage. Using GCSFS was more than two times faster, at least with the code below (typically ~0.9 seconds vs ~2.3).

import timeit

import gcsfs
from google.cloud import storage

# Placeholders for this example; in the real app these come from the request.
project_id = "my-project"
bucket_name = "my-cache-bucket"
well_name = "well_01"

def method_1(fs, bucket_name, well_name):
    """Read custom metadata via gcsfs, one getxattr call per attribute."""
    cache_dict = {}
    for f in fs.ls(bucket_name + "/" + well_name):
        if fs.isdir(f):
            continue
        try:
            meta_job = fs.getxattr(f, "meta_job")
            meta_filt = fs.getxattr(f, "meta_filt")
            cache_dict[f] = [meta_job, meta_filt]
        except Exception:
            pass  # skip objects that do not carry the custom metadata
    return cache_dict

def method_2(client, bucket_name, well_name):
    """Read custom metadata from the blob listing returned by google-cloud-storage."""
    blobs = client.list_blobs(bucket_name, prefix=well_name)
    cache_dict = {}
    for blob in blobs:
        cache_dict[blob.name] = blob.metadata
    return cache_dict

fs = gcsfs.GCSFileSystem(project=project_id)  # sees all buckets in the project
client = storage.Client(project=project_id)   # both methods then filter on bucket + prefix

t1 = timeit.Timer(lambda: method_1(fs, bucket_name, well_name))
result1 = t1.timeit(20)
t2 = timeit.Timer(lambda: method_2(client, bucket_name, well_name))
result2 = t2.timeit(20)

print("gcsfs", result1, "blob listing", result2)
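
For completeness, both methods only read custom metadata, so it has to be attached when the Parquet file is uploaded in the first place. A minimal sketch with google-cloud-storage, reusing the placeholder names above (the object name and metadata values are illustrative only):

from google.cloud import storage

client = storage.Client(project=project_id)
blob = client.bucket(bucket_name).blob(well_name + "/cache_0001.parquet")  # placeholder object name
blob.metadata = {"meta_job": "job-42", "meta_filt": "lowpass"}             # illustrative values
blob.upload_from_filename("/data_dir/cache_0001.parquet")                  # metadata is stored with the object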
    
Tunneller