
Preamble: I'm very new to APIs and computer science is not my background.

So, I was testing an API built with FastAPI. My function is very simple: it loads a dataset, performs a couple of validations, and then returns it in JSON format. A simplified version looks like this:

from time import time

import pandas as pd
from fastapi import FastAPI

app = FastAPI()

@app.get('/data/{dataset}/{version}')
async def download_data(dataset: str, version: str):
    now = time()
    file = f"data/{dataset}/{dataset}_{version}.feather"
    data = pd.read_feather(file)
    json_data = {"data": data.to_dict()}
    print(time() - now)
    return json_data

If I run this code locally for a given dataset (which is stored on my computer), it takes about 7 seconds (the result of print(time() - now)). But if I call it through the API, like this:

r = requests.get('http://127.0.0.1:8000/data/dataset_name/version_name')

it takes about 80 seconds. On my coworker's machine it takes about 30 seconds (also running locally on his machine). So, my question is: why does data transmission take so long if everything is running on my machine? What am I missing?

Sorry I can't provide a reproducible example; I'm not sure that's possible in this case.

Juan C
  • If it takes 7 seconds to read a file and convert it to a dictionary, then the data must be very large. Remember that your network is not infinitely fast. How large is the data? – Tim Roberts Jul 06 '23 at 20:12
  • A piece of the puzzle: it's an async function with an expensive synchronous call: `data = pd.read_feather(file)`. That blocks all other async operations for the duration, leading to delays. See for instance https://stackoverflow.com/questions/54685210/calling-sync-functions-from-async-function – tdelaney Jul 06 '23 at 20:20
  • Thanks to both! It's about 135 MB in this case. Even when working locally, is data transmitted through the network? How can I check my local network speed limits? @TimRoberts – Juan C Jul 06 '23 at 20:22
  • 127.0.0.1 does not hit the wire, so it will be fast. There is some protocol overhead, so you may see a bit of a delay for a 135 MB transfer, but it should not be too bad. And you'll have several copies of that 135 MB (the server holds the original dataframe and the JSON; the client is rebuilding the dict), so it may cause problems on a low-memory machine. Not the top factors IMHO. – tdelaney Jul 06 '23 at 20:35
  • It's a 16 GB machine, using ~60% of memory before running the code; I don't know if that helps – Juan C Jul 06 '23 at 20:39
  • When you run it "locally", are you converting to JSON? `requests` is going to do that, so the 135MB is going to grow significantly. – Tim Roberts Jul 06 '23 at 21:44
  • Aaah, no I'm not, it's just like the code. If I do it through Python, should I get some efficiency gains? – Juan C Jul 06 '23 at 21:47
  • Thanks @TimRoberts! It was as easy as adding `json_string = json.dumps(json_data)` and returning that object. I had no idea how `requests` parsed responses. Maybe add it as an answer so I can verify it. Went from ~80 to 16 seconds. – Juan C Jul 06 '23 at 21:59

1 Answer


Running a synchronous blocking call like pd.read_feather (which waits on the disk) plays havoc with asyncio: all other async tasks are blocked and cannot run until that feather call (~7 seconds, one would guess) completes. Instead, run the call in a worker thread and let your async web server keep up with its other tasks.

import asyncio
from time import time

import pandas as pd
from fastapi import FastAPI

app = FastAPI()

def feather_to_dict(file):
    data = pd.read_feather(file)
    return {"data": data.to_dict()}

@app.get('/data/{dataset}/{version}')
async def download_data(dataset: str, version: str):
    now = time()
    file = f"data/{dataset}/{dataset}_{version}.feather"
    loop = asyncio.get_event_loop()
    # Run the blocking read in the default thread pool so the event loop
    # stays free; pass the filename directly, not wrapped in a tuple.
    json_data = await loop.run_in_executor(None, feather_to_dict, file)
    print(time() - now)
    return json_data
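The bigger win, per the comment thread, came from serializing the payload once with `json.dumps` and returning the ready-made string, so FastAPI does not re-encode the huge nested dict field by field. A minimal sketch of that idea (the `payload` dict below is a hypothetical stand-in for `data.to_dict()`; in the real endpoint you would return the string in a `fastapi.Response(content=body, media_type="application/json")`):

```python
import json

# Hypothetical stand-in for data.to_dict() on a large dataframe
payload = {"data": {"value": {str(i): i * 1.5 for i in range(1000)}}}

# Serialize once with the stdlib's C-accelerated encoder; the endpoint
# can then hand this string straight back as the response body instead
# of letting the framework walk the nested dict itself.
body = json.dumps(payload)

print(type(body).__name__)  # str, ready to send as-is
```

This sketches the trade-off only; the exact speedup depends on the frame's shape and the server's encoder.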
tdelaney
  • Thanks @tdelaney, I'll try what you suggest! – Juan C Jul 06 '23 at 20:38
  • I'm getting an `AttributeError: 'function' object has no attribute 'submit'` on the await call, and I'm finding nothing on the error. Do you know what could be happening? – Juan C Jul 06 '23 at 21:26
  • Changed it to `json_data = await loop.run_in_executor(None, read_data, file)` and it worked! I gained about 1-2 seconds, but it still troubles my mind that running that code through a local GET request takes ~8 times longer than running the code by itself – Juan C Jul 06 '23 at 21:37
  • @JuanC - Fixed the call. I wasn't sure how much good it would do; I was hoping for more than 2 seconds. It's a needed change, but not the fix. – tdelaney Jul 06 '23 at 21:42
  • It's very appreciated! Nice to have correct code, and it could be useful in the future – Juan C Jul 06 '23 at 21:49