
I'm downloading data from the Polygon API, and after checking the documentation I realized there is a limit on the response size: each request returns at most 5000 records. Since I need to download several months' worth of data, it looks like there is no one-liner that fetches everything for the specified period at once, so I split the range into chunks and make one request per chunk (a rough sketch of that is below).
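
For context, my current fetching loop looks roughly like this. It's only a sketch: the URL shape follows the v2 aggregates endpoint as far as I can tell (double-check it against the docs), API_KEY is a placeholder, and month_chunks / fetch_range are just helper names I made up:

import requests
from datetime import timedelta

API_KEY = "..."  # placeholder, not a real key

def month_chunks(start, end):
    # yield (chunk_start, chunk_end) date pairs of roughly one month each
    cur = start
    while cur <= end:
        nxt = min(cur + timedelta(days=30), end)
        yield cur, nxt
        cur = nxt + timedelta(days=1)

def fetch_range(ticker, start, end):
    # one request per chunk; each response is capped at 5000 records
    for chunk_start, chunk_end in month_chunks(start, end):
        url = (f"https://api.polygon.io/v2/aggs/ticker/{ticker}"
               f"/range/1/day/{chunk_start}/{chunk_end}")
        yield requests.get(url, params={"apiKey": API_KEY}).json()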

Here's what the response looks like for four days' worth of daily data points, obtained with requests.get('query').json():

{
   "ticker":"AAPL",
   "status":"OK",
   "queryCount":4,
   "resultsCount":4,
   "adjusted":True,
   "results":[
      {
         "v":152050116.0,
         "vw":132.8458,
         "o":132.76,
         "c":134.18,
         "h":134.8,
         "l":130.53,
         "t":1598932800000,
         "n":1
      },
      {
         "v":200117202.0,
         "vw":131.6134,
         "o":137.59,
         "c":131.4,
         "h":137.98,
         "l":127,
         "t":1599019200000,
         "n":1
      },
      {
         "v":257589206.0,
         "vw":123.526,
         "o":126.91,
         "c":120.88,
         "h":128.84,
         "l":120.5,
         "t":1599105600000,
         "n":1
      },
      {
         "v":336546289.0,
         "vw":117.9427,
         "o":120.07,
         "c":120.96,
         "h":123.7,
         "l":110.89,
         "t":1599192000000,
         "n":1
      }
   ],
   "request_id":"bf5f3d5baa930697621b97269f9ccaeb"
}

I thought the fastest way would be to write the content as-is and process it later:

with open(out_file, 'a') as out:
    out.write(f'{response.json()["results"][0]}\n')

Later, after I've downloaded what I need, I read the file back and convert the data to a JSON file using pandas:

pd.DataFrame([eval(item) for item in open('out_file.txt')]).to_json('out_file.json')

Is there a better way of achieving the same thing? If anyone is familiar with Scrapy feed exports: is there a way of dumping the data to a JSON file during the run without keeping everything in memory, which I think is how Scrapy operates?

1 Answer


Instead of writing the content out as text, write it directly as JSON, with a unique filename per request (e.g. based on the request_id).

import json

# code for fetching data omitted.
data = response.json()

with open(out_file, 'w') as f:
    json.dump(data, f)

Then you can load all of them into DataFrames, similar to How to read multiple json files into pandas dataframe?:

from pathlib import Path # Python 3.5+

import pandas as pd

dfs = []

for path in Path('dumped').rglob('*.json'):
    tmp = pd.read_json(path)
    dfs.append(tmp)

df = pd.concat(dfs, ignore_index=True)
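
If you'd rather keep appending to a single file during the run, which is closer to Scrapy's JSON Lines feed export, one option is to write one json.dumps(...) line per record and read the whole file back with pd.read_json(..., lines=True). A minimal sketch, assuming response holds one of the Polygon payloads shown above and 'bars.jsonl' is just an example filename:

import json

import pandas as pd

# during the download: append each record as one JSON line
data = response.json()
with open('bars.jsonl', 'a') as f:
    for record in data.get('results', []):
        f.write(json.dumps(record) + '\n')

# afterwards: read the whole file back in one go
df = pd.read_json('bars.jsonl', lines=True)
df.to_json('out_file.json')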

dh762
  • I remember doing something similar and I got json complaining about some type conflict; did you ensure it works? Also consider that this sequence will be called thousands of times, if not millions, so do you think this would be faster than the way I did it? And will it append to the JSON file? –  Sep 07 '20 at 08:02
  • You effectively need to make sure your data is not duplicated, so you probably need to persist it somewhere. Regarding speed: the limiting factor is the network calls, even more so if you are being rate-limited, so it doesn't matter. – dh762 Sep 07 '20 at 08:05