I'm calling a web API to pull data for an SQL database of historical energy prices. For context, energy prices are set at what are called "nodes", and each node has 20 years of historical data.
I can receive the data in JSON or XML format. I need to do one operation with the received data before I put it into the SQL database. Namely, I need to convert each hour given in Eastern Daylight Time back to its Eastern Standard Time equivalent.
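For the conversion step itself, here is a minimal sketch of what I have in mind (this assumes the API returns hours as ISO-8601 strings and that every stamp received is already labelled EDT; if the feed mixes EST and EDT you'd have to check the offset first):

```python
from datetime import datetime, timedelta

EDT_TO_EST = timedelta(hours=1)  # EDT runs one hour ahead of EST

def edt_to_est(timestamp: str) -> str:
    """Shift a timestamp labelled EDT back to its EST equivalent."""
    dt = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
    return (dt - EDT_TO_EST).strftime("%Y-%m-%dT%H:%M:%S")

print(edt_to_est("2023-07-01T14:00:00"))  # -> 2023-07-01T13:00:00
```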
Being brand new to Python (I learned it in the last two weeks), I initially went down the path most intuitive to me:
HTTP Request (XML format) -> Parse to XML object in Python -> Convert Datetime -> Place in SQL database
The total size of the data I'm attempting to get is roughly 150 GB. Because of this, I wanted to fetch the data asynchronously and format/insert it into SQL as it came in from hundreds of API calls (there's a 50,000-row limit on each request). I was using a ThreadPool to do this. Once the data was received, I attempted to use a ProcessPool to convert it into the format I needed for my SQL database, but was unsuccessful.
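For reference, the fetch/parse stage of my ThreadPool attempt looked roughly like the sketch below, built on concurrent.futures and ElementTree; the endpoint URL, query parameters, and element names are placeholders for whatever the real API uses:

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

BASE_URL = "https://api.example.com/prices"  # placeholder endpoint

def fetch_page(node: str, offset: int) -> list[dict]:
    """Download one 50,000-row page of XML for a node and parse it into rows."""
    resp = requests.get(BASE_URL, params={"node": node, "offset": offset, "format": "xml"})
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    # element names ("row", "hour", "price") are assumptions about the feed's schema
    return [
        {"hour": row.findtext("hour"), "price": float(row.findtext("price"))}
        for row in root.iter("row")
    ]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch_page, "NODE_A", off) for off in range(0, 500_000, 50_000)]
    for fut in as_completed(futures):
        rows = fut.result()
        # hand rows off for datetime conversion and the SQL insert
```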
Looking at the process from a high level, I think this process can be a lot more efficient. I think I can do the following:
HTTP Request (JSON format) -> Parse to JSON object in Python -> Convert datetime (map values using a dictionary?) -> Place into SQL database
I just discovered Python's built-in json library. Is this all I need for the parsing step?
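A minimal sketch of that JSON path, using only the standard-library json and sqlite3 modules; the payload field names ("items", "hour", "price") and the table name are assumptions about what the API and my schema would actually look like:

```python
import json
import sqlite3
from datetime import datetime, timedelta

sample_json = '{"items": [{"hour": "2023-07-01T14:00:00", "price": 31.25}]}'

def load_rows(json_text: str) -> list[tuple[str, float]]:
    """Parse the API's JSON payload and shift each EDT hour back to EST."""
    payload = json.loads(json_text)          # standard-library json module
    rows = []
    for item in payload["items"]:            # assumed field names
        est = datetime.fromisoformat(item["hour"]) - timedelta(hours=1)
        rows.append((est.isoformat(), item["price"]))
    return rows

conn = sqlite3.connect("prices.db")
conn.execute("CREATE TABLE IF NOT EXISTS node_a (hour TEXT PRIMARY KEY, price REAL)")
conn.executemany("INSERT OR REPLACE INTO node_a VALUES (?, ?)", load_rows(sample_json))
conn.commit()
```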
Another issue I need to look into is the limitations of SQLite3. Each node will have its own table in my database, so ideally I'd like to have as many instances of my program as possible getting, parsing, and putting data into my SQLite3 database.
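One thing I've read is that SQLite allows only one writer at a time on a database file, so many parallel writers mostly serialize against each other (and can raise "database is locked" errors under contention). A pattern I'm considering, sketched below under the assumption that the downloads remain the bottleneck, is to keep many fetcher threads but route all inserts through a single writer:

```python
import queue
import sqlite3
import threading

write_queue = queue.Queue()   # fetcher threads put (table_name, rows) tuples here

def writer(db_path: str = "prices.db") -> None:
    """Single writer: SQLite locks the whole database file for each write,
    so all inserts are funneled through one connection instead of having
    many instances of the program writing in parallel."""
    conn = sqlite3.connect(db_path)
    while True:
        item = write_queue.get()
        if item is None:                   # sentinel to shut the writer down
            break
        table, rows = item
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (hour TEXT PRIMARY KEY, price REAL)')
        conn.executemany(f'INSERT OR REPLACE INTO "{table}" VALUES (?, ?)', rows)
        conn.commit()
    conn.close()

threading.Thread(target=writer, daemon=True).start()
# a fetcher thread would then do: write_queue.put(("node_a", rows))
# and at shutdown:                write_queue.put(None)
```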
Any help would be much appreciated!