
I'm using a web API to request and receive data to build out an SQL database of historical energy prices. For context, energy prices are set at what are called "nodes", and each node has 20 years of historical data.

I can receive the data in JSON or XML format. I need to do one operation with the received data before I put it into the SQL database. Namely, I need to convert each hour given in Eastern Daylight Time back to its Eastern Standard Time equivalent.
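
The conversion itself is, I think, just re-interpreting each EDT wall-clock hour as its EST equivalent, something like this untested sketch (the timestamp format is a guess; my real feed may differ):

    from datetime import datetime, timedelta, timezone

    # Fixed offsets: EDT is UTC-4, EST is UTC-5
    EDT = timezone(timedelta(hours=-4))
    EST = timezone(timedelta(hours=-5))

    raw = "2023-07-01 14:00"                # sample value; the API's actual format may differ
    dt_edt = datetime.strptime(raw, "%Y-%m-%d %H:%M").replace(tzinfo=EDT)
    dt_est = dt_edt.astimezone(EST)         # 2023-07-01 13:00-05:00
    print(dt_est.isoformat())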

Being brand new to Python (I learned it in the last two weeks), I initially went down a path that was more intuitive to me:

HTTP Request (XML format) -> Parse to XML object in Python -> Convert Datetime -> Place in SQL database

The total size of the data I'm attempting to get is roughly 150GB. Because of this, I wanted to get the data in an asynchronous manner and format it and put it into SQL as it came in from hundreds of API calls (there's a 50,000-row limit to what I can get at a time). I was using a ThreadPool to do this. Once the data was received, I attempted to use a ProcessPool to convert this data into the format I needed to place into my SQL database, but was unsuccessful.
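
For reference, the fetching piece looked roughly like this (sketch only; the endpoint, parameter names, and page size are placeholders, not the real API):

    from concurrent.futures import ThreadPoolExecutor
    import requests

    BASE_URL = "https://example.com/api/prices"   # placeholder endpoint
    ROW_LIMIT = 50000                             # per-request row cap

    def fetch_page(node, offset):
        # One API call for one node / one 50,000-row window (parameter names are made up)
        resp = requests.get(BASE_URL, params={"node": node, "offset": offset,
                                              "limit": ROW_LIMIT, "format": "json"})
        resp.raise_for_status()
        return resp.json()

    # Downloading is I/O-bound, so a thread pool handles the requests
    with ThreadPoolExecutor(max_workers=8) as pool:
        pages = list(pool.map(lambda args: fetch_page(*args),
                              [("NODE_A", 0), ("NODE_A", 50000), ("NODE_B", 0)]))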

Looking at it from a high level, I think this process can be a lot more efficient. I think I can do the following:

HTTP Request (JSON Format) -> Parse to JSON object in Python -> Perform operation to convert datetime (map value using dictionary?) -> Place into SQL database

I just discovered the OPENJSON library in Python. Is this all I need to do this?
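
To make the middle steps concrete, this is roughly what I picture with the standard json and sqlite3 modules (untested sketch; the field names and table layout are invented):

    import json
    import sqlite3
    from datetime import datetime, timedelta, timezone

    EDT = timezone(timedelta(hours=-4))
    EST = timezone(timedelta(hours=-5))

    def to_est(raw):
        # Re-interpret an EDT wall-clock string as its EST equivalent (format assumed)
        dt = datetime.strptime(raw, "%Y-%m-%d %H:%M").replace(tzinfo=EDT)
        return dt.astimezone(EST).strftime("%Y-%m-%d %H:%M")

    payload = json.loads('[{"hour": "2023-07-01 14:00", "price": 31.5}]')  # stand-in for an API response

    conn = sqlite3.connect("prices.db")
    conn.execute("CREATE TABLE IF NOT EXISTS node_a (hour TEXT, price REAL)")
    conn.executemany("INSERT INTO node_a VALUES (?, ?)",
                     [(to_est(row["hour"]), row["price"]) for row in payload])
    conn.commit()
    conn.close()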

Another issue I need to look into is the limitations of SQLite3. Each node will have its own table in my database, so ideally I'd like to have as many instances of my program as possible getting, parsing, and putting data into my SQLite3 database.

Any help would be much appreciated!


1 Answer


There is no definite answer to your question given so many unknowns, but I can outline the way to get to a solution.

Factors That Influence Performance

The processing is done in stages as you described (I'll abstract away the actual format for now for the reasons I'll describe a bit later):

  1. Fetch data from the remote service
  2. Parse data
  3. Convert data
  4. Store into local DB

For every stage there are limiting factors that do not allow you to increase processing speed.

For fetching data, some of them are:

  1. network bandwidth.
  2. parallelism that the remote server supports: the remote server may throttle connections and/or total speed for a single user, or its terms of usage may require you to limit this on the client side (a client-side throttling sketch follows this list).
  3. data format used when downloading. Different formats add their own amount of unneeded/boilerplate formatting and/or data that is sent over the network. It depends on the service and its API, but it may be that the returned XML is smaller than the JSON, so even though XML is usually more verbose, in your particular case XML may be better.
  4. RAM amount (and swap speed) may be a limit on your system in the (very unlikely) case that factors #1 and #2 do not limit you. In that case the downloaded data may not fit into RAM, will be swapped to disk, and this will slow down the download process.
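
For example, if the terms of usage cap you at a few concurrent requests, a semaphore is a simple way to enforce that cap on the client side regardless of pool size (a sketch with a stubbed-out request; the URLs and limit are made up):

    import time
    from concurrent.futures import ThreadPoolExecutor
    from threading import BoundedSemaphore

    MAX_CONCURRENT = 4                       # whatever the service's terms allow
    slots = BoundedSemaphore(MAX_CONCURRENT)

    def fetch(url):
        time.sleep(0.1)                      # stand-in for the real HTTP request
        return url

    def polite_fetch(url):
        with slots:                          # never more than MAX_CONCURRENT calls in flight
            return fetch(url)

    urls = [f"https://example.com/api?page={i}" for i in range(20)]   # placeholder URLs
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(polite_fetch, urls))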

For parsing the data:

  1. RAM amount (for the same reasons as above)
  2. Data format used.
  3. The parser used. Different parser implementations for JSON, for example, have different speeds (a minimal timing sketch follows this list).
  4. CPU power: speed and number of processing units.
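
As an illustration of how to get a number for the parsing stage, here is a rough timing of the standard library json parser on synthetic data; the same pattern works for any parser or format you want to compare:

    import json
    import time

    # Synthetic payload roughly resembling hourly price rows
    sample = json.dumps([{"hour": f"2023-07-01 {h:02d}:00", "price": 30.0 + h}
                         for h in range(24)] * 1000)

    start = time.perf_counter()
    for _ in range(100):
        json.loads(sample)
    elapsed = time.perf_counter() - start
    print(f"parsed {100 * len(sample) / 1e6:.1f} MB of JSON in {elapsed:.2f}s")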

For data conversion:

  1. RAM amount
  2. CPU power

For data storing:

  1. disk speed
  2. parallelism level the DB efficiently supports

These are not all the factors that limit processing speed, just some of the most obvious. There are also other, unknown limitations.

Also, there may be some overhead when passing data between stages. It depends on the design. In some designs (for example, a single process that reads the data from the remote server, processes it in memory, and stores it to the database) the overhead may be zero, but in other designs (multiple processes read data and store it to files, another set of processes opens these files and processes them, and so on) the overhead may be quite big.

The final speed of processing is defined by the speed of the slowest stage or by the speed of data passing between stages.

Not all of these factors can be predicted when you design a solution or choose between several designs. Given that there are unknown factors, this is even more complicated.

Approach

To be systematic I would use the following approach:

  1. create a simple solution (like a single process that reads data, processes it, and stores it to the database)
  2. find the processing speed of every phase using that solution
  3. when you have the processing speed of every phase, look at the slowest phase (note that it makes sense to look only at the slowest one, as it defines the overall speed)
  4. then find out
    • why is it slow?
    • what limits the speed, and can that be improved?
    • what is the theoretical limit of that stage? (for example, if you have a 1 Gbit network and one processing box, you can't read data faster than ~120 MB/s; in practice it will be even smaller)
  5. Improve. The improvement is usually
    • optimize the processing done by a single processor (like choosing a better format or library for parsing, removing operations that can be avoided, etc.). If you hit (or are close to) the theoretical limit of the processing speed, you can't use this option.
    • add more parallelism

In general, when you try to optimize something you need to have numbers and compare them while you are doing experiments.
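
For example, a crude way to get those numbers from the simple single-process solution of step 1 (the four stage functions below are placeholders for your real fetch, parse, convert, and store code):

    import json
    import time

    def timed(label, func, *args):
        # Run one stage, print its duration, return its result
        start = time.perf_counter()
        result = func(*args)
        print(f"{label}: {time.perf_counter() - start:.3f}s")
        return result

    # Placeholders for the real pipeline stages
    def fetch_page():
        return '[{"hour": "2023-07-01 14:00", "price": 31.5}]'   # would be the HTTP call

    def parse_payload(raw):
        return json.loads(raw)

    def convert_rows(rows):
        return rows                                              # datetime conversion goes here

    def store_rows(rows):
        pass                                                     # SQLite insert goes here

    raw = timed("fetch", fetch_page)
    parsed = timed("parse", parse_payload, raw)
    rows = timed("convert", convert_rows, parsed)
    timed("store", store_rows, rows)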

Parallelism

Python

You should be careful when choosing between threads and processes; for example, threads are not good for CPU-intensive tasks. See Multiprocessing vs Threading Python for more information on this.
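
As a rule of thumb: a thread pool for the I/O-bound downloading, a process pool for the CPU-bound parsing/conversion. A minimal sketch (the two work functions are placeholders):

    import json
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def download(page):
        # I/O-bound: the thread mostly waits on the network, so the GIL is not a problem
        return '[{"hour": "2023-07-01 14:00", "price": 31.5}]'   # placeholder HTTP call

    def parse_and_convert(payload):
        # CPU-bound: parsing and datetime math benefit from separate processes
        return json.loads(payload)

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=8) as io_pool:
            payloads = list(io_pool.map(download, range(10)))

        with ProcessPoolExecutor(max_workers=4) as cpu_pool:
            results = list(cpu_pool.map(parse_and_convert, payloads))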

SQLite

SQLite may have some limitations when multiple processes work with a single database. You need to check whether it is the limiting factor for your speed. Maybe you need to use another database that is better suited to parallelism and then, as an additional final step, dump the data from it into SQLite in a single shot (that would only require reading the data sequentially and storing it in SQLite, which may be much more efficient than parallel writes to a single SQLite DB).
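
If you stay with SQLite, one common way to sidestep the parallel-write limitation is to funnel all rows through a single writer process and let the fetch/parse workers feed it through a queue. A minimal sketch (a single table with a node column, just for illustration):

    import sqlite3
    from multiprocessing import Process, Queue

    def writer(queue, db_path="prices.db"):
        # The only process that ever touches the SQLite file
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS prices (node TEXT, hour TEXT, price REAL)")
        while True:
            batch = queue.get()
            if batch is None:                       # sentinel: no more data
                break
            conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", batch)
            conn.commit()
        conn.close()

    if __name__ == "__main__":
        q = Queue()
        w = Process(target=writer, args=(q,))
        w.start()

        # Fetch/parse workers (threads or processes) would put batches on the queue:
        q.put([("NODE_A", "2023-07-01 13:00", 31.5)])
        q.put(None)                                 # tell the writer to stop
        w.join()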