6

I need to save some big arrays once and load them multiple times in a Flask application with Python 3. I originally stored these arrays on disk with the json library. To speed this up, I used Redis on the same machine to store each array, serialized as a JSON string. I wonder why I get no improvement (it actually takes more time on the server I use), even though Redis keeps its data in RAM. I guess the JSON serialization isn't optimal, but I have no clue how I could speed it up:

import json
import redis
import os 
import time

current_folder = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(current_folder, "my_file")

my_array = [1]*10000000

with open(file_path, 'w') as outfile:
    json.dump(my_array, outfile)

start_time = time.time()
with open(file_path, 'r') as infile:
    my_array = json.load(infile)
print("JSON from disk  : ", time.time() - start_time)

r = redis.Redis()
my_array_as_string = json.dumps(my_array)
r.set("my_array_as_string", my_array_as_string)

start_time = time.time()
my_array_as_string = r.get("my_array_as_string")
print("Fetch from Redis:", time.time() - start_time)

start_time = time.time()
my_array = json.loads(my_array_as_string)
print("Parse JSON      :", time.time() - start_time)

Result:

JSON from disk  : 1.075700044631958
Fetch from Redis: 0.078125
Parse JSON      : 1.0247752666473389

EDIT: it seems that fetching from Redis is actually fast, but the JSON parsing is quite slow. Is there a way to fetch an array directly from Redis without the JSON serialization step? This is what we do with pyMySQL, and it is fast.

Robin
  • Off the top of my head I'd say that the disk version is artificially fast due to disk caching. See [here](https://stackoverflow.com/questions/11610180/how-to-measure-file-read-speed-without-caching), for example. Writing good benchmarks is hard. – Kevin Christopher Henry Sep 13 '18 at 07:10
  • I load almost 10 gigabytes of data on a Linux machine with 196 GB of RAM; do you think the OS caches most of this data? – Robin Sep 13 '18 at 07:32
  • "Usually, all physical memory not directly allocated to applications is used by the operating system for the [page cache](https://en.wikipedia.org/wiki/Page_cache)." – Kevin Christopher Henry Sep 13 '18 at 07:47
  • Thx, I updated my question to be more specific, Redis is actually much faster for accessing the data, but because I store the data as strings of JSON, the parsing part is really slow. I'm looking for a way to directly fetch the data in a python object, as we do with pyMySQL. – Robin Sep 13 '18 at 07:52
  • There's always a translation step between a stream of bytes and an in-memory Python object. That said, JSON is known to be slow so you could always try msgpack or even pickle. – Kevin Christopher Henry Sep 13 '18 at 08:10
  • Pickle is much slower, marshal is a bit faster. I didn't know about msgpack, I'll try it. But here we have two translations: one from a Redis string to a Python string, and one from a Python string to a Python object. I guess pyMySQL has a really efficient translation. – Robin Sep 13 '18 at 08:34
  • Redis may store data in RAM, but it's still an external process. You pay the interprocess tax when you call it. I'd suspect that `json.load` doesn't buffer the entire text before it starts parsing, yet that's what you have to do when you retrieve one big string from Redis. It's not that JSON parsing is slow, it's that you wait until you have the entire string before you even start parsing. – Panagiotis Kanavos Sep 17 '18 at 12:14
  • @debzsud perhaps you should consider using [reJSON](https://redislabs.com/blog/redis-as-a-json-store/) and [rejson-py](https://github.com/RedisLabs/rejson-py)? You'll get better performance if you *don't* have to load the entire string from Redis in order to load or update a single element – Panagiotis Kanavos Sep 17 '18 at 12:21
  • Tried JSON, Pickle, Marshal and MsgPack. Test script in my answer below. Time taken: Pickle > JSON > Marshal > MsgPack – Roopak A Nelliat Sep 17 '18 at 12:34
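A minimal sketch of the rejson-py approach suggested in the comments, assuming a local Redis server with the ReJSON module loaded; the key name, document layout, and element path are only illustrative:

from rejson import Client, Path

# rejson's Client subclasses redis.StrictRedis, so it works as a drop-in client
rj = Client(host='localhost', port=6379, decode_responses=True)

# Store the array as a JSON document inside Redis instead of one big string
rj.jsonset('my_doc', Path.rootPath(), {'arr': [1] * 10000})

# Fetch a single element without transferring and parsing the whole array
first_element = rj.jsonget('my_doc', Path('.arr[0]'))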

3 Answers

19

Update: Nov 08, 2019 - Ran the same test on Python 3.6

Results:

Dump time (slowest to fastest): JSON > msgpack > pickle > marshal
Load time (slowest to fastest): JSON > pickle > msgpack > marshal
Size (largest to smallest):     marshal > JSON > pickle > msgpack

+---------+---------------+---------------+--------------+
| package | dump time (s) | load time (s) | size (bytes) |
+---------+---------------+---------------+--------------+
| json    | 0.00134       | 0.00079       | 30049        |
| pickle  | 0.00023       | 0.00019       | 20059        |
| msgpack | 0.00031       | 0.00012       | 10036        |
| marshal | 0.00022       | 0.00010       | 50038        |
+---------+---------------+---------------+--------------+

I tried pickle vs json vs msgpack vs marshal.

On Python 3.6 (table above), pickle is much, much faster than JSON, and msgpack is at least 4x faster than JSON. MsgPack looks like the best option you have.

Edit: I tried marshal as well. Marshal is faster than JSON, but slower than msgpack.

Time taken (original Python 2 run below):  Pickle > JSON > Marshal > MsgPack
Space taken (original Python 2 run below): Marshal > Pickle > JSON > MsgPack

import time
import json
import pickle
import msgpack
import marshal
import sys

array = [1]*10000

start_time = time.time()
json_array = json.dumps(array)
print "JSON dumps: ", time.time() - start_time
print "JSON size: ", sys.getsizeof(json_array)
start_time = time.time()
_ = json.loads(json_array)
print "JSON loads: ", time.time() - start_time

# --------------

start_time = time.time()
pickled_object = pickle.dumps(array)
print "Pickle dumps: ", time.time() - start_time
print "Pickle size: ", sys.getsizeof(pickled_object)
start_time = time.time()
_ = pickle.loads(pickled_object)
print "Pickle loads: ", time.time() - start_time


# --------------

start_time = time.time()
package = msgpack.dumps(array)
print "Msg Pack dumps: ", time.time() - start_time
print "MsgPack size: ", sys.getsizeof(package)
start_time = time.time()
_ = msgpack.loads(package)
print "Msg Pack loads: ", time.time() - start_time

# --------------

start_time = time.time()
m_package = marshal.dumps(array)
print "Marshal dumps: ", time.time() - start_time
print "Marshal size: ", sys.getsizeof(m_package)
start_time = time.time()
_ = marshal.loads(m_package)
print "Marshal loads: ", time.time() - start_time

Result:

JSON dumps:  0.000760078430176
JSON size:  30037
JSON loads:  0.000488042831421
Pickle dumps:  0.0108790397644
Pickle size:  40043
Pickle loads:  0.0100247859955
Msg Pack dumps:  0.000202894210815
MsgPack size:  10040
Msg Pack loads:  7.58171081543e-05
Marshal dumps:  0.000118017196655
Marshal size:  50042
Marshal loads:  0.000118970870972
Roopak A Nelliat
  • Indeed, msgpack is about 4x faster. I'll wait a bit since I was looking for a more generic answer, but your answer is of great help. Fetch from Redis: 0.023797988891601562, parse msgpack: 0.17844223976135254 – Robin Sep 17 '18 at 13:10
  • 2
    Judging from your print comments, you used Python 2, where pickle is slow and you are advised to use the C version with 'import cPickle as pickle'. On Python 3.7, I get the following save and load times: - Using json: 0.739 + 0.584 ms, 30049 bytes. - Using ujson: 0.265 + 0.136 ms, 20050 bytes. - Using pickle: 0.188 + 0.132 ms, 20059 bytes. - Using msgpack: 0.317 + 0.059 ms, 10036 bytes. - Using marshal: 0.154 + 0.081 ms, 50038 bytes. Of course, if you are storing large homogeneous arrays, use numpy and pickle: - Numpy array using pickle: 0.016 + 0.000 ms, 40192 bytes. – Stephen Simmons Oct 31 '19 at 08:07
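Following the numpy-plus-pickle suggestion in the comment above, a minimal sketch of keeping a large homogeneous array in Redis with no JSON step at all; the key names and dtype are only illustrative:

import pickle
import numpy as np
import redis

r = redis.Redis()
arr = np.ones(10000000, dtype=np.int64)

# Pickling a numpy array is close to a raw memory copy
r.set("my_array_pickled", pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL))
arr_back = pickle.loads(r.get("my_array_pickled"))

# Alternative: store the raw buffer and rebuild the array from bytes
# (np.frombuffer returns a read-only view over the fetched bytes)
r.set("my_array_raw", arr.tobytes())
arr_back_raw = np.frombuffer(r.get("my_array_raw"), dtype=np.int64)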
2

Some explanation:

  1. Loading data from disk doesn't always mean disk access; often the data is returned from the in-memory OS cache, and when that happens it is even faster than getting the data from Redis (no network round trip in the total time).

  2. The main performance killer is JSON parsing (Captain Obvious).

  3. JSON parsing from disk is most likely done in parallel with loading the data, since it is streamed from the file.

  4. There is no option to parse from a stream with Redis (at least I don't know of such an API).


You may speed up the app with minimal changes just by storing your cache files on tmpfs. It is quite close to a Redis setup on the same server.
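A minimal sketch of the tmpfs idea, assuming /dev/shm is a tmpfs mount (the usual case on Linux); the file name is only illustrative:

import json
import os

# Files under /dev/shm live in RAM, so json.load() keeps its streaming
# behaviour but without physical disk I/O
cache_path = os.path.join("/dev/shm", "my_file")

with open(cache_path, "w") as outfile:
    json.dump([1] * 10000000, outfile)

with open(cache_path) as infile:
    my_array = json.load(infile)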

I agree with @RoopakANelliat that msgpack is about 4x faster than JSON. Changing the format will boost parsing performance (if it is possible in your case).
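A minimal sketch of such a format change, swapping the JSON step for msgpack while keeping Redis as the store; the key name is only illustrative:

import msgpack
import redis

r = redis.Redis()
my_array = [1] * 10000000

# msgpack produces a compact binary payload that parses much faster than JSON
r.set("my_array_msgpack", msgpack.packb(my_array))
my_array_back = msgpack.unpackb(r.get("my_array_msgpack"))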

Andrii Muzalevskyi
1

I made brain-plasma specifically for this reason: fast loading and reloading of big objects in a Flask app. It's a shared-memory object namespace for Apache Arrow-serializable objects, including pickled bytestrings generated by pickle.dumps(...).

$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma # 10MB memory
from brain_plasma import Brain
brain = Brain()

brain['a'] = [1]*10000
brain['a']
# >>> [1,1,1,1,...]

russellthehippo