Optimising the cost of pandas dataframe to json

Question

My goal is to sort the data frame by one column and return a JSON object as efficiently as possible.

For reproduction, please define the following dataframe:

import pandas as pd
import numpy as np
test = pd.DataFrame(data={'a':[np.random.randint(0,100) for i in range(10000)], 'b':[i + np.random.randint(0,100) for i in range(10000)]})

       a      b
0     74     89
1     55     52
2     53     39
3     26     21
4     69     34

What I need to do is sort by column a and then encode the output in a json object. I'm taking the basic approach and doing:

test.sort_values('a', ascending=True, inplace=True) # n log n
data = [] # 1
for d in test.itertuples(): # n times
    to_append = {'id': d.Index, 'data': {'a': d.a, 'b': d.b}} # 3 
    data.append(to_append) # 1

So is the cost n log n + 4n? Is there a more efficient way of doing it?

@user It's integer/float. I thought about creating an ordered dictionary and placing the data there straight away without sorting e.g. `d[a] = { # something }` and then converting to json — GRS, Aug 26 '18 at 19:27
An ordered dict would be the wrong choice, since it keeps track of the order in which the data is inserted. If a is only integers, you could make use of the hash function of a dict, since the natural order of a dict is based on hashing the keys, and for integers the hash is the integer itself (some exceptions exist, like -1). Floats mess up the concept, but you can use a workaround similar to this: https://stackoverflow.com/questions/23721230/float-values-as-dictionary-key — user, Aug 26 '18 at 19:35
@user I think all of 'a' in my case is integers, but I just wanted to see if it's possible to generalise — GRS, Aug 26 '18 at 19:39
If you can use one of the output formats supported by DataFrame.to_json(), you'll get much better runtime performance, as pandas runs Cython internally and is much faster than a regular Python for loop. — PabTorre, Aug 26 '18 at 19:42
@PabTorre Unfortunately, it doesn't produce the same format. The code would also still be n log n, and I was hoping to see an O(n) solution without sorting somehow — GRS, Aug 26 '18 at 19:44
You could try this: https://www.geeksforgeeks.org/sorting-using-trivial-hash-function/ but you could run into space limitations... — user, Aug 26 '18 at 20:15

score 1 · Accepted Answer · answered Aug 26 '18 at 22:00

I've noticed that pandas reads and writes JSON slower than pure Python. If you're sure there are only two columns, you can do something like this:

import json

data = [{'id' : x, 'data' : {'a' : y, 'b' : z}}
            for x, (y, z) in zip(test.index, test.values.tolist())]
json.dumps(data)
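
As a sanity check, the sketch below compares the list comprehension with a row-by-row `itertuples` loop on a small frame. The seed, the frame size, and the names `loop_records`, `fast_records` and `payload` are assumptions made for illustration and are not part of the original post.

```python
import json

import numpy as np
import pandas as pd

# Assumption: a small seeded frame so the check is reproducible.
rng = np.random.default_rng(0)
test = pd.DataFrame({'a': rng.integers(0, 100, size=1000),
                     'b': rng.integers(0, 100, size=1000)})
test = test.sort_values('a', ascending=True)

# Row-by-row loop in the style of the question.
loop_records = []
for d in test.itertuples():
    loop_records.append({'id': d.Index, 'data': {'a': d.a, 'b': d.b}})

# List comprehension over plain Python lists, in the style of the answer.
fast_records = [{'id': x, 'data': {'a': y, 'b': z}}
                for x, (y, z) in zip(test.index, test.values.tolist())]

# Serialise the records to a JSON string.
payload = json.dumps(fast_records)
```

Both approaches should build the same list of records, so any speed difference comes from how the rows are iterated rather than from what is produced.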

If you have more columns to worry about, you could do something like:

c = test.columns
data = [{'id' : x, 'data' : dict(zip(c, y))}
            for x, y in zip(test.index, test.values.tolist())]  # y is the list of values in one row
json.dumps(data)

Or, if you don't mind the index becoming the first column, call reset_index before converting:

c = test.columns
data = [{'id' : x[0], 'data' : dict(zip(c, x[1:]))} 
            for x in test.reset_index().values.tolist()]
json.dumps(data)
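
The comments ask whether the n log n sort can be avoided. When the sort column holds only non-negative integers below a known bound, one option is a bucket (counting) sort: group the records by key value, then concatenate the buckets, which takes O(n + k) time for k possible key values. The sketch below illustrates that idea under those assumptions. The function name `records_sorted_by_bucket` and its `upper` parameter are hypothetical, and a pure Python loop like this will not necessarily beat pandas' compiled sort in wall-clock time.

```python
import json

import numpy as np
import pandas as pd


def records_sorted_by_bucket(df, key, upper):
    """Build records ordered by df[key] without a comparison sort.

    Assumes all columns are integer and every value in df[key]
    lies in range(upper).
    """
    cols = list(df.columns)
    key_pos = cols.index(key)
    buckets = [[] for _ in range(upper)]  # one bucket per possible key value
    for idx, row in zip(df.index.tolist(), df.values.tolist()):
        buckets[row[key_pos]].append({'id': idx, 'data': dict(zip(cols, row))})
    # Concatenating the buckets in key order yields the sorted records;
    # rows with equal keys keep their original relative order.
    return [rec for bucket in buckets for rec in bucket]


# Assumption: a small seeded frame with keys in [0, 100).
rng = np.random.default_rng(0)
test = pd.DataFrame({'a': rng.integers(0, 100, size=1000),
                     'b': rng.integers(0, 100, size=1000)})
data = records_sorted_by_bucket(test, key='a', upper=100)
payload = json.dumps(data)
```

The memory cost grows with `upper`, which is the space limitation mentioned in the comments, so this only pays off when the key range is small relative to the number of rows.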
