Objective:
I have thousands of data dumps whose format, after unzipping, is one long string containing ~150K JSON objects separated by '\n':
big_string = '{"mineral": "gold", "qty": 2, "garbage":"abc"}\n ....... {"mineral": "silver", "qty": 4}'
Each JSON object contains dozens of useless keys like garbage, but my objective is only to sum the qty for each mineral:
result = {'gold': 213012, 'silver': 123451, 'adamantium': 321434}
How to reproduce:
import json
import random

minerals = ['gold', 'silver', 'adamantium']
# json.dumps so that each line is valid JSON, like the real dumps
big_string = '\n'.join(
    json.dumps({'mineral': random.choice(minerals),
                'qty': random.randint(1, 1000),
                'garbage': random.randint(1, 666),
                'other_garbage': random.randint(-10, 10)})
    for _ in range(150000)
)
def solution(big_string):
    # Show me your move
    return dict()  # or pd.DataFrame()
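A quick sanity check (assuming the json.dumps-based generation above) that each line of big_string is valid JSON with the expected keys:
import json
first_line = big_string.split('\n', 1)[0]
print(sorted(json.loads(first_line).keys()))  # ['garbage', 'mineral', 'other_garbage', 'qty']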
My current solution (which I find slower than expected):
- Splitting the string on the '\n' separator with a yield-based generator (see https://stackoverflow.com/a/9770397/4974431)
- Parsing each line with the ujson library (supposed to be faster than the standard json module)
- Extracting only the two values needed, 'mineral' and 'qty'
- Doing the aggregation using pandas
Which gives:
import ujson
import re
import pandas as pd
# To split the big_string (from https://stackoverflow.com/a/9770397/4974431)
def lines(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
def my_solution(big_string):
    useful_fields = ['mineral', 'qty']
    filtered_data = []
    for line in lines(big_string, sep="\n"):
        # parse each JSON line and keep only the two useful fields
        record = ujson.loads(line)
        filtered_data.append([record[field] for field in useful_fields])
    result = pd.DataFrame(filtered_data, columns=useful_fields)
    return result.groupby('mineral')['qty'].sum().reset_index()
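For reference, a timing harness can be as simple as the sketch below (using time.perf_counter on the generated big_string; absolute numbers will of course depend on the machine):
import time
start = time.perf_counter()
print(my_solution(big_string))
print(f"{time.perf_counter() - start:.2f}s for 150K lines")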
Any improvement, even by 25%, would be great, because I have thousands of these dumps to process!
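For comparison, a pandas-free baseline could look like the sketch below (a sketch only, not benchmarked; dict_solution is just an illustrative name): split on the literal '\n' and sum qty into a plain dict instead of building a DataFrame.
from collections import defaultdict
import ujson

def dict_solution(big_string):
    # accumulate qty per mineral directly, without building a DataFrame
    totals = defaultdict(int)
    for line in big_string.split('\n'):
        record = ujson.loads(line)
        totals[record['mineral']] += record['qty']
    return dict(totals)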