Objective:
I have thousands of data dumps whose format, after unzipping, is one long string containing ~150K JSON objects separated by '\n':
big_string = '{"mineral": "gold", "qty": 2, "garbage":"abc"}\n ....... {"mineral": "silver", "qty": 4}'
Each JSON object contains dozens of useless keys like garbage, but my objective is only to sum the qty for each mineral:
result = {'gold': 213012, 'silver': 123451, 'adamantium': 321434}
How to reproduce:
import json
import random

minerals = ['gold', 'silver', 'adamantium']
# json.dumps so that each line is valid JSON, like the real dumps
big_string = '\n'.join(
    json.dumps({'mineral': random.choice(minerals),
                'qty': random.randint(1, 1000),
                'garbage': random.randint(1, 666),
                'other_garbage': random.randint(-10, 10)})
    for _ in range(150000)
)
def solution(big_string):
    # Show me your move
    return dict()  # or pd.DataFrame()
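A quick sanity check (assuming the json.dumps-based generation above) that each line of big_string is valid JSON with the expected keys:
import json
first_line = big_string.split('\n', 1)[0]
print(sorted(json.loads(first_line).keys()))  # ['garbage', 'mineral', 'other_garbage', 'qty']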
My current solution (which I find slower than expected):
- Splitting the string on the '\n' separator with a yield-based generator (see https://stackoverflow.com/a/9770397/4974431)
- Parsing each line with the ujson library (supposed to be faster than the standard json module)
- Extracting only the two values needed, 'mineral' and 'qty'
- Doing the aggregation using pandas
Which gives:
import ujson
import re
import pandas as pd
# To split the big_string (from https://stackoverflow.com/a/9770397/4974431)
def lines(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
def my_solution(big_string):
    useful_fields = ['mineral', 'qty']
    filtered_data = []
    for line in lines(big_string, sep="\n"):
        # parse each JSON line and keep only the two useful fields
        record = ujson.loads(line)
        filtered_data.append([record[field] for field in useful_fields])
    result = pd.DataFrame(filtered_data, columns=useful_fields)
    return result.groupby('mineral')['qty'].sum().reset_index()
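For reference, a timing harness can be as simple as the sketch below (using time.perf_counter on the generated big_string; absolute numbers will of course depend on the machine):
import time
start = time.perf_counter()
print(my_solution(big_string))
print(f"{time.perf_counter() - start:.2f}s for 150K lines")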
Any improvement, even by 25%, would be great, because I have thousands of these dumps to process!
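For comparison, a pandas-free baseline could look like the sketch below (a sketch only, not benchmarked; dict_solution is just an illustrative name): split on the literal '\n' and sum qty into a plain dict instead of building a DataFrame.
from collections import defaultdict
import ujson

def dict_solution(big_string):
    # accumulate qty per mineral directly, without building a DataFrame
    totals = defaultdict(int)
    for line in big_string.split('\n'):
        record = ujson.loads(line)
        totals[record['mineral']] += record['qty']
    return dict(totals)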