I have a large JSON file (a Wikidata dump, to be more specific) compressed as gzip. What I want to achieve is to build an index, such that I can do random access and retrieve the line/entity I desire. The brute force way to find a line (entity) of interest would be:
from gzip import GzipFile

with GzipFile("path-to-wikidata/latest-all.json.gz", "r") as dump:
    for line in dump:
        # ....
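While doing such a pass, I can also record for every entity its byte offset and length in the uncompressed stream, which gives me the entity2index mapping used further below. A simplified sketch of how I build it (it assumes the usual dump layout of one entity per line with a trailing comma, and parses each line only to pull out the id):

import json
from gzip import GzipFile

entity2index = {}
offset = 0
with GzipFile("path-to-wikidata/latest-all.json.gz", "r") as dump:
    for line in dump:
        length = len(line)
        # Every line except the enclosing "[" / "]" holds one entity's JSON,
        # terminated by a trailing comma
        stripped = line.rstrip(b",\n")
        if stripped.startswith(b"{"):
            entity_id = json.loads(stripped)["id"]
            entity2index[entity_id] = [offset, length]
        offset += length

with open("path-to-wikidata/wikidata_index.json", "w") as f:
    json.dump(entity2index, f)

The recorded offsets are positions in the uncompressed stream, which is what indexed_gzip seeks over.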
An alternative that I know of is to use hdf5: do one pass over the dump and store everything of interest in an hdf5 file. However, the issue with this approach is that even one pass over Wikidata is super slow, and writing millions of entries to the hdf5 file takes a while.
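For concreteness, the hdf5 variant would look roughly like this (using h5py with one dataset per entity; both the library and the layout are just an illustration, and creating millions of tiny datasets is exactly the part that takes forever):

import json
from gzip import GzipFile

import h5py

with GzipFile("path-to-wikidata/latest-all.json.gz", "r") as dump, \
        h5py.File("path-to-wikidata/wikidata.h5", "w") as h5:
    for line in dump:
        stripped = line.rstrip(b",\n")
        if not stripped.startswith(b"{"):
            continue  # skip the enclosing "[" / "]" lines
        entity = json.loads(stripped)
        # Store the raw JSON bytes of the entity under its ID
        h5.create_dataset(entity["id"], data=stripped)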
Finally, I looked into indexed_gzip, with which I can seek to a random location in the file and then read a sequence of bytes from it:
import indexed_gzip as igzip
wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz")
# An offset towards the end of the (uncompressed) file
offset = 10000000000
# Seek to the desired location
wikidata.seek(offset)
# Read a sequence of bytes
length_of_sequence = 100000
data_bytes = wikidata.read(length_of_sequence)
However, the seeking takes extremely long in certain cases, e.g., when seeking to offsets far from the start of the file. Note that this happens only the first time I seek to a location; every subsequent seek to it is as fast as seeking to the first element. Evidence below:
import json
from typing import OrderedDict, Tuple

import indexed_gzip as igzip

# Example of entity2index mapping: Q31 --> [offset, length]
# The file is ordered according to how the dump is iterated, i.e.,
# the first entity in the dictionary is the first one in Wikidata
with open("path-to-wikidata/wikidata_index.json") as f:
    entity2index: OrderedDict[str, Tuple[int, int]] = json.load(f)
# Wikidata dump
wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz")
# List of entities
entities = list(entity2index.keys())
# Testing starts
entity = entities[0]
offset, _ = entity2index[entity]
# 367 µs ± 139 µs per loop (mean ± std. dev. of 7 runs, 2 loops each)
%timeit -n 2 wikidata.seek(offset)
entity = entities[1000000]
offset, _ = entity2index[entity]
# The slowest run took 92861.95 times longer than the fastest.
# This could mean that an intermediate result is being cached.
# 2.18 s ± 5.33 s per loop (mean ± std. dev. of 7 runs, 2 loops each)
%timeit -n 2 wikidata.seek(offset)
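For completeness, this is how I intend to use the index for retrieval once the seek is no longer slow, reusing entity2index and wikidata from the snippet above (get_entity is just an illustrative helper):

def get_entity(entity_id):
    # Seek to the entity's start in the uncompressed stream, read exactly
    # the number of bytes it occupies, drop the trailing ",\n", and parse
    offset, length = entity2index[entity_id]
    wikidata.seek(offset)
    raw = wikidata.read(length)
    return json.loads(raw.rstrip(b",\n"))

q31 = get_entity("Q31")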
With that said, I am interested in (1) overcoming the issue of the first seek being significantly slower than every subsequent one, or (2) any alternatives that could work better.