
I have a large JSON file (a Wikidata dump, to be more specific) compressed as gzip. What I want to achieve is to build an index such that I can do random access and retrieve the line (entity) I desire. The brute-force way to find a line (entity) of interest would be:

from gzip import GzipFile

with GzipFile("path-to-wikidata/latest-all.json.gz", "r") as dump:
    for line in dump:
        # ....
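
For illustration, here is a minimal sketch of what such a scan might look like; the trailing-comma handling and the target id `Q31` are assumptions, based on the dump being a JSON array with one entity per line:

import json
from gzip import GzipFile

target_id = "Q31"  # hypothetical entity of interest

with GzipFile("path-to-wikidata/latest-all.json.gz", "r") as dump:
    for line in dump:
        # Strip the trailing ",\n" and skip the "[" / "]" lines of the JSON array
        payload = line.rstrip(b"\n").rstrip(b",")
        if not payload or payload in (b"[", b"]"):
            continue
        entity = json.loads(payload)
        if entity["id"] == target_id:
            break  # found the entity of interest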

An alternative that I know of is to use hdf5: do one pass over the dump and store everything of interest in an hdf5 file. However, the issue with this approach is that even one pass over Wikidata is very slow, and writing millions of entries to the hdf5 file takes a while.
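
For completeness, a rough sketch of that one-pass approach, assuming h5py and storing only each entity's raw JSON under its id (the output file name and layout are illustrative):

import json
from gzip import GzipFile

import h5py

with GzipFile("path-to-wikidata/latest-all.json.gz", "r") as dump, \
        h5py.File("path-to-wikidata/wikidata.hdf5", "w") as out:
    for line in dump:
        payload = line.rstrip(b"\n").rstrip(b",")
        if not payload or payload in (b"[", b"]"):
            continue
        entity_id = json.loads(payload)["id"]
        # One dataset per entity; creating millions of these is the slow part
        out[entity_id] = payload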

Finally, I looked into indexed_gzip, with which I can seek to a random location in the file and then read a sequence of bytes from it:

import indexed_gzip as igzip

wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz")
# An offset towards the end of the file
offset = 10000000000
# Seek to the desired location
wikidata.seek(offset)
# Read a sequence of bytes
length_of_sequence = 100000
data_bytes = wikidata.read(length_of_sequence)

However, the seek takes extremely long in certain cases, e.g., when seeking to locations further from the start of the file. Note that this occurs only the first time I seek to a location; every subsequent seek is as fast as seeking to the first element. Evidence below:

import json
from typing import OrderedDict, Tuple

import indexed_gzip as igzip

# Example of entity2index mapping: Q31 --> [offset, length]
# The file is ordered the same way the dump is iterated, e.g.,
# the first entity in the dictionary is also the first one in Wikidata
entity2index: OrderedDict[str, Tuple[int, int]] = json.load(open("path-to-wikidata/wikidata_index.json"))

# Wikidata dump
wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz")

# List of entities
entities = list(entity2index.keys())

# Testing starts
entity = entities[0]
offset, _ = entity2index[entity]
# 367 µs ± 139 µs per loop (mean ± std. dev. of 7 runs, 2 loops each)
%timeit -n 2 wikidata.seek(offset)

entity = entities[1000000]
offset, _ = entity2index[entity]
# The slowest run took 92861.95 times longer than the fastest. This could mean that an intermediate result is being cached.
# 2.18 s ± 5.33 s per loop (mean ± std. dev. of 7 runs, 2 loops each)
%timeit -n 2 wikidata.seek(offset)

With that said, I am interested in (1) overcoming the issue of the first seek to a location being significantly slower than every subsequent one, or (2) any alternatives that could work better.

gorjan
  • It looks like `indexed_gzip` is exactly what you're looking for. The access time should not depend on where in the gzip file it is. What is your evidence that it is taking longer towards the end? – Mark Adler Nov 16 '22 at 16:01
  • Hello Mark! Thanks a lot for spending time to read my question! I updated the question with the evidence you asked for. – gorjan Nov 16 '22 at 16:27
  • If it's only the first time, then it is working as expected. It needs to read and process the entire gzip file to build the index, once. Then all accesses will be fast. – Mark Adler Nov 16 '22 at 18:42
  • Starting with the bzip2 file and using something like seek-bzip is another alternative https://stackoverflow.com/a/3701268/167425 – Tom Morris Nov 16 '22 at 18:51
  • Thanks @MarkAdler, @TomMorris. I was able to solve it using `indexed_gzip` by pre-computing the index only once. Posted an answer describing the solution. – gorjan Nov 16 '22 at 20:22

1 Answer


Thanks to the comment by Mark Adler, I was able to resolve the issue by pre-computing and storing two index files on disk. The first is the dictionary mentioned in the question, which maps each entity id, e.g., Q31, to the offset and length of its data in latest-all.json.gz. The second enables fast seeks, and I obtained it as described in the igzip documentation:

import indexed_gzip as igzip

wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz")
# Read through the whole file once to build the full seek-point index,
# then save it so it never has to be rebuilt
wikidata.build_full_index()
wikidata.export_index("path-to-wikidata/wikidata_seek_index.gzidx")
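
The first index file can be built in a single sequential pass over the dump. Here is a sketch of one way to do it, assuming the dump is a JSON array with one entity per line; the offsets and lengths refer to the uncompressed stream, which is what IndexedGzipFile seeks and reads on:

import json
from gzip import GzipFile

entity2index = {}
offset = 0

with GzipFile("path-to-wikidata/latest-all.json.gz", "r") as dump:
    for line in dump:
        # Entity JSON without the trailing ",\n"; the "[" / "]" lines are skipped
        payload = line.rstrip(b"\n").rstrip(b",")
        if payload and payload not in (b"[", b"]"):
            entity_id = json.loads(payload)["id"]
            entity2index[entity_id] = [offset, len(payload)]
        offset += len(line)

with open("path-to-wikidata/wikidata_index.json", "w") as f:
    json.dump(entity2index, f)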

Then, when I want to retrieve the data for a given Wikidata entity, I do:

import json

import indexed_gzip as igzip

# First index file, mapping from Q31 --> offset and length of the chunk of data for that entity
entity2index = json.load(open("path-to-wikidata/wikidata_index.json"))
# Wikidata dump opened with the pre-computed seek index
wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz", index_file="path-to-wikidata/wikidata_seek_index.gzidx")

# Get the offset and length of the entity
offset, length = entity2index["Q41421"]
# Seek to the location
wikidata.seek(offset)
# Obtain the data chunk
data_bytes = wikidata.read(length)
# Load the data from the byte array
data = json.loads(data_bytes)
gorjan