
I have a single JSON file which is about 36 GB (coming from Wikidata) and I want to access it more efficiently. Currently I'm using RapidJSON's SAX-style API in C++ - but parsing the whole file takes about 7,415,200 ms (≈ 120 minutes) on my machine. I want to access the JSON objects inside this file by one of two primary keys ('name' or 'entity-key', e.g. 'Stack Overflow' or 'Q549037') which are inside each JSON object. That means that, at the moment, I have to parse the whole file in the worst case.

So I thought about two approaches:

  • splitting the big file into billions of small files - with a filename that indicates the name/entity-key (e.g. Q549037.json / Stack_Overflow.json or Q549037#Stack_Overflow.json) -> not sure about the storage overhead
  • building some kind of index from the primary keys to the ftell() position in the file (a rough sketch of this idea follows below the list). Building the index should take around 120 minutes (like parsing now), but access afterwards should be much faster
    • e.g. use something like two std::unordered_map instances (could run into memory problems again)
    • index files - create two files: one with entries sorted by name and one sorted by entity-key (creating these files will probably take much longer, because of the sorting)
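
Here is a rough sketch of what I have in mind for the second approach (the file name wikidata.json, the "id"/"name" field names and the naive string-based key extraction are only placeholders - in a real run I would record the offsets during the SAX pass, and this assumes one JSON object per line of the dump):

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

// Placeholder key extraction: finds "field":"value" in one line of JSON text.
// A real implementation would pull the keys out during the SAX parse instead.
static std::string extractKey(const std::string& line, const std::string& field)
{
    const std::string pattern = "\"" + field + "\":\"";
    const std::size_t hit = line.find(pattern);
    if (hit == std::string::npos) return "";
    const std::size_t start = hit + pattern.size();
    const std::size_t end = line.find('"', start);
    return (end == std::string::npos) ? "" : line.substr(start, end - start);
}

int main()
{
    std::ifstream in("wikidata.json", std::ios::binary);  // placeholder file name
    std::unordered_map<std::string, std::streamoff> byEntityKey;
    std::unordered_map<std::string, std::streamoff> byName;

    // One pass over the file: remember the byte offset of each object.
    std::string line;
    std::streamoff offset = in.tellg();
    while (std::getline(in, line)) {
        const std::string id   = extractKey(line, "id");    // e.g. "Q549037"
        const std::string name = extractKey(line, "name");  // placeholder field name
        if (!id.empty())   byEntityKey[id] = offset;
        if (!name.empty()) byName[name]    = offset;
        offset = in.tellg();
    }

    // Later: jump straight to a single object instead of re-parsing everything.
    const auto it = byEntityKey.find("Q549037");
    if (it != byEntityKey.end()) {
        in.clear();               // clear the EOF flag left by the indexing pass
        in.seekg(it->second);
        std::getline(in, line);   // 'line' now holds just that one JSON object
        std::cout << line << '\n';
    }
}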

What is the best-practice for a problem like this? Which approach should I follow? Any other ideas?

Constantin
    If you decide to write your own, think about using a proper parse tree instead of an unordered map. `boost::intrusive` might be of help. – Alexander Oh Feb 08 '15 at 20:44
    You could also consider using MongoDB or some other NoSQL database to store the JSON data. From what I understand this is kind of the (initial) point of NoSQL databases. – Félix Cantournet Feb 08 '15 at 22:27
    I just realized that your throughput is only about 5 MB/s. On my machine, in-memory SAX parsing is around 300 MB/s and DOM is about 200 MB/s. So for a big file it should be I/O bound. I suspect something is wrong. Are you using the latest version on GitHub? – Milo Yip Feb 09 '15 at 14:50
  • @MiloYip Yes, that's right - at 300 MB/s it should only take about 2 minutes. I guess I have to do a few more checks to find out why it is as slow as 5 MB/s on my PC. – Constantin Feb 10 '15 at 04:38

4 Answers


I think the performance problem is not due to parsing. Using RapidJSON's SAX API should already give good performance and be memory-friendly. If you need to access every value in the JSON, this may already be the best solution.

However, from the question description, it seems that reading all values at once is not your requirement. You want to read a (probably small) number of values matching particular criteria (e.g., by primary keys). Reading/parsing everything is not suitable for this case.

You will need some indexing mechanism. Doing that with file positions may be possible. If the data at those positions is also valid JSON, you can seek there and stream it to RapidJSON to parse that single JSON value (RapidJSON can stop parsing when one complete JSON value has been parsed, via kParseStopWhenDoneFlag).
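
A minimal sketch of that idea, assuming the byte offset of the value is already known from an index built in one full pass (the file name and offset below are placeholders; for a 36 GB file you would also need a 64-bit seek such as fseeko/_fseeki64 instead of fseek):

#include <cstdio>
#include <iostream>

#include "rapidjson/document.h"
#include "rapidjson/filereadstream.h"

// Parse exactly one JSON value that starts at byte 'offset' of 'path'.
// kParseStopWhenDoneFlag makes RapidJSON stop after the first complete value
// instead of expecting the stream to end there.
static bool parseValueAt(const char* path, long offset, rapidjson::Document& d)
{
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return false;
    std::fseek(fp, offset, SEEK_SET);  // use fseeko/_fseeki64 for offsets > 2 GB

    char buffer[64 * 1024];
    rapidjson::FileReadStream is(fp, buffer, sizeof(buffer));
    d.ParseStream<rapidjson::kParseStopWhenDoneFlag>(is);

    std::fclose(fp);
    return !d.HasParseError();
}

int main()
{
    rapidjson::Document d;
    // Placeholder path and offset; the offset would come from your index.
    if (parseValueAt("wikidata.json", 123456789L, d) && d.HasMember("id"))
        std::cout << d["id"].GetString() << '\n';
}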

Another option is converting the JSON into some kind of database, either an SQL database, a key-value database, or a custom one. With the indexing facilities these provide, you can query the data quickly. The conversion may take a long time, but it buys good performance for later retrieval.

Note that JSON is an exchange format; it was not designed for fast individual queries on big data.


Update: I recently found the project semi-index, which may suit your needs.

Milo Yip

Write your own JSON parser that minimizes allocations and data movement. Also, ditch multi-byte character handling for straight ANSI. I once wrote an XML parser to parse 4 GB XML files. I tried MSXML and Xerces; both had minor memory leaks that, with that much data, would actually run out of memory. My parser would stop memory allocations once it reached the maximum nesting level.

user2433030

Your definition of the problem does not allow for a precise answer.

I wonder why you would want to stick to JSON in the first place. It is certainly not the best format for rapid access to big data.

If you're using your Wikidata data intensively, why not convert it into a more manageable format altogether?

It should be easy to automate a DB definition that matches the format of your entries, and convert the big lump of JSON into DB records once and for all.

You can stop DB conversion at any point you like (i.e. store each JSON block as plain text or refine it further).
In the minimal case, you'll end up with a DB table holding your records indexed by name and key.
Certainly less messy than using your file system as a database (by creating millions of files named after name+key) or writing dedicated code to seek to the records.

That will probably save you a lot of disk space too, since internal DB storage is usually more efficient than plain textual representation.
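
A minimal sketch of such a conversion with SQLite, assuming (hypothetically) one JSON object per line of the dump; the key/name extraction below is just a naive placeholder for what a real JSON parser would do during the single conversion pass:

#include <fstream>
#include <string>

#include <sqlite3.h>

// Naive placeholder: finds "field":"value" in a line of JSON text. A real
// conversion would extract these with a JSON parser while streaming the dump.
static std::string extractField(const std::string& json, const std::string& field)
{
    const std::string pattern = "\"" + field + "\":\"";
    const std::size_t hit = json.find(pattern);
    if (hit == std::string::npos) return "";
    const std::size_t start = hit + pattern.size();
    const std::size_t end = json.find('"', start);
    return (end == std::string::npos) ? "" : json.substr(start, end - start);
}

int main()
{
    sqlite3* db = nullptr;
    sqlite3_open("wikidata.db", &db);  // placeholder database file

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS entities("
        " entity_key TEXT PRIMARY KEY,"
        " name       TEXT,"
        " json       TEXT);"
        "CREATE INDEX IF NOT EXISTS idx_name ON entities(name);",
        nullptr, nullptr, nullptr);

    sqlite3_stmt* insert = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO entities(entity_key, name, json) VALUES(?,?,?);",
        -1, &insert, nullptr);

    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);  // one big transaction is much faster

    std::ifstream in("wikidata.json");  // placeholder dump, one object per line
    std::string line;
    while (std::getline(in, line)) {
        sqlite3_bind_text(insert, 1, extractField(line, "id").c_str(),   -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(insert, 2, extractField(line, "name").c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(insert, 3, line.c_str(),                       -1, SQLITE_TRANSIENT);
        sqlite3_step(insert);
        sqlite3_reset(insert);
    }

    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
    sqlite3_finalize(insert);
    sqlite3_close(db);
}

Afterwards a lookup is a single indexed query, e.g. SELECT json FROM entities WHERE entity_key = 'Q549037' or SELECT json FROM entities WHERE name = 'Stack Overflow'.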

kuroi neko

I've done a bit of parsing of data out of Wikipedia. I'm particularly interested in extracting the equations, so I'm only interested in part of the file.

Firstly, if it's Wikimedia data you're interested in, it is much easier to get a Labs account. It takes about a day to do, and it will let you run much of the code on their machines, avoiding the need to download multiple gigabytes. With a Labs account you should be able to run code against a fairly up-to-date replica of the database, avoiding the need for JSON entirely.

I use a simple Python program to parse the data. It basically runs a few regexps on each line: one to find lines containing <title>...</title> so I know which Wikipedia article it is, and a few more to find the namespace and the maths tags. It can process a 160 MB file in 13 seconds, so it might be able to do the whole 36 GB in under an hour.

This code produces text files with only the data I'm interested in. If you are interested, the code is:

import sys
import re

# Pass -d to also dump each <title> line as it is found.
dump = len(sys.argv) > 1 and sys.argv[1] == '-d'

# Regexps applied to every line of the XML dump; the math tags are
# entity-encoded (&lt;math ...&gt; / &lt;/math&gt;) inside the dump.
titleRE = re.compile('<title>(.*)</title>')
nsRE = re.compile('<ns>(.*)</ns>')
mathRE = re.compile('&lt;/?math(.*?)&gt;')
pageEndRE = re.compile('</page>')   # kept from the full script, unused in this excerpt
supOc = 0                           # counters kept from the full script, unused here
supCc = 0
subOc = 0
subCc = 0

title = ""
attr = ""
ns = -1
inEqn = 0
for line in sys.stdin:
    # Remember which article we are currently in.
    m = titleRE.search(line)
    if m:
        title = m.group(1)
        expression = ""
        if dump: print line
        inEqn = 0
    m = nsRE.search(line)
    if m:
        ns = m.group(1)
    # Scan this line for (possibly several) math tags.
    start = 0
    pos = 0
    m = mathRE.search(line, pos)
    while m:
        if m.group().startswith('&lt;math'):
            attr = m.group(1)        # attributes of the opening tag
            start = m.end()
            pos = start
            expression = ""
            inEqn = 1
        if m.group() == '&lt;/math&gt;':
            end = m.start()
            expression = '    '.join([expression, line[start:end]])
            # One output line per equation: article title, tag attributes, LaTeX.
            print title, '\t', attr, '\t', expression.lstrip().replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&')
            pos = m.end()
            expression = ""
            start = 0
            inEqn = 0
        m = mathRE.search(line, pos)
    # The equation continues onto the following line(s).
    if start > 0:
        expression = line[start:].rstrip()
    elif inEqn:
        expression = '    '.join([expression, line.rstrip()])

Sorry if it's a bit cryptic, but it was not meant for public consumption. Sample output is:

Arithmetic mean         a_1,\ldots,a_n.
Arithmetic mean         A
Arithmetic mean         A=\frac{1}{n}\sum_{i=1}^{n} a_i
Arithmetic mean         \bar{x}

Each line has the name of the article and the LaTeX equation. This reduces the data I need to work with down to a more manageable 500k. I'm not sure if such a strategy would work for your application.

For the main enwiki data, they split the XML dumps into 27 smaller files of roughly equal size. You might find a few reasonably sized files easier to work with than either one giant file or millions of tiny files. It might be easy to split by the first letter of the article title, giving fewer than a hundred files, each less than a gigabyte.

Salix alba