7

I'm trying to write some code that processes a JSON document with extremely long string values (longer than 1 billion characters) stored in a file. I don't want to keep the whole strings in memory (since I can process them as a stream). But I can't find such an option in the Jackson parser. What I've done so far is this test using Jackson token offsets (first pass over the file) and a random access file to process the strings as a stream (second pass over the file):

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.MappingJsonFactory;

public class LongStringJsonTest {
    public static void main(String[] args) throws Exception {
        File tempJson = new File("temp.json");
        PrintWriter pw = new PrintWriter(tempJson);
        pw.print("{\"k1\": {\"k11\": \"");
        for (int i = 0; i < 1e8; i++)
            pw.print("abcdefghij"); 
        pw.print("\"}, \"k2\": \"klmnopqrst\", " +
                "\"k3\": [\"uvwxyz\", \"0123\"]}");
        pw.close();
        searchForStrings(tempJson);
    }

    private static void searchForStrings(File tempJson) throws Exception {
        JsonFactory f = new MappingJsonFactory();
        JsonParser jp = f.createParser(tempJson);
        Map<Long, Long> stringStartToNext = new HashMap<Long, Long>();
        long lastStringStart = -1;
        boolean wasFieldBeforeString = false;
        while (true) {
            JsonToken token = jp.nextToken();
            if (token == null)
                break;
            if (lastStringStart >= 0) {
                stringStartToNext.put(lastStringStart, (wasFieldBeforeString ? -1 : 1) *
                        jp.getTokenLocation().getByteOffset());
                lastStringStart = -1;
                wasFieldBeforeString = false;
            }
            if (token == JsonToken.FIELD_NAME) {
                wasFieldBeforeString = true;
            } else if (token == JsonToken.VALUE_STRING) {
                lastStringStart = jp.getTokenLocation().getByteOffset();
            } else {
                wasFieldBeforeString = false;
            }
        }
        jp.close();
        jp = f.createParser(tempJson);
        RandomAccessFile raf = new RandomAccessFile(tempJson, "r");
        while (true) {
            JsonToken token = jp.nextToken();
            if (token == null)
                break;
            if (token == JsonToken.VALUE_STRING) {
                long start = jp.getTokenLocation().getByteOffset();
                long end = stringStartToNext.get(start);
                // Here you can process the stream without keeping all the bytes in memory.
                // Note that the strings you see here include the quotes around them.
                final long[] length = new long[] {0};
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                OutputStream os = new OutputStream() {
                    @Override
                    public void write(int b) throws IOException {
                        throw new IOException("Method is not supported");
                    }
                    @Override
                    public void write(byte[] b, int off, int len)
                            throws IOException {
                        if (baos.size() < 20) {
                            baos.write(b, off, Math.min(len, 20));
                            baos.write((int)'.');
                            baos.write((int)'.');
                            baos.write((int)'.');
                        }
                        if (len > 0)
                            length[0] += len;
                    }
                };
                processString(raf, start, end, os);
                String text = new String(baos.toByteArray(), Charset.forName("utf-8"));
                System.out.println("String: " + text + ", length=" + length[0]);
            }
        }
        jp.close();
        raf.close();
    }

    private static void processString(RandomAccessFile raf, long start, long end, 
            OutputStream os) throws Exception {
        boolean wasFieldBeforeString = end < 0;
        int quoteNum = wasFieldBeforeString ? 3 : 1;
        end = Math.abs(end);
        byte[] buffer = new byte[10000];
        raf.seek(start);
        boolean afterBackSlash = false;
        int strLen = (int)(end - start);
        for (int chunk = 0; strLen > 0; chunk++) {
            int ret = raf.read(buffer, 0, Math.min(buffer.length, strLen));
            if (ret < 0)
                break;
            if (ret > 0) {
                int offset = 0;
                if (chunk == 0) {
                    // Assumption that key string doesn't contain double quotes 
                    // and it's shorter than buffer size (for simplicity)
                    for (int n = 0; n < quoteNum; n++) {
                        while (true) {
                            if (buffer[offset] == '\"' && !afterBackSlash) {
                                break;
                            } else if (buffer[offset] == '\\') {
                                afterBackSlash = !afterBackSlash;
                            } else {
                                afterBackSlash = false;
                            }
                            offset++;
                        }
                        offset++;
                    }
                    offset--;
                    ret -= offset;
                }
                // Searching for ending quote
                int endQuotePos = offset + (chunk == 0 ? 1 : 0); // Skip open quote
                while (endQuotePos < offset + ret) {
                    if (buffer[endQuotePos] == '\"' && !afterBackSlash) {
                        break;
                    } else if (buffer[endQuotePos] == '\\') {
                        afterBackSlash = !afterBackSlash;
                    } else {
                        afterBackSlash = false;
                    }
                    endQuotePos++;
                }
                if (endQuotePos < offset + ret) {
                    os.write(buffer, offset, endQuotePos + 1 - offset);
                    break;
                }
                os.write(buffer, offset, ret);
                strLen -= ret;
            }
        }
    }
}

This approach doesn't support Unicode at all. I'm curious whether there is any way to do it better (perhaps with the help of some other library)?

rsutormin
  • I would move to an event-based parser. Look at this question: http://stackoverflow.com/questions/444380/is-there-a-streaming-api-for-json . Using a "streaming" or "event" parser will allow you to hold smaller-sized pieces of the JSON data at any given time. I don't have time right now to write an answer or I would ;) – ug_ Sep 12 '15 at 01:06
  • You should seriously reconsider whether JSON is the right choice if you have a single value of 1 billion characters. What single text value is that big, excluding hex or base64 encodings of binary data? – Andreas Sep 12 '15 at 01:59
  • It's going to be part of a genomic service allowing users to store data related to genomes and/or computational biology objects. The JSON format can reflect the whole variety of typed objects (with sub-lists, sub-maps and so on, recursively). It seems to be a standard for serialized documents. Why can't we think about using it for biological documents as well (I mean in general, aside from particular algorithmic problems)? On the other hand, you might be right that it could require processing long strings and saving them separately from the main document with external references. – rsutormin Sep 12 '15 at 03:23
  • But still, this is one of the tests checking what is possible and what is not. What alternative would you suggest for complex biological documents with many internal levels, possibly including long DNA sequences? – rsutormin Sep 12 '15 at 03:24
  • @rsutormin JSON is absolutely *not* the "standard for serialized documents"... It was originally a subset of JS. If I were you, I'd seriously consider using something else. Anything else. – Navin Sep 12 '15 at 03:35
  • @Navin, ok, maybe I used the wrong term. But it's listed here: https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats – rsutormin Sep 12 '15 at 04:05
  • @Navin, I will. I need to solve this problem anyway, even if I have to switch to a different lib/standard. I just started with JSON+Jackson because it seemed worth trying. – rsutormin Sep 12 '15 at 04:12
  • @rsutormin It's hard to say exactly what a better option for data storage would be, but I agree with Navin on *...something else. Anything else*. I would highly consider a SQL database; if your data structure doesn't fit well into that format (a relational database), then I would look at using something like MongoDB. However, given how common the SQL format is, it would certainly be my first place to look for data storage, not to mention all the other perks that come with a strong database (indexing, foreign keys, strong datatypes). – ug_ Sep 12 '15 at 04:30
  • @ug_ There will be no predefined data structure in the storage. There is a dynamic set of types (in the case of JSON it will be JSON Schema). Users should be able to define their own types (with some restrictions, but not as a fixed structure of tables and relations between them). When a type is added, anybody should be able to upload documents of this type (with JSON validation of the correspondence between document and type). Separately, there should be some tool for extracting different parts of a document on request through an API (in the case of JSON it would be done using token streams too). – rsutormin Sep 12 '15 at 05:06
  • @ug_ But if you have in mind any alternative for all these needs, feel free to share. It doesn't look obvious to me that it's all easy to do in a relational DB. Mongo has a limitation of 16 megs per document: http://docs.mongodb.org/master/reference/limits/ – rsutormin Sep 12 '15 at 05:06
  • @rsutormin I see your problem. I have a few thoughts about it, but this particular discussion is shifting away from your current question, and continuing it in this context isn't the best for StackOverflow. I'll be in the Java chat room https://chat.stackoverflow.com/rooms/139/java for the next hour or 2 if you would like to chat more about alternatives. – ug_ Sep 12 '15 at 05:24
  • @rsutormin I think the "database" you're looking for is really a filesystem :P Most people store >1 GB strings as a file and store a path to the file inside the JSON or database record. Just use a hash or random value as the filename and you're golden. – Navin Sep 13 '15 at 01:14

3 Answers

1

Now I know that the JSON format is not the best solution for documents with very long string values. But just in case someone faces a similar problem (for instance, when such a JSON file is already given and needs to be transformed into some better format), the document still has to be parsed somehow at least once. So here is my investigation:

1) FasterXML/Jackson token streaming doesn't offer a standard way of dealing with long strings (loading them in parts). The only way I've found to process them is to do something like what I do in the question, plus deal with Unicode manually.
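
Just to illustrate the point (this is my own sketch against the temp.json file from the question, not a recommended approach): even the lowest-level accessors, getTextCharacters()/getTextLength(), hand the value back only after the parser has buffered the complete token, so on the real test file this would simply run out of memory:

private static void useJacksonLowLevel(File tempJson) throws Exception {
    JsonParser jp = new JsonFactory().createParser(tempJson);
    while (jp.nextToken() != null) {
        if (jp.getCurrentToken() == JsonToken.VALUE_STRING) {
            // No extra String copy is made here, but the parser has already
            // decoded the complete value into its internal buffer, so this
            // still blows up on a billion-character value.
            char[] chars = jp.getTextCharacters();
            System.out.println("String of length " + jp.getTextLength()
                    + " starting with '" + chars[jp.getTextOffset()] + "'");
        }
    }
    jp.close();
}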

2) Google/Gson has JsonReader, which also lets users process JSON as a token stream. There is a nextString method (https://github.com/google/gson/blob/master/gson/src/main/java/com/google/gson/stream/JsonReader.java#L816). But there is no way to get the value in parts or to get any information about its position in the JSON file (except through a couple of private methods: https://github.com/google/gson/blob/master/gson/src/main/java/com/google/gson/stream/JsonReader.java#L1317-L1323).
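
For illustration, here is a sketch of mine of the token-stream walk with JsonReader (same temp.json file as in the question). nextString() is the only accessor for string values, so the whole value is still materialized as one String:

private static void useGson(File tempJson) throws Exception {
    JsonReader reader = new JsonReader(new FileReader(tempJson));
    while (true) {
        JsonToken token = reader.peek();
        if (token == JsonToken.END_DOCUMENT)
            break;
        switch (token) {
        case BEGIN_OBJECT: reader.beginObject(); break;
        case END_OBJECT: reader.endObject(); break;
        case BEGIN_ARRAY: reader.beginArray(); break;
        case END_ARRAY: reader.endArray(); break;
        case NAME: reader.nextName(); break;
        case STRING: {
            // The complete value is materialized as a single java.lang.String
            // right here; there is no way to read it in parts.
            String s = reader.nextString();
            System.out.println("String of length " + s.length());
            break;
        }
        default:
            reader.skipValue();
        }
    }
    reader.close();
}

(Here JsonReader/JsonToken are com.google.gson.stream classes, not the Jackson ones.)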

3) fangyidong/json-simple uses a SAX-style push interface. But for strings there is only one callback method there: https://github.com/fangyidong/json-simple/blob/master/src/main/java/org/json/simple/parser/ContentHandler.java#L108
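
Again only as a sketch of mine to show the shape of that interface: primitive(Object) receives the already-built value, so a huge string is fully in memory by the time the callback fires:

private static void useJsonSimple(File tempJson) throws Exception {
    new JSONParser().parse(new FileReader(tempJson), new ContentHandler() {
        public void startJSON() {}
        public void endJSON() {}
        public boolean startObject() { return true; }
        public boolean endObject() { return true; }
        public boolean startObjectEntry(String key) { return true; }
        public boolean endObjectEntry() { return true; }
        public boolean startArray() { return true; }
        public boolean endArray() { return true; }
        public boolean primitive(Object value) {
            // The only callback for values: the complete string arrives here.
            if (value instanceof String)
                System.out.println("String of length " + ((String) value).length());
            return true;
        }
    });
}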

4) My only remaining hope was beckchr/StAXON, because it transforms JSON into XML and then uses XMLStreamReader. There is a method that allows reading a string in parts: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/javax/xml/stream/XMLStreamReader.java#XMLStreamReader.getTextCharacters%28int%2Cchar%5B%5D%2Cint%2Cint%29 . But unfortunately an OutOfMemoryError happens right during the JSON parsing phase of the transformation. Here is my code:

private static void useStaxon(File tempJson) throws Exception {
    XMLInputFactory factory = new JsonXMLInputFactory();
    XMLStreamReader reader = factory.createXMLStreamReader(new FileReader(tempJson));
    while (true) {
        if (reader.getEventType() == XMLStreamConstants.END_DOCUMENT)
            break;
        if (reader.isCharacters()) {
            long len = reader.getTextLength();
            String text;
            if (len > 20) {
                char[] buffer = new char[20];
                reader.getTextCharacters(0, buffer, 0, buffer.length);
                text = new String(buffer) + "...";
            } else {
                text = reader.getText();
            }
            System.out.println("String: " + text + " (length=" + len + ")");
        }
        reader.next();
    }
    reader.close();
}

Error stack trace is:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at de.odysseus.staxon.json.stream.impl.Yylex.zzRefill(Yylex.java:346)
    at de.odysseus.staxon.json.stream.impl.Yylex.yylex(Yylex.java:600)
    at de.odysseus.staxon.json.stream.impl.Yylex.nextSymbol(Yylex.java:271)
    at de.odysseus.staxon.json.stream.impl.JsonStreamSourceImpl.next(JsonStreamSourceImpl.java:120)
    at de.odysseus.staxon.json.stream.impl.JsonStreamSourceImpl.peek(JsonStreamSourceImpl.java:250)
    at de.odysseus.staxon.json.JsonXMLStreamReader.consume(JsonXMLStreamReader.java:150)
    at de.odysseus.staxon.json.JsonXMLStreamReader.consume(JsonXMLStreamReader.java:153)
    at de.odysseus.staxon.json.JsonXMLStreamReader.consume(JsonXMLStreamReader.java:183)
    at de.odysseus.staxon.json.JsonXMLStreamReader.consume(JsonXMLStreamReader.java:153)
    at de.odysseus.staxon.json.JsonXMLStreamReader.consume(JsonXMLStreamReader.java:183)
    at de.odysseus.staxon.base.AbstractXMLStreamReader.initialize(AbstractXMLStreamReader.java:216)
    at de.odysseus.staxon.json.JsonXMLStreamReader.initialize(JsonXMLStreamReader.java:87)
    at de.odysseus.staxon.json.JsonXMLStreamReader.<init>(JsonXMLStreamReader.java:78)
    at de.odysseus.staxon.json.JsonXMLInputFactory.createXMLStreamReader(JsonXMLInputFactory.java:150)
    at de.odysseus.staxon.json.JsonXMLInputFactory.createXMLStreamReader(JsonXMLInputFactory.java:45)
    at test20150911.LongStringJsonTest.useStaxon(LongStringJsonTest.java:40)
    at test20150911.LongStringJsonTest.main(LongStringJsonTest.java:35)

5) My final hope was some tool written in C that transforms JSON into BSON first; with BSON I would then try to do some better processing. The best-known one seems to be https://github.com/dwight/bsontools . When I ran the "fromjson" command-line tool from that package on my 1 GB JSON file, it loaded the whole file into memory (which is horrible) and then kept working for 10 minutes. I actually didn't wait until the end, because 10 minutes is too much for a 1 GB file, isn't it? (Note: the Java code in my question runs in less than half a minute.)

So the final answer is: (1) no, it looks like there is no standard way to achieve the goal in question, and (2) using FasterXML/Jackson as in the question is probably the best Java solution available in this case.

rsutormin
  • I don't know if this still matters to you, but I have a similar need for parsing really long strings in JSON, and the solution I came up with is to break the string up into chunks and store an array of those chunked strings in the JSON. That way you can process the incoming JSON stream with any of the Gson or Jackson streaming APIs. – bhspencer May 10 '22 at 18:24
  • Nice research. I wonder how far this has got since 2015. :) – Ondra Žižka Apr 19 '23 at 09:01
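
A sketch of mine of the chunking idea from bhspencer's comment above (the chunk size is arbitrary, and the Reader parameter stands in for wherever the huge value comes from): the writer emits the long value as a JSON array of smaller strings, which any Jackson or Gson streaming reader can then consume element by element:

private static void writeChunked(File out, Reader longValue) throws Exception {
    JsonGenerator gen = new JsonFactory().createGenerator(out, JsonEncoding.UTF8);
    gen.writeStartObject();
    gen.writeFieldName("k11");
    gen.writeStartArray();                   // the long value becomes an array of chunks
    char[] buf = new char[1 << 20];          // 1M characters per chunk (arbitrary choice)
    int n;
    while ((n = longValue.read(buf)) > 0) {
        gen.writeString(buf, 0, n);          // one small, independently readable element
    }
    gen.writeEndArray();
    gen.writeEndObject();
    gen.close();
}
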
0

I think you're asking the wrong question.

JSON, like XML or CSV, or any other structured-text representation, has three primary roles: making the data structure human-parseable, permitting generic tools to process many different kinds of data, and facilitating data exchange between systems that may use different internal models.

If you do not need those specific characteristics, structured text is probably the wrong solution. A dedicated binary representation may be much more efficient, and that difference can become huge as the size/complexity of the data grows.

Support the structured-text format for import to and export from your tooling. Internally, though, you should probably be using a data model tuned specifically for the needs of your particular tasks.

keshlam
  • Ok, it's a good idea. But how am I supposed to support importing structured-text data with very long strings (for instance, in JSON) into my internal representation without the ability to parse it? My question was about parsing it, even if that is done only once during upload. – rsutormin Sep 13 '15 at 03:05
  • In that case, see the other answers: use a parser that does not require having the entire text document in memory at once, and convert it to your internal representation incrementally. In XML, that would mean using an event-based parser (SAX or a similar API). I haven't had reason to dig into exactly what parsers/APIs are available for JSON; that may depend in part on what language you're working in. – keshlam Sep 13 '15 at 03:26
0

Maybe this is a valid case where you write your own parser?

JSON parsing should be relatively simple to implement using a PushbackReader.
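
As a rough illustration of this suggestion (my own sketch, not Stefan's code): a hand-rolled tokenizer can stream a string value straight to a Writer instead of building one huge String. Only the string-value path is shown here; a real parser would need the pushback capability elsewhere, e.g. when reading numbers:

// Copies one JSON string value to 'out', assuming the opening quote
// has already been consumed. Nothing is accumulated in memory.
private static void streamStringValue(PushbackReader in, Writer out)
        throws IOException {
    while (true) {
        int c = in.read();
        if (c < 0)
            throw new IOException("Unexpected end of input inside string");
        if (c == '"')
            return;                              // closing quote reached
        if (c == '\\') {
            int esc = in.read();
            switch (esc) {
            case '"':  out.write('"');  break;
            case '\\': out.write('\\'); break;
            case '/':  out.write('/');  break;
            case 'b':  out.write('\b'); break;
            case 'f':  out.write('\f'); break;
            case 'n':  out.write('\n'); break;
            case 'r':  out.write('\r'); break;
            case 't':  out.write('\t'); break;
            case 'u':
                char[] hex = new char[4];
                if (in.read(hex, 0, 4) != 4)
                    throw new IOException("Broken \\u escape");
                out.write((char) Integer.parseInt(new String(hex), 16));
                break;
            default:
                throw new IOException("Unknown escape: \\" + (char) esc);
            }
        } else {
            out.write(c);
        }
    }
}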

Stefan Haustein
  • Not that it's a wrong answer. It's the universal one: invent everything you need yourself. It's the last thing I would like to spend time on if I can avoid it, because it means you'll have to fix all the bugs that the authors and communities of other libs have fixed over the years. I don't want to say it's the wrong way or impossible. It would just be an unfortunate waste of time. – rsutormin Sep 15 '15 at 06:43
  • It's just that some of the solutions suggested here seem more complex than doing this, as the JSON grammar is quite simple: http://json.org Another option might be to take an existing simple streaming JSON parser and extend it as needed. – Stefan Haustein Sep 15 '15 at 08:42
  • P.S.: Implemented here: https://github.com/kobjects/json/blob/master/src/main/java/org/kobjects/json/JsonTokenizer.java – Stefan Haustein Sep 15 '15 at 23:05