
I am looking to load a large JSON file of over 100 GB. The objects in this file aren't static and are almost never the same. I found this crate called nop-json (https://crates.io/crates/nop-json/2.0.5), but I was unable to get it to work in the way that I want. This is my current solution, but it feels a bit like cheating.

    use std::fs::File;
    use std::io::{BufRead, BufReader};

    let file = File::open("./file.json")?;
    let reader = BufReader::new(file);
    for line in reader.lines() {
        let line = line?; // each item from `lines()` is an io::Result<String>
        //code
    }

I am reading the file like a text file and iterating that way. The problem with this solution is that I am reading it as a string and that it loads the entire file into memory.

I am new to Rust, so I am looking for some help on this problem. I have a successful implementation in Python and it works great, but it's too slow.

edit:

Thank you for the replies so far here is some more information:

My *.json file has one array containing millions of objects. Example:

[
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    }
]

etc..

The problem with reading the file as a text file this way is that not every object is exactly one line; the number of lines per object varies.

A possible solution might be to read a chunk of the file and then check where the JSON object ended via something like a `}, {` pattern, but this seems inefficient.
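
For illustration, here is a small, purely hypothetical example of why matching on a `}, {` pattern is unreliable: the pattern can legally appear inside a string value (or around nested objects), so splitting on it would cut valid elements apart.

    fn main() {
        // Hypothetical data: the "}, {" pattern appears both *between* the two
        // array elements and *inside* a string value of the first element.
        let tricky = r#"[{"note": "a string containing }, { characters"}, {"nested": {"a": 1}, "b": 2}]"#;

        // A naive splitter finds 3 pieces even though the array holds only 2
        // elements, so pattern matching alone cannot locate object boundaries.
        let pieces: Vec<&str> = tricky.split("}, {").collect();
        println!("naive split produced {} pieces", pieces.len()); // prints 3
    }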

  • possible duplicate. please see this answer: https://stackoverflow.com/questions/45882329/read-large-files-line-by-line-in-rust – Urban48 Mar 01 '23 at 21:08
  • If the objects in the file are guaranteed to be line-delimited, reading the file line-by-line and deserializing each line individually is a perfectly fine approach. When you say "... and that loads the entire file into memory." it sounds like the file is just one giant line, but then why use `BufReader` and a loop in the first place? If it's not, and you're reading it line-by-line, you are not actually reading "the entire file into memory", it's just the OS caching it for future reads in unoccupied memory – user2722968 Mar 01 '23 at 21:08
  • Does this answer your question? [Read large files line by line in Rust](https://stackoverflow.com/questions/45882329/read-large-files-line-by-line-in-rust) – Urban48 Mar 01 '23 at 21:09
  • I don't think the question currently has enough information for a good answer. The viability of any solution depends very much on the particulars of your json file, particularly if we can decompose the json file into parts that can be individually serialized. Can the question be updated to include a small representative example of the json file? – effect Mar 01 '23 at 21:17
  • If this is actually JSON and not line-delimited JSON ([NDJSON](http://ndjson.org)) then there is only one line to read. If you need a streaming JSON parser, that's a whole other thing. – tadman Mar 01 '23 at 21:37
  • @tadman Even if you had a streaming json parser - JSON does not define offsets or jump tables or anything similar. So for every single value you query you would have to read the **entire** json until the point where the variable lies. And even then you would technically have to read the rest to verify that the file is actually valid json. JSON is simply the wrong data format for huge single objects. – Finomnis Mar 01 '23 at 22:43
  • @Finomnis It's far from an ideal format for this, but there are degrees of finesse you can apply here, as [there's variation in the performance of different parsers](https://github.com/serde-rs/json-benchmark), though you're right that the size is going to be punishing no matter what. – tadman Mar 01 '23 at 22:49
  • I have updated the question with some more information @user2722968. Your solution would have worked if it was actually line by line; sorry for the misinformation. – d-dutchview Mar 01 '23 at 22:52
  • @d-dutchview Is the data format fixed? If yes, I'm sorry that I have to inform you that JSON is simply the wrong format to represent data in the size of 100GB ... You pretty much have to implement your own parser, because no sane official parser will provide this functionality. Your own parser will have to **guess** that the rest of the file is valid, because technically, in order to accept the first item of the array, your parser must verify that the array ends with a `]`. And it can't do that in reverse, because the `]` at the end could be for another array. – Finomnis Mar 01 '23 at 22:53
  • I know, @Finomnis. It's extremely tedious and annoying, but I am trying to find a solution here. I have updated my question with a possible solution derived from your answer. – d-dutchview Mar 01 '23 at 22:55
  • @d-dutchview What if you have a nested object? Then `},{` could exist **within** the object. JSON is very context sensitive, and everything is variable length. You really do have to read everything to verify the first element. – Finomnis Mar 01 '23 at 22:58
  • Correct. But I am not trying to reinvent the wheel here. I was hoping that there would have been a well-made and robust JSON streamer. But as it stands right now there is no known solution for this? – d-dutchview Mar 01 '23 at 23:00
  • I mean sure, if you know beforehand that your data structure is **guaranteed** to be a `List[Map[String, String]]` (pseudocode annotation), then this problem gets a lot easier. But then it's no longer JSON. Is that the case? – Finomnis Mar 01 '23 at 23:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/252241/discussion-between-d-dutchview-and-finomnis). – d-dutchview Mar 01 '23 at 23:02
  • Maybe a duplicate of [How can I stream elements from inside a JSON array using serde_json?](https://stackoverflow.com/q/68641157) – Marcono1234 Aug 25 '23 at 22:00

1 Answer


First off, if you accept normal full JSON, your problem is really hard.

So I assume the following:

  • Your file always starts with a [.
  • Then, an arbitrary number of valid JSON strings follow, separated by ,.
  • After the JSON strings there is another ].
  • Every single JSON string is small enough to be parsed and held in memory in its entirety.

Meaning, we now have a stream of separate JSON values, wrapped in an array whose brackets and commas we handle ourselves.

With that, we can utilize serde_json and a little bit of glue to parse the file value by value:

use std::error::Error;
use std::io::Read;

use serde_json::{Deserializer, Value};

const JSON_FILE: &[u8] = br#"[
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    }
]"#;

fn open_file() -> impl Read {
    JSON_FILE
}

fn take_json_value(input_stream: &mut dyn Read) -> Result<Value, Box<dyn Error>> {
    Ok(Deserializer::from_reader(input_stream)
        .into_iter()
        .next()
        .ok_or("Expected a JSON value!")??)
}

fn main() {
    // Is of type `impl Read`, and can only be read once.
    // (to reproduce the situation of reading a file)
    let mut input_stream = open_file();

    // Skip initial `[`
    let mut skipped = 0u8;
    input_stream
        .read_exact(std::slice::from_mut(&mut skipped))
        .unwrap();
    assert_eq!(skipped, b'[');

    loop {
        let value = take_json_value(&mut input_stream).unwrap();

        println!("- {}", value);

        // Skip `,` after the value
        input_stream
            .read_exact(std::slice::from_mut(&mut skipped))
            .unwrap();
        if skipped != b',' {
            break;
        }
    }

    // Verify that the ending `]` exists
    let mut leftover_data = vec![b'[', skipped];
    input_stream.read_to_end(&mut leftover_data).unwrap();
    serde_json::from_slice::<[u8; 0]>(&leftover_data).unwrap();
}

Output:

- Object {"bar": String("foor"), "foo": String("bar")}
- Object {"bar": String("foor"), "foo": String("bar")}
- Object {"bar": String("foor"), "foo": String("bar")}
- Object {"bar": String("foor"), "foo": String("bar")}
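
To run this against the real file from the question, only `open_file` needs to change so that it streams the file through a buffered reader instead of returning the embedded test bytes. And if, as discussed in the comments, every element is guaranteed to be a flat string-to-string map, each value can also be deserialized into a typed container instead of a generic `Value`. A minimal sketch of both replacements (the `./file.json` path and the `HashMap<String, String>` element shape are assumptions taken from the question, not part of the original answer):

use std::collections::HashMap;
use std::error::Error;
use std::fs::File;
use std::io::{BufReader, Read};

use serde_json::Deserializer;

// Hypothetical replacement for `open_file`: stream the real file instead of
// the embedded test bytes. BufReader reads in chunks, so the 100 GB file is
// never held in memory at once.
fn open_file() -> impl Read {
    BufReader::new(File::open("./file.json").expect("failed to open ./file.json"))
}

// Hypothetical typed variant of `take_json_value`: deserialize each array
// element straight into a HashMap<String, String>, assuming the flat
// string-to-string shape shown in the question.
fn take_record(input_stream: &mut dyn Read) -> Result<HashMap<String, String>, Box<dyn Error>> {
    Ok(Deserializer::from_reader(input_stream)
        .into_iter()
        .next()
        .ok_or("Expected a JSON value!")??)
}

The rest of `main` stays the same, because both helpers only rely on the `Read` trait.
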
Finomnis
  • Correct. I have updated the question with some more information. – d-dutchview Mar 01 '23 at 22:51
  • @d-dutchview Also, use `cargo run --release` or `cargo build --release` if you need performance. I just mention it because we regularly get questions here about why Rust is slow - the Rust compiler performs very little optimization unless you add the `--release` flag. But you probably already know. Just in case. – Finomnis Mar 02 '23 at 00:42