Cross-posting from Julia Discourse in case anyone here has any leads.

I’m looking for some insight into why the code below returns a DataFrame containing just the first line of my JSON file. If you’d like to try the file I’m working with, you can download aminer_papers_0.zip from the Microsoft Open Academic Graph site; I’m using the first file in that group of files.

using JSON3, DataFrames, CSV
file_name = "path/aminer_papers_0.txt"
json_string = read(file_name, String)
js = JSON3.read(json_string)
df = DataFrame([js])

The resulting DataFrame has just one row, but the column names are correct, as is the first row. To me the mystery is why the rest isn’t getting processed. I think I can rule out that read() is only reading the first JSON object, because I can index into the resulting object and see many JSON objects.

My first guess was that the newline characters (\n) were causing escaping issues, so I tried to use chomp to get rid of them, but couldn’t get it to work.

Anyway - any help would be greatly appreciated!

clibassi
  • can you share the file you are processing? The whole zip would take 9h to download on my machine. – Bogumił Kamiński May 14 '21 at 14:29
  • Yes - I will create a sharing link soon. The individual files are also quite big, though, so it will take a few minutes to put in a place I can share it from. – clibassi May 14 '21 at 14:56
  • Okay - I think you should be able to get it here: https://www.dropbox.com/s/mfv8e7uc2267786/aminer_papers_0.txt?dl=0 – clibassi May 14 '21 at 15:22
  • OK - I see that the file is 9 GB. Unfortunately I do not have enough RAM to process it. Maybe @quinnj will be able to have a look at your problem. I will ping him. Have you tried using https://github.com/JuliaData/JSONTables.jl? – Bogumił Kamiński May 14 '21 at 15:26
  • Okay - no trouble. And I haven't given JSONTables a proper try yet. – clibassi May 14 '21 at 15:29
  • Just tried - and I think the deeper nesting of this JSON might be preventing it from working – clibassi May 14 '21 at 15:31

1 Answer

I think the problem is that the file is in JSON Lines format, and the JSON3 library only returns the first valid JSON value that it finds at the start of a string unless told otherwise.

tl;dr

Call JSON3.read with the keyword argument jsonlines=true.

Why?

By default, JSON3 interprets a string passed to its read function as a single "JSON text", defined in section 2 of RFC 8259:

A JSON text is a serialized value....

(My emphasis on the use of the indefinite singular article "a.") A "JSON value" is defined in section 3:

A JSON value MUST be an object, array, number, or string, or one of the following three literal names: false, null, true.

A string with multiple JSON values in it is technically multiple "JSON texts." It is up to the parser to determine what part of the string argument you give it is a JSON text, and the authors of JSON3 chose as the default behavior to parse from the start of the string to the end of the first valid JSON value.

In order to get JSON3 to read the string as multiple JSON values, you have to give it the keyword option jsonlines=true, which is documented as:

jsonlines: A Bool indicating that the json_str contains newline delimited JSON strings, which will be read into a JSON3.Array of the JSON values. See jsonlines for reference. [default false]

Example

Take for example this simple string:

two_values = "3.14\n2.72"

Each one of these lines is a valid JSON serialization of a number. However, when passed to JSON3.read, only the first is parsed:

using JSON3
@assert JSON3.read(two_values) == 3.14

Using jsonlines=true, both values are parsed and returned as a JSON3.Array struct:

@assert JSON3.read(two_values, jsonlines=true) == [3.14, 2.72]
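
The same idea applied to records shaped more like the ones in the question might look like the sketch below. The field names (`id`, `title`, `year`) and the sample records are invented for illustration and are not the real file's schema. The columns are built by hand with `get`, so that a key absent from a record becomes `missing` rather than an error.

```julia
using JSON3, DataFrames

# Hypothetical sample: three JSON objects, one per line (JSON Lines).
# The field names are illustrative assumptions, not the real file's schema.
sample = join([
    """{"id": "a1", "title": "First paper", "year": 2001}""",
    """{"id": "b2", "title": "Second paper", "year": 2005}""",
    """{"id": "c3", "title": "Third paper"}""",
], "\n")

# Without jsonlines=true, only the first object is parsed.
first_only = JSON3.read(sample)

# With jsonlines=true, every line is parsed into one JSON3.Array.
records = JSON3.read(sample, jsonlines=true)

# Build one column per field; get(...) supplies `missing` for absent keys.
df = DataFrame(
    id    = [get(r, :id, missing) for r in records],
    title = [get(r, :title, missing) for r in records],
    year  = [get(r, :year, missing) for r in records],
)
```

Deeply nested fields would still need to be flattened or kept as nested objects in a column; this sketch only covers top-level keys.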

Other Packages

The JSON.jl library, which people might use by default given the name, does not implement parsing of JSON Lines strings at all, leaving it up to the caller to properly split the string as needed:

using JSON
JSON.parse(two_values)
# ERROR: Expected end of input
# Line: 1
# Around: ...3.14 2.72...
#                 ^

A simple way to implement reading multiple values is to use eachline:

@assert [JSON.parse(line) for line in eachline(IOBuffer(two_values))] == [3.14, 2.72]
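
The per-line approach also works directly on a file, which may help when the whole file is too large to read into a single string. This is a rough sketch that assumes the file really is JSON Lines (one complete value per line); the temporary file and the field name `n` are made up for the example.

```julia
using JSON3

# Write a small hypothetical JSON Lines file to a temporary location.
path, io = mktemp()
write(io, "{\"n\": 1}\n{\"n\": 2}\n{\"n\": 3}\n")
close(io)

# Parse one line at a time and keep only the field of interest,
# skipping blank lines so that an empty line does not reach the parser.
ns = [JSON3.read(line).n for line in eachline(path) if !isempty(strip(line))]
```

Only one line of text is parsed at a time here, so memory use depends on what is kept from each record rather than on the size of the file.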
PaSTE
  • I think you have to be careful about using `eachline`, as it will split JSON strings that contain newlines themselves, such as `{"a": "new\nline"}`. To avoid this, you'd have to write your own JSON parser... easier to just use JSON3. – BallpointBen May 15 '21 at 18:45
  • Great catch! Except technically a string value with an unescaped control character, like \n, is invalid JSON by [section 7 of RFC 8259](https://datatracker.ietf.org/doc/html/rfc8259#section-7), so that example should fail when parsed by a JSON parser that conforms to the standard. I checked, and JSON3 will not accept such a string as a value. – PaSTE May 15 '21 at 21:04
  • To clarify, if the code point U+000A (line feed) appears unescaped inside a string value, then that is an invalid JSON string. However, line feeds can appear outside string values in valid JSON (they are considered whitespace and ignored), and the `eachline` method above _will_ fail when reading text formatted like that. Like BallpointBen says, it's better to just use a parser like `JSON3` that correctly handles the JSON Lines format. – PaSTE May 15 '21 at 21:16
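
The whitespace caveat from the last comment can be sketched as follows. The sample value is invented; the point is only that a single JSON value spread over several lines is valid JSON, but is not one value per line, so splitting on newlines hands the parser incomplete fragments.

```julia
using JSON3

# One JSON object, pretty-printed across several lines. This is valid
# JSON, but it is not JSON Lines, which expects one value per line.
pretty = """
{
  "a": 1
}"""

# Parsing the whole string works: newlines between tokens are whitespace.
whole = JSON3.read(pretty)

# Parsing line by line fails, because "{" alone is not a complete value.
per_line_failed = try
    [JSON3.read(line) for line in eachline(IOBuffer(pretty))]
    false
catch
    true
end
```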