52

I want to convert a standard JSON file to one in which each line contains a separate, self-contained, valid JSON object. See JSON Lines.

JSON_file =

[{u'index': 1,
  u'no': 'A',
  u'met': u'1043205'},
 {u'index': 2,
  u'no': 'B',
  u'met': u'000031043206'},
 {u'index': 3,
  u'no': 'C',
  u'met': u'0031043207'}]

To JSONL:

{u'index': 1, u'no': 'A', u'met': u'1043205'}
{u'index': 2, u'no': 'B', u'met': u'000031043206'}
{u'index': 3, u'no': 'C', u'met': u'0031043207'}

My current solution is to read the JSON file as a text file and remove the [ from the beginning and the ] from the end, creating a valid JSON object on each line rather than a nested object spanning lines.

I wonder if there is a more elegant solution. I suspect that string manipulation on the file could go wrong.

The motivation is to read JSON files into an RDD on Spark. See the related question Reading JSON with Apache Spark - `corrupt_record`.

dreftymac
LearningSlowly
    That's not valid JSON input, nor valid JSON output. You are handling *Python objects* here, not JSON serialisation. Even if your output was valid JSON, it would not be valid JSONL because you have *trailing commas*. – Martijn Pieters Aug 12 '16 at 10:01
  • Also, if the objects in the output would be valid JSON, there would be no trailing commas. –  Aug 12 '16 at 10:03

6 Answers

80

Your input appears to be a sequence of Python objects; it certainly is not a valid JSON document.

If you have a list of Python dictionaries, then all you have to do is dump each entry into a file separately, followed by a newline:

import json

with open('output.jsonl', 'w') as outfile:
    for entry in JSON_file:
        json.dump(entry, outfile)
        outfile.write('\n')

The default configuration for the json module is to output JSON without newlines embedded.

Assuming your A, B and C names are really strings, that would produce:

{"index": 1, "met": "1043205", "no": "A"}
{"index": 2, "met": "000031043206", "no": "B"}
{"index": 3, "met": "0031043207", "no": "C"}

If you started with a JSON document containing a list of entries, just parse that document first with json.load()/json.loads().
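For example, starting from a JSON document that contains a list of entries, the full round trip looks like this (a minimal sketch; the inline document string stands in for your file):

```python
import json

# A JSON document containing a list of entries (stand-in for a file's contents).
document = '[{"index": 1, "no": "A"}, {"index": 2, "no": "B"}]'

# Parse the whole document first...
entries = json.loads(document)

# ...then dump each entry on its own line.
with open('output.jsonl', 'w') as outfile:
    for entry in entries:
        json.dump(entry, outfile)
        outfile.write('\n')
```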

Martijn Pieters
61

The jsonlines package is made exactly for your use case:

import jsonlines

items = [
    {'a': 1, 'b': 2},
    {'a': 123, 'b': 456},
]
with jsonlines.open('output.jsonl', 'w') as writer:
    writer.write_all(items)

(Yes, I wrote it years after you posted your original question.)

wouter bolsterlee
10

A simple way to do this is with the jq command in your terminal.

To install jq on Debian and derivatives:

sudo apt-get install jq

CentOS and RHEL users should run:

sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install jq -y

Basic usage:

jq -c '.[]' some_json.json >> output.jsonl

If you need to handle huge files, I strongly recommend using the --stream flag, which makes jq parse the JSON content in streaming mode. Note that with --stream the input becomes a stream of path/value events, so the filter has to change as well; the usual idiom for emitting the elements of a top-level array is:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' some_json.json >> output.jsonl

But if you need to do this operation in Python, you can use bigjson, a useful library that parses the JSON in streaming mode:

pip3 install bigjson

To read a huge JSON file (in my case, it was 40 GB):

import bigjson
import json

# Reads JSON file in streaming mode
with open('input_file.json', 'rb') as f:
    json_data = bigjson.load(f)

    # Open output file
    with open('output_file.jsonl', 'w') as outfile:
        # Iterates over input json
        for data in json_data:
            # Converts json to a Python dict
            dict_data = data.to_python()

            # Saves the output to output file
            outfile.write(json.dumps(dict_data)+"\n")

If you want, try parallelizing this code to improve performance, and post the result here :)

Documentation and source code: bigjson

Peter Mortensen
Jonas Ferreira
  • As of now this answer only works with Debian and derivatives. Are there other possible installation instructions for other operating systems? – TheTechRobo the Nerd Mar 19 '21 at 15:53
  • Yes, but is a quite long, so, follow this link to install on RHEL/CentOS: https://www.cyberithub.com/how-to-install-jq-json-processor-on-rhel-centos-7-8/ – Jonas Ferreira Mar 19 '21 at 17:39
0

Note that JSON Lines output is conventionally compact. You may need to pass separators without spaces:

import json

# `rs` (the rows to write) and `DateTimeEncoder` (a custom JSONEncoder for
# datetime values) come from the surrounding code.
with open(output_file_jsonl, 'a', encoding='utf8') as json_file:
    for elem in rs:
        json_file.write(json.dumps(dict(elem), separators=(',', ':'), cls=DateTimeEncoder))
        json_file.write('\n')
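Since `rs` and `DateTimeEncoder` come from the surrounding code, here is a self-contained sketch of just the separators effect:

```python
import json

record = {"index": 1, "no": "A"}

# Default separators put a space after ',' and ':'.
default_form = json.dumps(record)
# Passing separators without spaces produces the compact JSONL form.
compact_form = json.dumps(record, separators=(',', ':'))
```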
Peter Mortensen
Laya
0

This is a variation on the accepted answer that takes into account the possibility of special symbols or a different alphabet in the JSONL file. For example, I use Cyrillic, and without the encoding and ensure_ascii parameters I get really ugly results. I thought it could be useful:

with open('output.jsonl', 'w', encoding='utf8') as outfile:
    for entry in dataset_donut:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')
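A minimal sketch of what ensure_ascii changes (the Cyrillic sample string is my own):

```python
import json

entry = {"text": "привет"}

# By default, non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(entry)
# With ensure_ascii=False, the Cyrillic text is written as-is.
readable = json.dumps(entry, ensure_ascii=False)
```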
Yana
-1

If you don't want a library, it's easy enough to do with the json module directly.

Source

[
    {"index": 1,"no": "A","met": "1043205"},
    {"index": 2,"no": "B","met": "000031043206"},
    {"index": 3,"no": "C","met": "0031043207"}
]

Code

import json

with open("test.json", 'r') as infile:
    data = json.load(infile)
    if len(data) > 0:
        print(json.dumps([t for t in data[0]]))
        for record in data:
            print(json.dumps([v for (k,v) in record.items()]))

Result

["index", "no", "met"]
[1, "A", "1043205"]
[2, "B", "000031043206"]
[3, "C", "0031043207"]
Peter Mortensen
Konchog
  • Your result is a valid JSONL in traditional CSV format, but the question clearly wants an output in key, value pairs e.g. `{"index": 1,"no": "A","met": "1043205"}`. – Banty Aug 18 '23 at 09:48