52

I want to convert a standard JSON file to one in which each line contains a separate, self-contained, valid JSON object. See JSON Lines.

JSON_file =

[{u'index': 1,
  u'no': 'A',
  u'met': u'1043205'},
 {u'index': 2,
  u'no': 'B',
  u'met': u'000031043206'},
 {u'index': 3,
  u'no': 'C',
  u'met': u'0031043207'}]

To JSONL:

{u'index': 1, u'no': 'A', u'met': u'1043205'}
{u'index': 2, u'no': 'B', u'met': u'000031043206'}
{u'index': 3, u'no': 'C', u'met': u'0031043207'}

My current solution is to read the JSON file as a text file and remove the [ from the beginning and the ] from the end, creating a valid JSON object on each line rather than a nested object spanning lines.

I wonder if there is a more elegant solution. I suspect that string manipulation on the file could go wrong.

The motivation is to read JSON files into an RDD on Spark. See the related question Reading JSON with Apache Spark - `corrupt_record`.

dreftymac
LearningSlowly
    That's not valid JSON input, nor valid JSON output. You are handling *Python objects* here, not JSON serialisation. Even if your output was valid JSON, it would not be valid JSONL because you have *trailing commas*. – Martijn Pieters Aug 12 '16 at 10:01
  • Also, if the objects in the output would be valid JSON, there would be no trailing commas. –  Aug 12 '16 at 10:03

6 Answers

80

Your input appears to be a sequence of Python objects; it certainly is not a valid JSON document.

If you have a list of Python dictionaries, then all you have to do is dump each entry into a file separately, followed by a newline:

import json

with open('output.jsonl', 'w') as outfile:
    for entry in JSON_file:
        json.dump(entry, outfile)
        outfile.write('\n')

The default configuration for the json module is to output JSON without newlines embedded.

Assuming your A, B and C names are really strings, that would produce:

{"index": 1, "met": "1043205", "no": "A"}
{"index": 2, "met": "000031043206", "no": "B"}
{"index": 3, "met": "0031043207", "no": "C"}

If you started with a JSON document containing a list of entries, just parse that document first with json.load()/json.loads().
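For example, starting from a JSON document that contains a list of entries, the full round trip looks like this (a minimal sketch; the inline document string stands in for your file):

```python
import json

# A JSON document containing a list of entries (stand-in for a file's contents).
document = '[{"index": 1, "no": "A"}, {"index": 2, "no": "B"}]'

# Parse the whole document first...
entries = json.loads(document)

# ...then dump each entry on its own line.
with open('output.jsonl', 'w') as outfile:
    for entry in entries:
        json.dump(entry, outfile)
        outfile.write('\n')
```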

Martijn Pieters
61

The jsonlines package is made exactly for your use case:

import jsonlines

items = [
    {'a': 1, 'b': 2},
    {'a': 123, 'b': 456},
]
with jsonlines.open('output.jsonl', 'w') as writer:
    writer.write_all(items)

(Yes, I wrote it years after you posted your original question.)

wouter bolsterlee
10

A simple way to do this is with the jq command in your terminal.

To install jq on Debian and derivatives:

sudo apt-get install jq

CentOS and RHEL users should run:

sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install jq -y

Basic usage:

jq -c '.[]' some_json.json >> output.jsonl

If you need to handle huge files, I strongly recommend using the --stream flag, which makes jq parse the JSON content in streaming mode. Note that with --stream the input becomes a stream of path/value events, so the filter has to change as well; the usual idiom for emitting the elements of a top-level array is:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' some_json.json >> output.jsonl

But if you need to do this operation in Python, you can use bigjson, a useful library that parses the JSON in streaming mode:

pip3 install bigjson

To read a huge JSON file (in my case, it was 40 GB):

import bigjson
import json

# Reads JSON file in streaming mode
with open('input_file.json', 'rb') as f:
    json_data = bigjson.load(f)

    # Open output file
    with open('output_file.jsonl', 'w') as outfile:
        # Iterates over input json
        for data in json_data:
            # Converts json to a Python dict
            dict_data = data.to_python()

            # Saves the output to output file
            outfile.write(json.dumps(dict_data)+"\n")

If you want, try parallelizing this code to improve performance, and post the result here :)

Documentation and source code: bigjson

Peter Mortensen
Jonas Ferreira
  • As of now this answer only works with Debian and derivatives. Are there other possible installation instructions for other operating systems? – TheTechRobo the Nerd Mar 19 '21 at 15:53
  • Yes, but is a quite long, so, follow this link to install on RHEL/CentOS: https://www.cyberithub.com/how-to-install-jq-json-processor-on-rhel-centos-7-8/ – Jonas Ferreira Mar 19 '21 at 17:39
0

Note that JSON Lines output is conventionally compact. You may need to pass separators without spaces:

import json

# `rs` (the rows to write) and `DateTimeEncoder` (a custom JSONEncoder for
# datetime values) come from the surrounding code.
with open(output_file_jsonl, 'a', encoding='utf8') as json_file:
    for elem in rs:
        json_file.write(json.dumps(dict(elem), separators=(',', ':'), cls=DateTimeEncoder))
        json_file.write('\n')
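Since `rs` and `DateTimeEncoder` come from the surrounding code, here is a self-contained sketch of just the separators effect:

```python
import json

record = {"index": 1, "no": "A"}

# Default separators put a space after ',' and ':'.
default_form = json.dumps(record)
# Passing separators without spaces produces the compact JSONL form.
compact_form = json.dumps(record, separators=(',', ':'))
```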
Peter Mortensen
Laya
0

This is a variation on the accepted answer that takes into account the possibility of special symbols or a different alphabet in the JSONL file. For example, I use Cyrillic, and without the encoding and ensure_ascii parameters I get really ugly results. I thought it could be useful:

with open('output.jsonl', 'w', encoding='utf8') as outfile:
    for entry in dataset_donut:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')
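A minimal sketch of what ensure_ascii changes (the Cyrillic sample string is my own):

```python
import json

entry = {"text": "привет"}

# By default, non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(entry)
# With ensure_ascii=False, the Cyrillic text is written as-is.
readable = json.dumps(entry, ensure_ascii=False)
```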
Yana
-1

If you don't want a library, it's easy enough to do with the json module directly.

Source

[
    {"index": 1,"no": "A","met": "1043205"},
    {"index": 2,"no": "B","met": "000031043206"},
    {"index": 3,"no": "C","met": "0031043207"}
]

Code

import json

with open("test.json", 'r') as infile:
    data = json.load(infile)
    if len(data) > 0:
        print(json.dumps([t for t in data[0]]))
        for record in data:
            print(json.dumps([v for (k,v) in record.items()]))

Result

["index", "no", "met"]
[1, "A", "1043205"]
[2, "B", "000031043206"]
[3, "C", "0031043207"]
Peter Mortensen
Konchog
  • Your result is a valid JSONL in traditional CSV format, but the question clearly wants an output in key, value pairs e.g. `{"index": 1,"no": "A","met": "1043205"}`. – Banty Aug 18 '23 at 09:48