
How can I concatenate a list of JSON files into one huge JSON array? I have 5,000 files and 550,000 list items.

My first try was to use jq, but it looks like jq -s is not optimized for large input.

jq -s -r '[.[][]]' *.js 

This command works, but it takes way too long to complete, and I would really like to solve this with Python.

Here is my current code:

import json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(listGenerator(), f)

I'm getting:

TypeError: <generator object listGenerator at 0x7f94dc2eb3c0> is not JSON serializable

Any attempt to load all the files into RAM triggers the Linux OOM killer. Do you have any ideas?

Sebastian Wagner
  • How about just textually concatenating the documents, inserting commas in between? – bereal Feb 09 '14 at 19:30
  • You need to remove the outer array of each file. Removing the first and last character of each file should work, but I'd like to control (and remove) the JSON indentation. – Sebastian Wagner Feb 09 '14 at 19:37
  • how large are the files actually? could it be that holding the complete serialized data is larger than your memory ? – Alexander Oh Feb 09 '14 at 20:03
  • Yes, that's why calling list(..) is not going to work. – Sebastian Wagner Feb 09 '14 at 20:08
  • Do you also need to validate the JSON before processing it? If not, there is no need to convert string -> JSON -> string. Just put commas between each filestream and surround with `[]`. – Joel Cornett Jun 05 '14 at 06:28
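
A minimal sketch of the textual-concatenation idea from the comments above: strip the outer brackets of every file and join the bodies with commas. It assumes each input file contains exactly one top-level JSON array; the output name combined.json is only illustrative.

import glob

def concatTextually(outName, inFileNames):
    with open(outName, 'w') as out:
        out.write('[')
        first = True
        for inName in inFileNames:
            with open(inName) as f:
                body = f.read().strip()
            body = body[1:-1].strip()  # drop the outer '[' and ']'
            if not body:               # skip files holding an empty array
                continue
            if not first:
                out.write(',')
            out.write(body)
            first = False
        out.write(']')

concatTextually('combined.json', glob.glob('*.js'))

Note that this only strips leading and trailing whitespace; any indentation inside a file is copied through unchanged.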

5 Answers


As of simplejson 3.8.0, you can use the iterable_as_array option to make any iterable serializable into an array.

# Since simplejson is backwards compatible, you should feel free to import
# it as `json`
import simplejson as json
json.dumps((i*i for i in range(10)), iterable_as_array=True)

The result is [0, 1, 4, 9, 16, 25, 36, 49, 64, 81].
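
Applied to the original question, a sketch could look like this (assuming simplejson >= 3.8.0; the items helper and the combined.json output name are only illustrative):

import glob
import simplejson as json

def items(inFileNames):
    # yield every list item of every input file, one at a time
    for inName in inFileNames:
        with open(inName) as f:
            for item in json.load(f):
                yield item

with open('combined.json', 'w') as out:
    json.dump(items(glob.glob('*.js')), out, iterable_as_array=True)

Each file is still parsed one at a time, and dump writes the encoded chunks to the output file as they are produced, so the combined array is never built in memory as one string.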

Nick Babcock

You should derive from list and override the __iter__ method.

import json

def gen():
    yield 20
    yield 30
    yield 40

class StreamArray(list):
    def __iter__(self):
        return gen()

    # According to the comment below, __len__ has to return a value greater
    # than zero, otherwise the encoder never calls __iter__.
    def __len__(self):
        return 1

a = [1,2,3]
b = StreamArray()

print(json.dumps([1,a,b]))

The result is [1, [1, 2, 3], [20, 30, 40]].
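
For the original question, the same idea could be applied roughly like this (a sketch only: this StreamArray variant takes the generator as a constructor argument instead of the hard-coded gen() above, and the file names are illustrative):

import glob
import json

class StreamArray(list):
    def __init__(self, generator):
        self.generator = generator

    def __iter__(self):
        return self.generator

    # a lie, but non-zero, so the encoder actually calls __iter__
    def __len__(self):
        return 1

def items(inFileNames):
    for inName in inFileNames:
        with open(inName) as f:
            for item in json.load(f):
                yield item

with open('combined.json', 'w') as out:
    json.dump(StreamArray(items(glob.glob('*.js'))), out)

Because json.dump writes the encoder's chunks straight to the file, the whole output string is never held in memory.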

Vadim Pushtaev
  • With Python 2.7.8, the `StreamArray` class also has to override the `__len__` method and return a value greater than 0 (1, for instance). Otherwise the JSON encoder doesn't even call the `__iter__` method. – Tristan Mar 25 '15 at 08:56
  • Please note that this solution creates invalid JSON when used with the *indent* parameter and the iterable is "empty": `json.dumps({"products": StreamArray()}, indent=2) # {"products": ]}` – Mišo May 25 '16 at 13:26
  • I believe we should not `return 1` for length if the iterable is "empty". – Vadim Pushtaev May 25 '16 at 16:18

This universal solution is useful even for really huge data: if the result string can't fit easily in memory, it can still be written chunk by chunk to a stream from a JSON iterator. (This is better than "import simplejson ...", which can help, but not much.) Tested with Python 2.7, 3.0, 3.3, 3.6 and 3.10.0a7. Two times faster than simplejson. Small memory footprint. Unit tests included.

import itertools

class SerializableGenerator(list):
    """Generator that is serializable by JSON"""

    def __init__(self, iterable):
        tmp_body = iter(iterable)
        try:
            self._head = iter([next(tmp_body)])
            self.append(tmp_body)
        except StopIteration:
            self._head = []

    def __iter__(self):
        return itertools.chain(self._head, *self[:1])

Normal usage (little memory for the input, but the whole output string is still built in memory):

>>> json.dumps(SerializableGenerator(iter([1, 2])))
"[1, 2]"
>>> json.dumps(SerializableGenerator(iter([])))
"[]"

For really huge data it can be used as a generator of JSON chunks in Python 3 and still use very little memory:

>>> iter_json = json.JSONEncoder().iterencode(SerializableGenerator(iter(range(1000000))))
>>> for chunk in iter_json:
...     stream.write(chunk)
# or, as a naive example
>>> tuple(iter_json)
('[1', ', 2', ... ', 1000000', ']')

The class is used by a normal JSONEncoder().encode(...) internally in json.dumps(...), or by an explicit JSONEncoder().iterencode(...) to get a generator of JSON chunks instead.

(The iter() calls in the examples are not necessary for it to work; they only demonstrate a non-trivial input that has no known length.)


Test:

import unittest
import json
# from ?your_module? import SerializableGenerator 


class Test(unittest.TestCase):

    def combined_dump_assert(self, iterable, expect):
        self.assertEqual(json.dumps(SerializableGenerator(iter(iterable))), expect)

    def combined_iterencode_assert(self, iterable, expect):
        encoder = json.JSONEncoder().iterencode
        self.assertEqual(tuple(encoder(SerializableGenerator(iter(iterable)))), expect)

    def test_dump_data(self):
        self.combined_dump_assert(iter([1, "a"]), '[1, "a"]')

    def test_dump_empty(self):
        self.combined_dump_assert(iter([]), '[]')

    def test_iterencode_data(self):
        self.combined_iterencode_assert(iter([1, "a"]), ('[1', ', "a"', ']'))

    def test_iterencode_empty(self):
        self.combined_iterencode_assert(iter([]), ('[]',))

    def test_that_all_data_are_consumed(self):
        gen = SerializableGenerator(iter([1, 2]))
        list(gen)
        self.assertEqual(list(gen), [])

This solution is inspired by three older answers: Vadim Pushtaev (has a problem with an empty iterable), user1158559 (unnecessarily complicated), and Claude (in another question, also complicated).

Important differences from these solutions are:

  • Important methods __len__, __bool__ and others are inherited consistently from the list class, which is meaningfully initialized.
  • The first item of the input is evaluated immediately in __init__ (not lazily triggered by many other methods), so the list knows at once whether the iterator is empty or not: a non-empty list contains one item holding the rest of the generator, and the list stays empty if the iterator is empty.
  • The correct implementation of length for an empty iterator is important for the JSONEncoder.iterencode(...) method.
  • All other methods give a meaningful output, e.g. __repr__:
   >>> SerializableGenerator((x for x in range(3)))
   [<generator object <genexpr> at 0x........>]

An advantage of this solution is that a standard JSON serializer can be used. If nested generators need to be supported, then the simplejson solution is probably the best; it also has a similar variant with iterencode(...).
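
For the concatenation task from the question, the streaming variant could be wired up roughly like this (a sketch only: the items helper and the combined.json output name are illustrative):

import glob
import json
# from ?your_module? import SerializableGenerator

def items(inFileNames):
    for inName in inFileNames:
        with open(inName) as f:
            for item in json.load(f):
                yield item

with open('combined.json', 'w') as out:
    for chunk in json.JSONEncoder().iterencode(SerializableGenerator(items(glob.glob('*.js')))):
        out.write(chunk)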


Stub *.pyi for strong typing:

from typing import Any, Iterable, Iterator

class SerializableGenerator(list):
    def __init__(self, iterable: Iterable[Any]) -> None: ...
    def __iter__(self) -> Iterator: ...
hynekcer

Based on the accepted answer, here is the StreamArray I eventually went for. It contains two lies:

  1. The suggestion that self.__tail__ might be immutable
  2. len(StreamArray(some_gen)) is either 0 or 1


class StreamArray(list):

    def __init__(self, gen):
        self.gen = gen

    def destructure(self):
        try:
            return self.__head__, self.__tail__, self.__len__
        except AttributeError:
            try:
                self.__head__ = self.gen.__next__()
                self.__tail__ = self.gen
                self.__len__ = 1 # A lie
            except StopIteration:
                self.__head__ = None
                self.__tail__ = []
                self.__len__ = 0
            return self.__head__, self.__tail__, self.__len__

    def rebuilt_gen(self):
        def rebuilt_gen_inner():
            head, tail, len_ = self.destructure()
            if len_ > 0:
                yield head
            for elem in tail:
                yield elem
        try:
            return self.__rebuilt_gen__
        except AttributeError:
            self.__rebuilt_gen__ = rebuilt_gen_inner()
            return self.__rebuilt_gen__

    def __iter__(self):
        return self.rebuilt_gen()

    def __next__(self):
        return self.rebuilt_gen()

    def __len__(self):
        return self.destructure()[2]

Single use only!

user1158559
  • +1: Your solution works, but it is too complicated. I think I implemented the same thing more simply. Have a look at mine and tell me if you find any disadvantage compared to it. – hynekcer Oct 20 '17 at 03:24
  • Yours looks fine! For my use case, lazily evaluating the first item is a feature. In hindsight there might be some simplification to be gained from `itertools`. Very pleased to know that this works as is. – user1158559 Oct 21 '17 at 11:09

I was getting this error in a map-reduce task with mrjob. It was resolved after handling the iterator properly.

If you do not handle the iterator yielded by the mapper properly, you will get this error.
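
One guess at what "handling the iterator properly" means here, sketched with mrjob's MRJob API (the job itself is purely illustrative): the values a reducer receives come in as a generator, and yielding that generator unchanged cannot be serialized by the JSON output protocol, so it has to be consumed first.

from mrjob.job import MRJob

class MRCollect(MRJob):
    def mapper(self, _, line):
        yield "key", line

    def reducer(self, key, values):
        # yield key, values      # a generator: "is not JSON serializable"
        yield key, list(values)  # consume the iterator before yielding it

if __name__ == '__main__':
    MRCollect.run()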