
I have a directory full of json files, like so:

json/
    checkpoint_01.json
    checkpoint_02.json
    ...
    checkpoint_100.json

where each file has thousands of json objects, dumped line by line.

{"playlist_id": "37i9dQZF1DZ06evO2dqn7O", "user_id": "spotify", "sentence": ["Lil Wayne", "Wiz Khalifa", "Imagine Dragons", "Logic", "Ty Dolla $ign", "X Ambassadors", "Machine Gun Kelly", "X Ambassadors", "Bebe Rexha", "X Ambassadors", "Jamie N Commons", "X Ambassadors", "Eminem", "X Ambassadors", "Jamie N Commons", "Skylar Grey", "X Ambassadors", "Zedd", "Logic", "X Ambassadors", "Imagine Dragons", "X Ambassadors", "Jamie N Commons", "A$AP Ferg", "X Ambassadors", "Tom Morello", "X Ambassadors", "The Knocks", "X Ambassadors"]}
{"playlist_id": "37i9dQZF1DZ06evO1A0kr6", "user_id": "spotify", "sentence": ["RY X", "ODESZA", "RY X", "Thomas Jack", "RY X", "Rhye", "RY X"]} 
(...)

I know I can combine all files into one, like so:

import glob

def combine():
    read_files = glob.glob("*.json")
    with open("merged_playlists.json", "w") as outfile:
        outfile.write('[{}]'.format(
            ','.join(open(f).read() for f in read_files)))

but at the end I need to parse one big json file, using the following script:

parser.py

"""
Passes extraction output into `word2vec`
and prints results as JSON.
"""    
from __future__ import absolute_import, unicode_literals
import json
import click    
from numpy import array as np_array    
import gensim
    
class LineGenerator(object):
    """Reads a sentence file, yields numpy array-wrapped sentences
    """

    def __init__(self, fh):
        self.fh = fh

    def __iter__(self):
        # iterate lazily; readlines() would load the whole file into memory
        for line in self.fh:
            yield np_array(json.loads(line)['sentence'])

 
def serialize_rankings(rankings):
    """Returns a JSON-encoded object representing word2vec's
    similarity output.
    """  

    return json.dumps([
        {'artist': artist, 'rel': rel}
        for (artist, rel)
        in rankings
    ])
 
@click.command()
@click.option('-i', 'input_file', type=click.File('r', encoding='utf-8'),
              required=True)
@click.option('-t', 'term', required=True)
@click.option('--min-count', type=click.INT, default=5)
@click.option('-w', 'workers', type=click.INT, default=4)
def cli(input_file, term, min_count, workers):
    # create word2vec
    model = gensim.models.Word2Vec(min_count=min_count, workers=workers)
    model.build_vocab(LineGenerator(input_file))

    try:
        similar = model.most_similar(term)
        click.echo(serialize_rankings(similar))
    except KeyError as exc:
        # really wish this was a more descriptive error
        exit('Could not parse input: {}'.format(exc))
  
if __name__ == '__main__':
    cli()

QUESTION:

So, how do I combine ALL json objects from the json/ folder into a single file, ending up with one JSON object per line?

Note: memory is an issue here, because all files amount to 4 gigabytes.

  • This answer will probably help: https://stackoverflow.com/questions/12451431/loading-and-parsing-a-json-file-with-multiple-json-objects-in-python. You read all the files and append all the lines to a dictionary; at the end you dump the dictionary to a single JSON file – Stefano Gallotti Mar 23 '19 at 03:53

1 Answer


If memory is an issue, you most likely want to use a generator to load each line on demand. The following solution assumes Python 3:

import json

# Get a list of file paths; you can do this via os.listdir or glob.glob, however you want.
my_filenames = [...]

def stream_lines(filenames):
    for name in filenames:
        with open(name) as f:
            yield from f

lines = stream_lines(my_filenames)

def stream_json_objects_while_ignoring_errors(lines):
    for line in lines:
        try:
            yield json.loads(line)
        except Exception as e:
            # skip lines that are not valid JSON
            print("ignoring invalid JSON:", e)

json_objects = stream_json_objects_while_ignoring_errors(lines)

for obj in json_objects:
    # now you can loop over the json objects without reading all the files into memory at once
    # example:
    print(obj["sentence"])

Note that I’ve left out details such as error handling and dealing with empty lines or file open failures for simplicity.
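If the end goal is still a single merged file with one JSON object per line, the same streaming idea can write the output as it goes instead of collecting anything in memory. Here is a minimal sketch along those lines (the helper name merge_to_single_file is just illustrative; the glob pattern and output filename follow the layout described in the question):

import glob
import json

def merge_to_single_file(filenames, out_path):
    """Stream every line of every input file into one output file,
    keeping only the lines that parse as JSON. Memory use stays flat
    because nothing is accumulated."""
    with open(out_path, "w", encoding="utf-8") as out:
        for name in filenames:
            with open(name, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    try:
                        obj = json.loads(line)
                    except ValueError:
                        continue  # skip malformed lines rather than failing
                    out.write(json.dumps(obj) + "\n")

merge_to_single_file(sorted(glob.glob("json/*.json")), "merged_playlists.json")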

Joel Cornett
  • Yes, I should have stated that – Joel Cornett Mar 23 '19 at 05:01
  • Could you please add one valid operation to the loop in your answer, such as printing each object's "sentence", for the sake of completeness? –  Mar 23 '19 at 05:02
  • I still don't know how you can access specific dict values from your example, but thanks –  Mar 23 '19 at 05:09
  • @outkast I’ve edited the print statement. Does that help? – Joel Cornett Mar 23 '19 at 05:12
  • kind of. I get `TypeError: list indices must be integers or slices, not str` with your example...`print(obj["sentence"])` –  Mar 23 '19 at 05:13
  • I used `my_filenames = glob.glob("*.json")` –  Mar 23 '19 at 05:15
  • That means that at least some (maybe all) of your lines contain JSON arrays, not objects. I think maybe the example file contents you provided in the question are not entirely accurate or there is some other issue with the file format. – Joel Cornett Mar 23 '19 at 05:19
  • when I `print (obj)`, at some point it breaks with `json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 1027 (char 1026)`. maybe this is the reason? if you could handle this error, I would most certainly upvote your answer. –  Mar 23 '19 at 05:21
  • By “handle” what do you mean? What is the expected behavior when there is bad or ambiguous input? – Joel Cornett Mar 23 '19 at 05:23
  • I think ignoring the whole `obj` (line) and moving to the next would be the best way to handle it. it's such a huge file that a few lines won't be missed at all –  Mar 23 '19 at 05:25
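To illustrate the object-versus-array distinction raised in the comment thread above, here is a small standalone snippet (the literal strings are made up for demonstration):

import json

# A line holding a JSON object decodes to a dict, so string keys work:
obj = json.loads('{"sentence": ["RY X", "ODESZA"]}')
print(obj["sentence"])   # ['RY X', 'ODESZA']

# A line holding a JSON array decodes to a list, which only accepts
# integer indices; obj["sentence"] on such a line raises
# "TypeError: list indices must be integers or slices, not str".
arr = json.loads('["RY X", "ODESZA"]')
print(arr[0])            # RY X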