I'm trying to parse a huge 12 GB JSON file with almost 5 million lines (each one is an object) in Python and store it in a database. I'm using ijson and multiprocessing to make it run faster. Here is the code:

def parse(paper):
    global mydata
 
    if 'type' not in paper["venue"]:
        venue = Venues(venue_raw = paper["venue"]["raw"])
        venue.save()
    else:
        venue = Venues(venue_raw = paper["venue"]["raw"], venue_type = paper["venue"]["type"])
        venue.save()
    paper1 = Papers(paper_id = paper["id"],paper_title = paper["title"],venue = venue)
    paper1.save()
            
    paper_authors = paper["authors"]
    paper_authors_json = json.dumps(paper_authors)
    obj = ijson.items(paper_authors_json,'item')
    for author in obj:
        mydata = mydata.append({'author_id': author["id"] , 'venue_raw': venue.venue_raw, 'year' : paper["year"],'number_of_times': 1},ignore_index=True)

if __name__ == '__main__':
    p = Pool(4)
 
    filename = 'C:/Users/dintz/Documents/finaldata/dblp.v12.json'
    with open(filename,encoding='UTF-8') as infile:
        papers = ijson.items(infile, 'item')   
        for paper in papers:
            p.apply_async(parse,(paper,))

    p.close()
    p.join()

    mydata = mydata.groupby(by=['author_id','venue_raw','year'], axis=0, as_index = False).sum()
    mydata = mydata.groupby(by = ['author_id','venue_raw'], axis=0, as_index = False, group_keys = False).apply(lambda x: sum((1+x.year-x.year.min())*numpy.log10(x.number_of_times+1)))
    df = mydata.index.to_frame(index = False)
    df = pd.DataFrame({'author_id':df["author_id"],'venue_raw':df["venue_raw"],'rating':mydata.values[:,2]})
    
    for index, row in df.iterrows():
        author_id = row['author_id']
        venue = Venues.objects.get(venue_raw = row['venue_raw'])
        rating = Ratings(author_id = author_id, venue = venue, rating = row['rating'])
        rating.save()

However, I get the following error without knowing the reason (error shown in a screenshot, not reproduced here).

Can somebody help me?

  • How much is your RAM? – Tugay May 25 '21 at 17:39
  • You're probably not going to have a good time trying to use Pandas dataframes for this, if you want to be memory-frugal. – AKX May 25 '21 at 17:40
  • Also, using a `multiprocessing.Pool()` means your `mydata` will not be global across those multiple processes. – AKX May 25 '21 at 17:41
  • You should probably consider writing a script that turns your data into flat CSV/TSV files that you can import into your (apparently Django-modeled) database using the `LOAD INFILE` equivalent of your database. – AKX May 25 '21 at 17:43
  • @ΔημήτριοςΙντζελερ, it is obvious that you are running out of memory. You can't just load a 12 GB file into 8 GB of RAM; you have to find a way to parse it in parts. I suggest looking at this question: https://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python – Tugay May 25 '21 at 17:56
  • @Tugay Your link discusses using `ijson`, which is what OP is already doing. – AKX May 25 '21 at 17:57
  • There were some other options, so I left a link. – Tugay May 25 '21 at 18:06
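
To illustrate AKX's comment about `mydata`: each worker process in a `multiprocessing.Pool` operates on its own copy of module-level state, so appends made inside `parse` never reach the DataFrame in the parent process. A minimal, self-contained sketch (a plain list standing in for the DataFrame) that demonstrates this:

import multiprocessing

results = []  # module-level "global" state, like mydata in the question


def work(n):
    # Each pool worker appends to its own copy of `results`;
    # the parent process never sees these appends.
    results.append(n * n)
    return n * n


if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        returned = pool.map(work, range(5))

    print(results)   # [] -- the parent's list is untouched
    print(returned)  # [0, 1, 4, 9, 16] -- collect results via return values instead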

1 Answer


I've had to make quite a few extrapolations and assumptions, but it looks like

  • you're using Django
  • you want to populate an SQL database with venue, paper and author data
  • you want to then do some analysis using Pandas

Populating your SQL database can be done pretty neatly with something like the following.

  • I added the tqdm package so you get a progress indication.
  • This assumes there's a PaperAuthor model that links papers and authors (a hypothetical sketch of these models follows this list).
  • Unlike the original code, this will not save duplicate Venues in the database.
  • You can see I replaced `get_or_create` and `create` with stubs to make this runnable without the database models (or indeed, without Django), just having the dataset you're using available.
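
For reference, here is a hypothetical sketch of what those Django models might look like; the field types and names below are assumptions, not taken from the question, and would need to match your real schema:

from django.db import models


class Venue(models.Model):
    venue_raw = models.TextField()
    # type/size are guesses; the question never shows the model definition
    venue_type = models.CharField(max_length=64, null=True, blank=True)


class Paper(models.Model):
    paper_id = models.BigIntegerField(unique=True)
    paper_title = models.TextField()
    venue = models.ForeignKey(Venue, on_delete=models.CASCADE)


class PaperAuthor(models.Model):
    paper = models.ForeignKey(Paper, on_delete=models.CASCADE)
    author_id = models.BigIntegerField()
    year = models.IntegerField()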

On my machine, this consumes practically no memory, as the records are (or would be) dumped into the SQL database, not into an ever-growing, fragmenting dataframe in memory.

The Pandas processing is left as an exercise for the reader ;-), but I'd imagine it'd involve `pd.read_sql()` to read this preprocessed data from the database (a rough sketch follows after the code below).

import multiprocessing

import ijson
import tqdm


def get_or_create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.get_or_create(**kwargs)
    return (None, True)


def create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.create(**kwargs)
    return None


Venue = "Venue"
Paper = "Paper"
PaperAuthor = "PaperAuthor"


def parse(paper):
    # "type" is optional in the venue data, so .get() is used instead of branching
    venue_name = paper["venue"]["raw"]
    venue_type = paper["venue"].get("type")
    # get_or_create avoids saving duplicate venues
    venue, _ = get_or_create(Venue, venue_raw=venue_name, venue_type=venue_type)
    paper_obj = create(Paper, paper_id=paper["id"], paper_title=paper["title"], venue=venue)
    for author in paper["authors"]:
        create(PaperAuthor, paper=paper_obj, author_id=author["id"], year=paper["year"])


def main():
    filename = "F:/dblp.v12.json"
    # Stream records with ijson and feed them to the worker pool in chunks;
    # imap_unordered keeps only a bounded number of records in flight at a time.
    with multiprocessing.Pool() as p, open(filename, encoding="UTF-8") as infile:
        for result in tqdm.tqdm(p.imap_unordered(parse, ijson.items(infile, "item"), chunksize=64)):
            pass


if __name__ == "__main__":
    main()
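
As for that Pandas step: here is a rough, hypothetical sketch of what it might look like, assuming the records end up in SQLite via models like the ones sketched earlier. The database path, table names and column names are assumptions and would need to be adjusted to the real schema; the rating formula mirrors the one in the question.

import sqlite3

import numpy as np
import pandas as pd


def compute_ratings(db_path="db.sqlite3"):
    # db_path, table and column names are assumptions, not from the original post
    con = sqlite3.connect(db_path)
    try:
        mydata = pd.read_sql(
            "SELECT pa.author_id, p.venue_id, pa.year "
            "FROM myapp_paperauthor pa JOIN myapp_paper p ON p.id = pa.paper_id",
            con,
        )
    finally:
        con.close()

    # One row per (author, venue, year) with a publication count...
    counts = (
        mydata.groupby(["author_id", "venue_id", "year"], as_index=False)
        .size()
        .rename(columns={"size": "number_of_times"})
    )
    # ...then the same weighted-log rating the question computes.
    ratings = counts.groupby(["author_id", "venue_id"]).apply(
        lambda x: ((1 + x.year - x.year.min()) * np.log10(x.number_of_times + 1)).sum()
    )
    return ratings.reset_index(name="rating")
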
AKX
  • Unfortunately I get this error: `django.db.utils.OperationalError: database is locked`. Seems like I must run it without multiprocessing and just wait for a long time :( – Δημήτριος Ιντζελερ May 26 '21 at 18:17
  • You get `database is locked` only on SQLite databases. Try another database such as MySQL or PostgreSQL, or try running SQLite in WAL mode: https://code.djangoproject.com/ticket/24018 – AKX May 26 '21 at 18:25
  • It worked perfectly fine, but when I try to convert the database to pandas with `pd.DataFrame.from_records(Log.objects.all().values())` I get the following error: `MemoryError: Unable to allocate 227. MiB for an array with shape (14852428,) and data type complex128`. I use Windows – Δημήτριος Ιντζελερ Jul 22 '21 at 08:25
  • You need more memory, then. Alternatively, dump the records to CSV/HDF5/Parquet/... and load that into Pandas. – AKX Jul 22 '21 at 09:19
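
Regarding the `database is locked` error mentioned in the comments above: a common way to switch SQLite to WAL mode in a Django project is to run the PRAGMA whenever a connection is opened. This is only a sketch; where to register the handler (for example, in an app's `ready()`) depends on your project layout:

from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def enable_sqlite_wal(sender, connection, **kwargs):
    # WAL mode lets one writer coexist with readers, which reduces
    # (but does not eliminate) "database is locked" errors on SQLite.
    if connection.vendor == "sqlite":
        with connection.cursor() as cursor:
            cursor.execute("PRAGMA journal_mode=WAL;")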