I'm trying to parse a huge 12 GB JSON file with almost 5 million lines (each one is an object) in Python and store it in a database. I'm using ijson and multiprocessing to make it run faster. Here is the code:

def parse(paper):
    global mydata
 
    if 'type' not in paper["venue"]:
        venue = Venues(venue_raw = paper["venue"]["raw"])
        venue.save()
    else:
        venue = Venues(venue_raw = paper["venue"]["raw"], venue_type = paper["venue"]["type"])
        venue.save()
    paper1 = Papers(paper_id = paper["id"],paper_title = paper["title"],venue = venue)
    paper1.save()
            
    paper_authors = paper["authors"]
    paper_authors_json = json.dumps(paper_authors)
    obj = ijson.items(paper_authors_json,'item')
    for author in obj:
        mydata = mydata.append({'author_id': author["id"] , 'venue_raw': venue.venue_raw, 'year' : paper["year"],'number_of_times': 1},ignore_index=True)

if __name__ == '__main__':
    p = Pool(4)
 
    filename = 'C:/Users/dintz/Documents/finaldata/dblp.v12.json'
    with open(filename,encoding='UTF-8') as infile:
        papers = ijson.items(infile, 'item')   
        for paper in papers:
            p.apply_async(parse,(paper,))

    p.close()
    p.join()

    mydata = mydata.groupby(by=['author_id','venue_raw','year'], axis=0, as_index = False).sum()
    mydata = mydata.groupby(by = ['author_id','venue_raw'], axis=0, as_index = False, group_keys = False).apply(lambda x: sum((1+x.year-x.year.min())*numpy.log10(x.number_of_times+1)))
    df = mydata.index.to_frame(index = False)
    df = pd.DataFrame({'author_id':df["author_id"],'venue_raw':df["venue_raw"],'rating':mydata.values[:,2]})
    
    for index, row in df.iterrows():
        author_id = row['author_id']
        venue = Venues.objects.get(venue_raw = row['venue_raw'])
        rating = Ratings(author_id = author_id, venue = venue, rating = row['rating'])
        rating.save()

However, I get the following error without knowing the reason (error shown in a screenshot, not reproduced here).

Can somebody help me?

  • How much is your RAM? – Tugay May 25 '21 at 17:39
  • You're probably not going to have a good time trying to use Pandas dataframes for this, if you want to be memory-frugal. – AKX May 25 '21 at 17:40
  • Also, using a `multiprocessing.Pool()` means your `mydata` will not be global across those multiple processes. – AKX May 25 '21 at 17:41
  • You should probably consider writing a script that turns your data into flat CSV/TSV files that you can import into your (apparently Django-modeled) database using the `LOAD INFILE` equivalent of your database. – AKX May 25 '21 at 17:43
  • @ΔημήτριοςΙντζελερ, it is obvious that you are running out of memory. You can't just load a 12 GB file into 8 GB of RAM; you have to find a way to parse it in parts. I suggest looking at this question: https://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python – Tugay May 25 '21 at 17:56
  • @Tugay Your link discusses using `ijson`, which is what OP is already doing. – AKX May 25 '21 at 17:57
  • There were some other options, so I left a link. – Tugay May 25 '21 at 18:06
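
To illustrate AKX's comment about `mydata`: each worker process in a `multiprocessing.Pool` operates on its own copy of module-level state, so appends made inside `parse` never reach the DataFrame in the parent process. A minimal, self-contained sketch (a plain list standing in for the DataFrame) that demonstrates this:

import multiprocessing

results = []  # module-level "global" state, like mydata in the question


def work(n):
    # Each pool worker appends to its own copy of `results`;
    # the parent process never sees these appends.
    results.append(n * n)
    return n * n


if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        returned = pool.map(work, range(5))

    print(results)   # [] -- the parent's list is untouched
    print(returned)  # [0, 1, 4, 9, 16] -- collect results via return values instead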

1 Answer


I've had to make quite a few extrapolations and assumptions, but it looks like

  • you're using Django
  • you want to populate an SQL database with venue, paper and author data
  • you want to then do some analysis using Pandas

Populating your SQL database can be done pretty neatly with something like the following.

  • I added the tqdm package so you get a progress indication.
  • This assumes there's a PaperAuthor model that links papers and authors (a hypothetical sketch of these models follows this list).
  • Unlike the original code, this will not save duplicate Venues in the database.
  • You can see I replaced `get_or_create` and `create` with stubs to make this runnable without the database models (or indeed, without Django), just having the dataset you're using available.
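
For reference, here is a hypothetical sketch of what those Django models might look like; the field types and names below are assumptions, not taken from the question, and would need to match your real schema:

from django.db import models


class Venue(models.Model):
    venue_raw = models.TextField()
    # type/size are guesses; the question never shows the model definition
    venue_type = models.CharField(max_length=64, null=True, blank=True)


class Paper(models.Model):
    paper_id = models.BigIntegerField(unique=True)
    paper_title = models.TextField()
    venue = models.ForeignKey(Venue, on_delete=models.CASCADE)


class PaperAuthor(models.Model):
    paper = models.ForeignKey(Paper, on_delete=models.CASCADE)
    author_id = models.BigIntegerField()
    year = models.IntegerField()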

On my machine, this consumes practically no memory, as the records are (or would be) dumped into the SQL database, not into an ever-growing, fragmenting dataframe in memory.

The Pandas processing is left as an exercise for the reader ;-), but I'd imagine it'd involve `pd.read_sql()` to read this preprocessed data from the database (a rough sketch follows after the code below).

import multiprocessing

import ijson
import tqdm


def get_or_create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.get_or_create(**kwargs)
    return (None, True)


def create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.create(**kwargs)
    return None


Venue = "Venue"
Paper = "Paper"
PaperAuthor = "PaperAuthor"


def parse(paper):
    # "type" is optional in the venue data, so .get() is used instead of branching
    venue_name = paper["venue"]["raw"]
    venue_type = paper["venue"].get("type")
    # get_or_create avoids saving duplicate venues
    venue, _ = get_or_create(Venue, venue_raw=venue_name, venue_type=venue_type)
    paper_obj = create(Paper, paper_id=paper["id"], paper_title=paper["title"], venue=venue)
    for author in paper["authors"]:
        create(PaperAuthor, paper=paper_obj, author_id=author["id"], year=paper["year"])


def main():
    filename = "F:/dblp.v12.json"
    # Stream records with ijson and feed them to the worker pool in chunks;
    # imap_unordered keeps only a bounded number of records in flight at a time.
    with multiprocessing.Pool() as p, open(filename, encoding="UTF-8") as infile:
        for result in tqdm.tqdm(p.imap_unordered(parse, ijson.items(infile, "item"), chunksize=64)):
            pass


if __name__ == "__main__":
    main()
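
As for that Pandas step: here is a rough, hypothetical sketch of what it might look like, assuming the records end up in SQLite via models like the ones sketched earlier. The database path, table names and column names are assumptions and would need to be adjusted to the real schema; the rating formula mirrors the one in the question.

import sqlite3

import numpy as np
import pandas as pd


def compute_ratings(db_path="db.sqlite3"):
    # db_path, table and column names are assumptions, not from the original post
    con = sqlite3.connect(db_path)
    try:
        mydata = pd.read_sql(
            "SELECT pa.author_id, p.venue_id, pa.year "
            "FROM myapp_paperauthor pa JOIN myapp_paper p ON p.id = pa.paper_id",
            con,
        )
    finally:
        con.close()

    # One row per (author, venue, year) with a publication count...
    counts = (
        mydata.groupby(["author_id", "venue_id", "year"], as_index=False)
        .size()
        .rename(columns={"size": "number_of_times"})
    )
    # ...then the same weighted-log rating the question computes.
    ratings = counts.groupby(["author_id", "venue_id"]).apply(
        lambda x: ((1 + x.year - x.year.min()) * np.log10(x.number_of_times + 1)).sum()
    )
    return ratings.reset_index(name="rating")
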
AKX
  • Unfortunately I get this error: `django.db.utils.OperationalError: database is locked`. Seems like I must run it without multiprocessing and just wait for a long time :( – Δημήτριος Ιντζελερ May 26 '21 at 18:17
  • You get `database is locked` only on SQLite databases. Try another database such as MySQL or PostgreSQL, or try running SQLite in WAL mode: https://code.djangoproject.com/ticket/24018 – AKX May 26 '21 at 18:25
  • It worked perfectly fine, but when I try to convert the database to pandas with `pd.DataFrame.from_records(Log.objects.all().values())` I get the following error: `MemoryError: Unable to allocate 227. MiB for an array with shape (14852428,) and data type complex128`. I use Windows – Δημήτριος Ιντζελερ Jul 22 '21 at 08:25
  • You need more memory, then. Alternatively, dump the records to CSV/HDF5/Parquet/... and load that into Pandas. – AKX Jul 22 '21 at 09:19
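
Regarding the `database is locked` error mentioned in the comments above: a common way to switch SQLite to WAL mode in a Django project is to run the PRAGMA whenever a connection is opened. This is only a sketch; where to register the handler (for example, in an app's `ready()`) depends on your project layout:

from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def enable_sqlite_wal(sender, connection, **kwargs):
    # WAL mode lets one writer coexist with readers, which reduces
    # (but does not eliminate) "database is locked" errors on SQLite.
    if connection.vendor == "sqlite":
        with connection.cursor() as cursor:
            cursor.execute("PRAGMA journal_mode=WAL;")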