
This is not about reading large JSON files; instead, it is about reading a large number of JSON files in the most efficient way.

Question

I am working with the Last.fm dataset from the Million Song Dataset. The data is available as a set of JSON-encoded text files where the keys are: track_id, artist, title, timestamp, similars and tags.

Currently I'm reading them into pandas in the following way, after going through a few options; this was the fastest, as shown here:

import os
import pandas as pd
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json


# Path to the dataset
path = "../lastfm_train/"

# Getting a list of all JSON files in the dataset
all_files = [os.path.join(root, file)
             for root, dirs, files in os.walk(path)
             for file in files if file.endswith('.json')]

data_list = [json.load(open(file)) for file in all_files]
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)

The current method reads the subset (1% of the full dataset) in less than a second. However, reading the full train set is far too slow; it takes forever (I have waited a couple of hours) and has become a bottleneck for further tasks, such as the one shown in the question here.

I'm also using ujson to speed up parsing of the JSON files, as can be seen from this question here.

UPDATE 1: Using a generator expression instead of a list comprehension.

data_list = (json.load(open(file)) for file in all_files)
TJain
  • It sounds like you don't have enough RAM and it has started using swap; the easiest solution would be to get more RAM. – Francisco Jan 13 '17 at 16:02
  • @FranciscoCouzo Easiest maybe.. but definitely not the cheapest! :D – dizzyf Jan 13 '17 at 16:14
  • I already have 16 GB and I think that's definitely enough for reading a normal dataset like this. Is there a better way of reading so many JSON files? – TJain Jan 13 '17 at 16:15
  • Reading in all the JSONs one-by-one, then writing the columns you want out to a CSV, would be borderline trivial, take virtually no RAM, and simplify the format so pandas, NumPy, or whatever else could read it en masse in a probably more efficient manner. – Nick T Jan 13 '17 at 16:27
  • I recommend you have a look at [dask](http://dask.pydata.org/en/latest/) as it can fit your needs well – Zeugma Jan 13 '17 at 16:46
  • From http://labrosa.ee.columbia.edu/millionsong/lastfm: "Because going over JSON files is inefficient, and most people will only work on similarity or tags, **we provide two SQLite databases** with the data." Download that, then `pd.read_sql('SELECT similars, track_id FROM ...', ...)`. – Nick T Jan 13 '17 at 17:24 (a sketch of this approach follows these comments)
  • What is the total size of all your JSON files? – MaxU - stand with Ukraine Jan 13 '17 at 17:42
  • @MaxU: The total size of the JSONs is 2.5 GB. – TJain Jan 13 '17 at 17:53
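
For completeness, a minimal sketch of the SQLite route suggested in the comments above. The file name lastfm_similars.db and the table/column names (similars_src, tid, target) are assumptions about the downloaded database rather than anything stated in this thread; check the actual schema (e.g. with SELECT name FROM sqlite_master) before relying on them.

import sqlite3
import pandas as pd

# Assumed file name and schema; verify against the downloaded database.
conn = sqlite3.connect('lastfm_similars.db')
df = pd.read_sql('SELECT tid AS track_id, target AS similars FROM similars_src', conn)
conn.close()

df.set_index('track_id', inplace=True)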

2 Answers


If you need to read and write the dataset multiple times, you could try converting the .json files into a faster on-disk format once, up front. For example, in pandas 0.20+ you could use the Feather format (`DataFrame.to_feather` / `pd.read_feather`).
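
A minimal sketch of that idea, assuming df has already been built once as in the question. The file name lastfm_train.feather is made up, and serializing the nested similars lists as JSON strings is a workaround assumption, since Feather may not accept list-valued object columns directly.

import json
import pandas as pd

# One-off conversion: write the already-built DataFrame out to Feather.
# to_feather() needs a default integer index, hence reset_index().
flat = df.reset_index()
# Assumption: the nested lists may not serialize as-is, so store them as JSON strings.
flat['similars'] = flat['similars'].apply(json.dumps)
flat.to_feather('lastfm_train.feather')  # hypothetical output path

# Later runs can then skip the per-file JSON parsing entirely:
df = pd.read_feather('lastfm_train.feather')
df['similars'] = df['similars'].apply(json.loads)
df.set_index('track_id', inplace=True)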

tom

I would build an iterator over the files and yield just the two columns you want.

Then you can instantiate a DataFrame with that iterator.

import os
import json
import pandas as pd

# Path to the dataset
path = "../lastfm_train/"

def data_iterator(path):
    # Walk the dataset tree, parse each JSON file, and yield only the two
    # fields we need, so the full records are never kept in memory.
    for root, dirs, files in os.walk(path):
        for f in files:
            if f.endswith('.json'):
                fp = os.path.join(root, f)
                with open(fp) as o:
                    data = json.load(o)
                yield {"similars": data["similars"], "track_id": data["track_id"]}


df = pd.DataFrame(data_iterator(path))
df.set_index('track_id', inplace=True)

This way you only go over the file list once, and you don't duplicate the data in memory before and after passing it to the DataFrame constructor.

arthur
  • Please see the update; I'm already using a generator expression. – TJain Jan 13 '17 at 17:53
  • Using the code I posted (with the native `json` package), I loaded the full train set in `104 seconds`. – arthur Jan 16 '17 at 11:03
  • My code as well as your code loads the subset (1%) in 1.04 seconds, so I expected the full set to load in around 100s. However, on my machine your code takes 406s and my code takes 473s. I'm using a 2015 MacBook Pro with 16 GB RAM. What beast are you using that scales to the full dataset perfectly? – TJain Jan 17 '17 at 19:56
  • :-) I work on a dedicated server like this one : https://www.ovh.co.uk/dedicated_servers/details-servers-range-BIGDATA-id-2016-HG-1.xml – arthur Jan 18 '17 at 10:19
  • Seriously? I hope you're kidding. – TJain Jan 18 '17 at 15:42