
I am working with the Last.fm dataset from the Million Song Dataset. The data is available as a set of JSON-encoded text files whose keys are: track_id, artist, title, timestamp, similars and tags.

Using the similars and track_id fields, I'm trying to create a sparse adjacency matrix so that I can do further tasks with the dataset. Below is my attempt. However, it's very slow: the to_sparse call, opening and loading all the JSON files, and above all the apply function are the bottlenecks, even after a few improvements :/. I'm new to pandas and have already improved on my very first attempt, but I'm sure vectorisation or other methods could significantly boost the speed and efficiency.

import os
import json
import pandas as pd
import numpy as np

# Path to the dataset
path = "../lastfm_subset/"

# Getting the list of all JSON files in the dataset
all_files = [os.path.join(root, file)
             for root, dirs, files in os.walk(path)
             for file in files if file.endswith('.json')]

data_list = [json.load(open(file)) for file in all_files]
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)

a = pd.DataFrame(0,columns= df.index, index=df.index).to_sparse()

threshold = 0.5  # similarity cutoff; assumed value, not defined in the original snippet

def make_graph(adjacent):
    # Each of a track's outgoing edges gets equal weight.
    importance = 1 / len(adjacent['similars'])
    # Keep only the neighbours whose similarity score clears the threshold.
    neighbors = list(filter(lambda x: x[1] > threshold, adjacent['similars']))
    if len(neighbors) == 0:
        return

    # Unzip the (track_id, score) pairs into two lists.
    t_id, similarity_score = map(list, zip(*neighbors))
    a.loc[list(t_id), adjacent['track_id']] = importance


df[df['similars'].str.len() > 0].reset_index()[['track_id', 'similars']].apply(make_graph, axis=1)

I also believe that the way I read the dataset could be greatly improved and written more cleanly.

So, we just need to read the data and then make a sparse adjacency matrix from the adjacency list in an efficient manner.

The similars key holds a list of lists. Each inner list is a 1x2 pair: the track_id of a similar song and its similarity score.
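For concreteness, a single record looks roughly like this (the values below are invented for illustration):

{
    "track_id": "TR...",
    "artist": "...",
    "title": "...",
    "timestamp": "...",
    "similars": [["TR...", 0.92], ["TR...", 0.41]],
    "tags": [...]
}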

As I am new to this subject, I am open to tips, suggestions and better methods for any part of tasks like these.
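To make the goal concrete, here is the kind of construction I have in mind: skip the dense intermediate entirely and build a scipy.sparse.coo_matrix from (row, col, value) triplets. This is only a sketch, untested at full scale; the function name build_adjacency, the threshold default and the in-dataset filter are my own, and it assumes df is indexed by track_id as above.

import scipy.sparse as sp

def build_adjacency(df, threshold=0.5):
    # Map every track_id in the dataset to a row/column index.
    idx = {t: i for i, t in enumerate(df.index)}
    rows, cols, vals = [], [], []
    for col, similars in enumerate(df['similars']):
        if not similars:
            continue
        importance = 1 / len(similars)
        for t_id, score in similars:
            # Keep edges above the cutoff whose endpoint is in this subset.
            if score > threshold and t_id in idx:
                rows.append(idx[t_id])
                cols.append(col)
                vals.append(importance)
    n = len(df)
    # The triplets go straight into a COO matrix; no dense step at all.
    return sp.coo_matrix((vals, (rows, cols)), shape=(n, n))

The COO result can then be converted with .tocsr() if fast row slicing is needed later.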

UPDATE 1

After taking input from the comments, here is a slightly better version, though it's still far from acceptable speeds. The good part: the apply function now works reasonably fast. However, the list comprehension that opens and loads the JSON files to build data_list is very slow. Moreover, to_sparse takes forever, so I worked without creating a sparse matrix.

import os
import json
import pandas as pd
import numpy as np

# Path to the dataset
path = "../lastfm_subset/"

# Getting the list of all JSON files in the dataset
all_files = [os.path.join(root, file)
             for root, dirs, files in os.walk(path)
             for file in files if file.endswith('.json')]

data_list = [json.load(open(file)) for file in all_files]
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)
df.loc[df['similars'].str.len() > 0, 'importance'] = 1 / df['similars'].str.len()  # Update 1: per-track weight, not 1/len(df)


a = pd.DataFrame(df['importance'], columns=df.index, index=df.index)  # .to_sparse(fill_value=0)

def make_graph(row):
    # threshold as defined above: keep only sufficiently similar neighbours.
    neighbors = list(filter(lambda x: x[1] > threshold, row['similars']))
    if len(neighbors) == 0:
        return

    t_id, similarity_score = map(list, zip(*neighbors))
    a.loc[list(t_id), row['track_id']] = row['importance']


df[df['similars'].str.len() > 0].reset_index()[['track_id', 'similars', 'importance']].apply(make_graph, axis=1)
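For the slow JSON-loading step, one thing I still plan to try is parallelising the file reads across processes. A sketch (load_file is my own helper; json is whichever parser gets imported, see the fallback in Update 2 below):

from multiprocessing import Pool

def load_file(path):
    # Each worker parses one file; the with-block closes the handle promptly.
    with open(path) as f:
        return json.load(f)

if __name__ == '__main__':
    with Pool() as pool:
        data_list = pool.map(load_file, all_files)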

UPDATE 2

Using a generator expression instead of a list comprehension:

data_list = (json.load(open(file)) for file in all_files)

I'm also using ujson to speed up JSON parsing, following the benchmarks in this question here:

try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json
  • 1) Where does the df `adjacent` come from? 2) this line: `a = pd.DataFrame(0,columns= df.index, index=df.index).to_sparse()` has lots of problems: `DataFrame.data` should usually be a numpy array, `columns` probably shouldn't be `index`s, and most importantly if you're filling with zeros you need `.to_sparse(fill_value=0)` or you aren't actually creating a sparse dataframe. – Daniel F Jan 13 '17 at 07:53
  • Also, looking at your `apply` I think there are many ways to improve it. Parts of it can be handled by simpler operations, which are going to be likely faster. For instance, my hunch is that you can replace the first line of your apply by simply doing something like `df["importance"] = 1/len(df['similars'])`. – Mikk Jan 13 '17 at 10:48
  • @DanielForsman, the df `adjacent` is passed from the apply function. Also, I'm making an Adjacency matrix, that's why I'm making `columns` as `index`s. – TJain Jan 13 '17 at 12:34
  • @DanielForsman and Mikk, check out the updated version. And thank you for the tip, Mikk. – TJain Jan 13 '17 at 13:38
  • I don't think you can do anything about the `data_list` step, unless someone can point you to a faster third party `json` decoder. Or replace the `json` files with `csv` ones. – hpaulj Jan 13 '17 at 18:13
  • That `a.loc[...] = ...` step is still likely to be slow if `a` is a sparse dataframe. I know that adding nonzero values to a `scipy` sparse matrix is quite slow. It has to change the sparse indexing as well as the value. – hpaulj Jan 13 '17 at 18:20
  • @hpaulj, check the second update. – TJain Jan 13 '17 at 18:24
  • is it possible to do this efficiently in numpy sparse matrices ? – TJain Jan 13 '17 at 23:27
  • I don't think Pandas is giving you much benefit here. Have you considered making a sparse "array" with just a regular `dict`? Also, have you tried to use the sqlite database on your first linked page? It should offer better performance than the separate json files. – user7138814 Jan 16 '17 at 16:13
