I am working with the Last.fm dataset from the Million Song Dataset. The data is available as a set of JSON-encoded text files where the keys are: track_id, artist, title, timestamp, similars and tags.
Using the similars and track_id fields, I'm trying to create a sparse adjacency matrix so that I can do further tasks with the dataset. Following is my attempt. However, it's very slow (especially to_sparse, and opening and loading all the JSON files; the slowest part is the apply function I've come up with, even after a few improvements :/ ). I'm new to pandas, and while I've already improved on my very first attempt, I'm sure some vectorisation or other methods will significantly boost the speed and efficiency.
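For reference, each file holds a single JSON object along these lines (the ids and values here are invented for illustration):

{
    "track_id": "TR00000A",
    "artist": "Some Artist",
    "title": "Some Title",
    "timestamp": "2011-08-01 00:00:00",
    "similars": [["TR00000B", 0.92], ["TR00000C", 0.41]],
    "tags": [["rock", "100"], ["indie", "55"]]
}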
import os
import json
import pandas as pd
import numpy as np
# Path to the dataset
path = "../lastfm_subset/"
# Getting list of all json files in dataset
all_files = [os.path.join(root, file)
             for root, dirs, files in os.walk(path)
             for file in files if file.endswith('.json')]
data_list = [json.load(open(file)) for file in all_files]
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)
a = pd.DataFrame(0, columns=df.index, index=df.index).to_sparse()
threshold = 0.5  # similarity cutoff (assumed value; defined elsewhere in my script)

def make_graph(adjacent):
    importance = 1 / len(adjacent['similars'])
    neighbors = list(filter(lambda x: x[1] > threshold, adjacent['similars']))
    if len(neighbors) == 0:
        return
    t_id, similarity_score = map(list, zip(*neighbors))
    a.loc[list(t_id), adjacent['track_id']] = importance
df[df['similars'].str.len() > 0].reset_index()[['track_id', 'similars']].apply(make_graph, axis=1)
I also believe that the way I read the dataset could be greatly improved and written more cleanly.
So, we just need to read the data and then make a sparse adjacency matrix from the adjacency list in an efficient manner.
The similars key holds a list of lists; each inner list is a 1x2 pair containing the track_id of a similar song and its similarity score.
As I am new to this subject, I am open to tips, suggestions and better methods available for any part of tasks like these.
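For what it's worth, this is the direction I'm considering for building the matrix: a minimal sketch using scipy.sparse.coo_matrix instead of a dense DataFrame. The threshold value and the id-to-index mapping are assumptions I've added, not fixed parts of the code above:

from scipy.sparse import coo_matrix

threshold = 0.5  # assumed similarity cutoff

# Map each track_id to a matrix row/column index.
ids = {t: i for i, t in enumerate(df.index)}

rows, cols, vals = [], [], []
for col_id, similars in df['similars'].items():
    if not similars:
        continue
    importance = 1 / len(similars)
    for row_id, score in similars:
        # Similar tracks outside this subset have no index; skip them.
        if score > threshold and row_id in ids:
            rows.append(ids[row_id])
            cols.append(ids[col_id])
            vals.append(importance)

n = len(ids)
adjacency = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()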
UPDATE 1
After taking input from the comments, here is a slightly better version, though it's still far from acceptable speeds. The good part: the apply function now works reasonably fast. However, the list comprehension that opens and loads the JSON files to build data_list is very slow. Moreover, to_sparse takes forever, so I worked without creating a sparse matrix.
import os
import json
import pandas as pd
import numpy as np
# Path to the dataset
path = "../lastfm_subset/"
# Getting list of all json files in dataset
all_files = [os.path.join(root, file)
             for root, dirs, files in os.walk(path)
             for file in files if file.endswith('.json')]
data_list = [json.load(open(file)) for file in all_files]
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)
df.loc[df['similars'].str.len() > 0, 'importance'] = 1 / df['similars'].str.len()  # Update 1: per-row importance
a = pd.DataFrame(df['importance'], columns=df.index, index=df.index)  # .to_sparse(fill_value=0)
threshold = 0.5  # similarity cutoff (assumed value; defined elsewhere in my script)

def make_graph(row):
    neighbors = list(filter(lambda x: x[1] > threshold, row['similars']))
    if len(neighbors) == 0:
        return
    t_id, similarity_score = map(list, zip(*neighbors))
    a.loc[list(t_id), row['track_id']] = row['importance']
df[df['similars'].str.len() > 0].reset_index()[['track_id', 'similars', 'importance']].apply(make_graph, axis=1)
UPDATE 2
Using a generator comprehension instead of a list comprehension:
data_list = (json.load(open(file)) for file in all_files)
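Since the bare open(file) calls above never close their file handles, a safer pattern would be a small helper generator. A minimal sketch (the name iter_json_files is mine):

def iter_json_files(paths):
    # Yield one decoded record per file, closing each file promptly.
    for path in paths:
        with open(path) as f:
            yield json.load(f)

data_list = iter_json_files(all_files)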
I'm also using ujson to speed up JSON parsing; the speed difference is evident from this question here:
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json