I am trying to read two .dat files and build a dictionary that uses the value of aid2name as its key and the key and value of aid2numplays as its values. The goal is output in the form (artist name, artist id, frequency of plays). The first file provides artist name and artist id, while the second provides user id, artist id, and frequency per user. Any ideas how to aggregate those frequencies by user and then display them in the (artist name, artist id, frequency of plays) format? Below is what I have managed so far:

import codecs
aid2name = {}
d2 = {}
fp = codecs.open("artists.dat", encoding = "utf-8")
fp.readline()  #skip first line of headers
for line in fp:
    line = line.strip()
    fields = line.split('\t')
    aid = int(fields[0])
    name = fields[1]
    aid2name = {int(aid), name}
    d2.setdefault(fields[1], {})
    #print (aid2name)
# do other processing
    #print(dictionary)

aid2numplays = {}
fp = codecs.open("user_artists.dat", encoding = "utf-8")
fp.readline()  #skip first line of headers
for line in fp:
    line = line.strip()
    fields = line.split('\t')
    uid = int(fields[0])
    aid = int(fields[1])
    weight = int(fields[2])
    aid2numplays = [int(aid), int(weight)]
    #print(aid2numplays)
    #print(uid, aid, weight)

for (d2.fields[1], value) in d2:
    group = d2.setdefault(d2.fields[1], {}) # key might exist already
    group.append(aid2numplays)

print(group)
  • It might help to see an example of what the final data structure should look like. I'm not certain how you intend to use [setdefault](http://stackoverflow.com/questions/3483520/use-cases-for-the-setdefault-dict-method) – brennan Apr 11 '17 at 21:06

1 Answer

Edit: Regarding the use of setdefault, if you wanted to group the user data by artistID then you could:

grouped_data = {}
for u in users:
    k, v = u[1], {'userID': u[0], 'weight': u[2]}
    grouped_data.setdefault(k, []).append(v)

This is essentially the same as writing:

grouped_data = {}
for u in users:
    k, v = u[1], {'userID': u[0], 'weight': u[2]}
    if k in grouped_data:
        grouped_data[k].append(v)
    else:
        grouped_data[k] = [v]
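
A third spelling of the same grouping uses collections.defaultdict from the standard library. This is only a sketch: the sample rows below are made up for illustration (they are not taken from the dataset), and the field order userID, artistID, weight is assumed to match the snippets above.

```python
from collections import defaultdict

# Hypothetical sample rows in (userID, artistID, weight) order.
users = [['2', '51', '13883'], ['3', '51', '100'], ['3', '52', '50']]

# Missing keys start out as an empty list, so no setdefault or if/else is needed.
grouped_data = defaultdict(list)
for user_id, artist_id, weight in users:
    grouped_data[artist_id].append({'userID': user_id, 'weight': weight})
```

Note that a defaultdict creates an entry on any lookup of a missing key, so convert it with dict(grouped_data) if that side effect is unwanted later on.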

As an example of how to count the number of times an artist appears in different users' data, you could read the data into lists of lists:

import codecs

with codecs.open("artists.dat", encoding = "utf-8") as f:
    artists = f.readlines()

with codecs.open("user_artists.dat", encoding = "utf-8") as f:
    users = f.readlines()

artists = [x.strip().split('\t') for x in artists][1:]  # [['1', 'MALICE MIZER', ..
users = [x.strip().split('\t') for x in users][1:]  # [['2', '51', '13883'], ..]

Iterate over artists creating a dictionary using the artistID as a key. Add a placeholder for the play stats.

data = {}
for a in artists:
    artistID, name = a[0], a[1]
    data[artistID] = {'name': name, 'plays': 0}

Iterate over users updating the dictionary with each row:

for u in users:
    artistID = u[1]
    data[artistID]['plays'] += 1

Output for data:

{'1': {'name': 'MALICE MIZER', 'plays': 3},
 '2': {'name': 'Diary of Dreams', 'plays': 12},
 '3': {'name': 'Carpathian Forest', 'plays': 3},  ..}

Edit: To iterate over the user data and create a dictionary of all the artists associated with a user we could:

# artists and users were already split into header-free rows above,
# so they can be reused directly instead of being parsed a second time
artist_list = artists
user_stats_list = users

artists = {}
for a in artist_list:
    artistID, name = a[0], a[1]
    artists[artistID] = name

grouped_user_stats = {}
for u in user_stats_list:
    userID, artistID, weight = u
    if userID not in grouped_user_stats:
        grouped_user_stats[userID] = { artistID: {'name': artists[artistID], 'plays': 1} }
    else:
        if artistID not in grouped_user_stats[userID]:
            grouped_user_stats[userID][artistID] = {'name': artists[artistID], 'plays': 1}
        else:
            grouped_user_stats[userID][artistID]['plays'] += 1
            print('this never happens')
            # it looks like the same artist is never listed twice for the same user

Output:

{'2': {'100': {'name': 'ABC', 'plays': 1},
       '51': {'name': 'Duran Duran', 'plays': 1},
       '52': {'name': 'Morcheeba', 'plays': 1},
       '53': {'name': 'Air', 'plays': 1}, .. }, 
 ..
}
brennan
  • To finally display them in (artist name, artist id, frequency of plays) format, use `[{'id': k, **v} for k, v in data.items()]` for a list of dictionaries or `[(v['name'], k, v['plays']) for k, v in data.items()]` for a list of tuples. – mab Apr 11 '17 at 20:02
  • Thank you both. I had been looking at it from a different perspective but I feel like it's clicking now. Bren, just out of curiosity, how did you know details about outputs without the files? – pythonuser890 Apr 12 '17 at 03:14
  • Additionally, I was wondering how to aggregate the total plays for each artist. I've been struggling to add up the plays by userID and then display them as an aggregate. Any tips or advice would be really helpful. Thank you so much for the earlier tips. – pythonuser890 Apr 12 '17 at 03:32
  • `data[artistID]['plays'] += 1` adds one to an artist's play count for each row in user_artists.dat with that artistID. Found the data here: https://github.com/tokosa-sub/R/tree/master/LastFM/dataset – brennan Apr 12 '17 at 04:44
  • Oh cool, had no idea it was hosted there. The program is supposed to aggregate the play counts. So while User 1 may listen to Artist A 5 times, User 2 listened to Artist A 3 times, and User 56 listened to Artist A 10 times, my output should be: Artist A (artistid #) 18. – pythonuser890 Apr 12 '17 at 16:46
  • I see, aggregated by user not artist – brennan Apr 12 '17 at 18:22
  • Well the output total is the number of aggregated plays for each artist from the pool of users but I was unsure of how to write a loop for it or using get() – pythonuser890 Apr 12 '17 at 19:03
  • Edited: so maybe this is not a play list, this appears to be instead a user's list of artists by weighted preference – brennan Apr 12 '17 at 19:47
  • Either way I think we answered the setdefault question. I actually prefer not to use it. I think it's less readable than using if:else or defaultdict. – brennan Apr 12 '17 at 19:53
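
The aggregation asked about in the comments (total plays per artist, summed across all users) can be sketched with dict.get(). This is one possible approach, not the answer's own code. The rows below are hypothetical and follow the example from the comments (5 + 3 + 10 plays for Artist A); "Artist B" and its numbers are invented for illustration. The names aid2name and aid2numplays are borrowed from the question.

```python
# Hypothetical sample rows standing in for the parsed files (not real dataset values).
artists = [['1', 'Artist A'], ['2', 'Artist B']]                                  # artistID, name
users = [['1', '1', '5'], ['2', '1', '3'], ['56', '1', '10'], ['2', '2', '7']]    # userID, artistID, weight

# Map each artist id to its name.
aid2name = {int(aid): name for aid, name in artists}

# Sum the weight column per artist; get() supplies 0 the first time an id is seen.
aid2numplays = {}
for uid, aid, weight in users:
    aid = int(aid)
    aid2numplays[aid] = aid2numplays.get(aid, 0) + int(weight)

# Build (artist name, artist id, total plays) tuples, ordered by artist id.
report = [(aid2name[aid], aid, plays) for aid, plays in sorted(aid2numplays.items())]
for row in report:
    print(row)
```

With the real files, the two sample lists would be replaced by the parsed rows from artists.dat and user_artists.dat, taking only the first two fields of each artist row if the file has more columns.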