
I am starting development with R and I am still having "beginner problems" with the language. I would like to do the following:

  1. I have a matrix (data frame := user) with ~900 columns, each named after a band (Nirvana, Green Day, Daft Punk, etc.).
  2. Each row holds a user and that user's music taste (Nirvana = 10, Green Day = 5, Daft Punk = 0).
  3. I would like to query another data frame (:= artists, with the artists' music tags) and replace each band name with its genre tag (Nirvana --> Rock, Green Day --> Rock, Daft Punk --> Techno). There are ~120 tags for music taste (120 < 900).
  4. Finally, I would like to "aggregate" the values over all columns to avoid duplicated columns. In the example from (3), with the aggregation function "SUM", the row would have only 2 entries and not 3: (Rock = 15, Techno = 0).

Any clues on how to do that with R? Thanks in advance for any help!

Data:

user: pastebin.com/4gVe004T

artists: pastebin.com/dm7weLMG

Thomas
jcdmb
  • Without actual sample data from the two data frames, you're going to get general responses to this, which will suggest using `rowSums` and looking at [this question](http://stackoverflow.com/questions/15303283/how-to-do-vlookup-and-fill-down-like-in-excel-in-r). – Thomas Jul 02 '13 at 11:01
  • Could you provide this in the form of the output from `dput(head(artists))` and `dput(head(user[,1:5]))`? – Thomas Jul 02 '13 at 11:10
  • user: http://pastebin.com/4gVe004T – jcdmb Jul 02 '13 at 11:18
  • artists: http://pastebin.com/dm7weLMG – jcdmb Jul 02 '13 at 11:19
  • How do you want to aggregate by tag? My count (from `unique(unlist(sapply(artists[,2:6],levels)))`) is that you have 505 unique tags, which means 505 columns (one per tag) to add to `user`. And tags are not mutually exclusive, so a score for a user-band pair might end up counting in the scores for multiple tag variables. Is that what you want? – Thomas Jul 02 '13 at 11:29
  • +1 for rock > techno. – Hong Ooi Jul 02 '13 at 11:33
  • `levels(artists$tag1)` gives me 121 values, for example, and I have 991 artists. I would like to replace each of those 991 artists by its tag1, i.e. I would have only 121 columns instead of 991. – jcdmb Jul 02 '13 at 11:35
  • I am surprised this has not been down-voted. Here's my data, you write me code. No. Show us you have made some effort. -1. – Simon O'Hanlon Jul 02 '13 at 11:56
  • @SimonO101: Hi Simon, you may be right, but I think you have misunderstood my question: I am not looking for a "code" solution but for some ideas on how to start my task. That's what MvG gave me. And now, with this basis, I'll write my own code. Thanks for your feedback anyway. – jcdmb Jul 02 '13 at 12:10
  • @jcdmb *Any clues on how to do that with R* generally translates into some code to illustrate an example with toy data. I assume you have already seen this famous question on [**how to make a great reproducible example**](http://stackoverflow.com/q/5963269/1478381) – Simon O'Hanlon Jul 02 '13 at 12:14

1 Answer


> I have a matrix (data frame := user) with ~900 columns, each named after a band (Nirvana, Green Day, Daft Punk, etc.).
> Each row holds a user and that user's music taste (Nirvana = 10, Green Day = 5, Daft Punk = 0).

This is the so-called “wide” format. For most tasks it is better to reshape it to narrow (long) format, i.e. a single data.frame with one row per user-band pair: one column that identifies the user, one that identifies the band, and one that holds the rating. There are several tools to do this, and several questions about it here on SO. Search for questions on reshaping in particular.

There is also a package called `reshape` which can help here. It calls the process I'm talking about “melting” the data.
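
As a minimal sketch of the melting step, assuming the `reshape2` package (the successor of `reshape`) is installed: the toy `user` data frame, its `user` id column, and the output column names `band` and `score` are all made up for illustration, since the real column names are only in the pastebin data.

```r
library(reshape2)  # assumption: reshape2 rather than the older reshape package

# Toy stand-in for the real `user` data frame (all names and values are hypothetical)
user <- data.frame(
  user        = c("u1", "u2"),
  Nirvana     = c(10, 0),
  `Green Day` = c(5, 2),
  `Daft Punk` = c(0, 8),
  check.names = FALSE,       # keep band names that contain spaces
  stringsAsFactors = FALSE
)

# One row per (user, band) pair; the rating moves into a `score` column
long <- melt(user, id.vars = "user",
             variable.name = "band", value.name = "score")

# melt() returns the former column names as a factor; plain strings merge more predictably
long$band <- as.character(long$band)
```

With ~900 band columns the same call applies unchanged, because every column not listed in `id.vars` is treated as a measured variable.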

> I would like to query another data frame (:= artists, with the artists' music tags) and replace each band name with its genre tag (Nirvana --> Rock, Green Day --> Rock, Daft Punk --> Techno). There are ~120 tags for music taste (120 < 900).

You can use `merge` to combine multiple data frames, using the band name as the merge key. This is why you want the band names to be values, not column names.
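
A rough sketch of that merge in base R, on toy data: `tag1` is the column mentioned in the comments, while the `artist` column name and the `long` data frame (the melted ratings) are assumptions made for the example.

```r
# Long-format ratings, one row per (user, band) pair (toy values)
long <- data.frame(
  user  = c("u1", "u1", "u1"),
  band  = c("Nirvana", "Green Day", "Daft Punk"),
  score = c(10, 5, 0),
  stringsAsFactors = FALSE
)

# Toy stand-in for `artists`; the `artist` column name is hypothetical
artists <- data.frame(
  artist = c("Nirvana", "Green Day", "Daft Punk"),
  tag1   = c("Rock", "Rock", "Techno"),
  stringsAsFactors = FALSE
)

# Attach each band's tag by matching long$band against artists$artist
tagged <- merge(long, artists[, c("artist", "tag1")],
                by.x = "band", by.y = "artist")
```

By default `merge` keeps only bands found in both data frames. Passing `all.x = TRUE` would keep unmatched bands with an `NA` tag instead of dropping them.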

> Finally, I would like to "aggregate" the values over all columns to avoid duplicated columns. In the example from (3), with the aggregation function "SUM", the row would have only 2 entries and not 3: (Rock = 15, Techno = 0).

When you use `reshape` to “cast” your data back to wide format, you can supply an aggregate function which is used to combine values that land in the same cell. You can use `sum` for that.
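
The casting step might look like the sketch below, again assuming `reshape2`, where the function is called `dcast`. The `tagged` data frame and its column names are hypothetical stand-ins for the merged long-format data.

```r
library(reshape2)  # assumption: reshape2 rather than the older reshape package

# Long-format ratings with the band name already replaced by its tag (toy values)
tagged <- data.frame(
  user  = c("u1", "u1", "u1", "u2", "u2", "u2"),
  tag1  = c("Rock", "Rock", "Techno", "Rock", "Rock", "Techno"),
  score = c(10, 5, 0, 0, 2, 8),
  stringsAsFactors = FALSE
)

# One row per user, one column per tag; scores sharing a (user, tag) cell are summed
wide <- dcast(tagged, user ~ tag1, value.var = "score", fun.aggregate = sum)
```

For user `u1` this gives Rock = 15 and Techno = 0, matching the example in the question.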

MvG