
I have data like this:

object category country
495647 1        RUS  
477462 2        GER  
431567 3        USA  
449136 1        RUS  
367260 1        USA  
495649 1        RUS  
477461 3        GER  
431562 3        USA  
449133 2        RUS  
367264 2        USA  
...

where one object appears in various (category, country) pairs and countries share a single list of categories.

I'd like to add another column to that: a category weight per country - the number of objects appearing in each (category, country) pair, normalized to sum up to 1 within a country (summing only over unique (category, country) pairs).

I could do something like:

aggregate(df$object, list(df$category, df$country), length)

and then calculate the weight from there, but what's a more efficient and elegant way of doing that directly on the original data?
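Spelled out, that route would be something like the sketch below (the example rows are inlined as df, and the helper column names are mine). Part of why I'm asking is that merge reorders the rows, so the original order has to be restored separately.

```r
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)

# count rows per unique (category, country) pair
counts <- aggregate(df$object, list(category = df$category, country = df$country), length)
# normalize so the unique-pair weights sum to 1 within each country
counts$weight <- ave(counts$x, counts$country, FUN = function(x) x / sum(x))
# merge the per-pair weight back onto the original rows
df <- merge(df, counts[, c("category", "country", "weight")])
```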

Desired example output:

object category country weight
495647 1        RUS     .75
477462 2        GER     .5 
431567 3        USA     .5 
449136 1        RUS     .75
367260 1        USA     .25
495649 1        RUS     .75
477461 3        GER     .5
431562 3        USA     .5
449133 2        RUS     .25
367264 2        USA     .25
...

The above would sum up to one within country for unique (category, country) pairs.
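For reference, here is the example data as a data frame (a sketch; object 477461 is given category 3, matching the desired output above):

```r
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)
```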

nikola
  • probably belongs on stackoverflow – Peter Ellis Jun 18 '12 at 10:30
  • Some expected output data would be nice; e.g., to clarify whether the result for rows 1 and 4 should both be 50% or both be 100%. You ask for the category weight by country, directly on the original data, so that seems to imply the latter? But then it won't sum to 1 within country, iiuc. – Matt Dowle Jun 18 '12 at 11:53
  • thanks Matthew, you are quite right - it should be 100% (or 1) for rows 1 and 4. the weight should sum up to 1 within a country only accounting for unique (category, country) pairs - i'm editing the question. – nikola Jun 18 '12 at 12:09

3 Answers


Responding specifically with the final sentence in mind - "what's a more efficient and elegant way of doing that directly on the original data?" - it just so happens that data.table has a new feature for this.

install.packages("data.table", repos="http://R-Forge.R-project.org")
# Needs version 1.8.1 from R-Forge.  Soon to be released to CRAN.

With your data in DT:

> DT[, countcat:=.N, by=list(country,category)]     # add 'countcat' column
    category country countcat
 1:        1     RUS        3
 2:        2     GER        1
 3:        3     USA        2
 4:        1     RUS        3
 5:        1     USA        1
 6:        1     RUS        3
 7:        3     GER        1
 8:        3     USA        2
 9:        2     RUS        1
10:        2     USA        1

> DT[, weight:=countcat/.N, by=country]     # add 'weight' column
    category country countcat weight
 1:        1     RUS        3   0.75
 2:        2     GER        1   0.50
 3:        3     USA        2   0.50
 4:        1     RUS        3   0.75
 5:        1     USA        1   0.25
 6:        1     RUS        3   0.75
 7:        3     GER        1   0.50
 8:        3     USA        2   0.50
 9:        2     RUS        1   0.25
10:        2     USA        1   0.25

:= adds a column by reference to the data and is an 'old' feature. The new feature is that it now works by group. .N is a symbol that holds the number of rows in each group.

These operations are memory efficient and should scale to large data; e.g., 1e8, 1e9 rows.

If you don't wish to include the intermediate column countcat, just remove it afterwards. Again, this is an efficient operation which works instantly regardless of the size of the table (by moving pointers internally).

> DT[,countcat:=NULL]     # remove 'countcat' column
    category country weight
 1:        1     RUS   0.75
 2:        2     GER   0.50
 3:        3     USA   0.50
 4:        1     RUS   0.75
 5:        1     USA   0.25
 6:        1     RUS   0.75
 7:        3     GER   0.50
 8:        3     USA   0.50
 9:        2     RUS   0.25
10:        2     USA   0.25
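Since := returns the data.table invisibly, the three steps can also be chained into a single expression. This is a sketch, assuming data.table 1.8.1+ and the example data loaded into DT:

```r
library(data.table)  # needs 1.8.1+ for := by group

DT <- data.table(
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)

# count per (country, category), weight per country, then drop the helper column
DT[, countcat := .N, by = list(country, category)
   ][, weight := countcat / .N, by = country
   ][, countcat := NULL]
```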
Matt Dowle
  • it is indeed elegant and efficient solution, thank you! One note: I wasn't familiar with `data.table` package and found that 1./ the R-project.org repository requires R > 2.15 and 2./ DT must be a `data.table` - so I had to do `DT = data.table(DT)` on my original data frame. – nikola Jun 18 '12 at 15:05
  • As an infrequent user of data.table, I am sometimes confused by the different ways to accomplish things and I often have to look up things. Are there any plans to build a more comprehensive documentation that will replace the FAQ that constitutes the vignette at the moment? – ilprincipe Jun 18 '12 at 15:11
  • @ilprincipe I know what you mean and fully agree. There are no plans though because we don't know what it should look like! What would you like? Would a wiki with common use cases be good? Any help you can provide _what_ is needed and _how_ would be really appreciated. If you could sketch out or start the format that you'd like I'll gladly fill it in. A new page on the existing data.table wiki? Somewhere else? Wiki format or something that is reproducible? We need help and guidance on this please. Even a list of the common idioms as you see them would be a great start. – Matt Dowle Jun 18 '12 at 16:26
  • @nikola Apologies, I should have mentioned that to you. Well done for working it out. Btw, `as.data.table` is usually faster than `data.table` to convert, but we'll change it so you don't need to know that for convenience. Or just start with `data.table` in the first place. – Matt Dowle Jun 18 '12 at 16:30
  • I guess an extension of the Wiki format would be desirable. As a first step, it could be a unique address that lists _all_ idioms and a more or less complete list of common usage cases. A proper vignette/intro should imho focus more on introducing data.table and its usage by itself, and to a lesser degree on its comparison with data frames and computational speed. I know speed is a selling point, but it should be separated more. Most users figure this out pretty quickly anyways. I feel I am not familiar enough with data.table yet to contribute much, but I will think about this a bit more. – ilprincipe Jun 18 '12 at 17:11
  • hey, @MatthewDowle i must revisit this because i failed to install the 1.8.1 on our gentoo production server from the r-project.org repo (i see only mac and windows builds there). When is that expected to go to CRAN, or could I get the 1.8.1 source and compile it myself; or could you possibly elaborate on a 1.8.0 solution? I'd be very thankful on any feedback. – nikola Jun 25 '12 at 18:43
  • @nikola failed to install? What did you type? What was the error message? Obviously there is an error somewhere, but I'm not a mind reader. – Matt Dowle Jun 25 '12 at 20:19
  • @MatthewDowle sorry, i should've been clearer; I successfully installed it on OS X using `install.packages("data.table", repos="http://R-Forge.R-project.org")` but the same didn't work on a 64 bit gentoo linux machine, raising this warning message: _package 'data.table' is not available (for R version 2.15.0)_ – nikola Jun 25 '12 at 21:53
  • @nikola Ahah, thanks. If you read again the R-Forge page it does state it is only available in binary form for latest R I.e. 2.15.0. Happily I think it also says you can set method="source" and it should build ok for previous versions of R, or upgrade to R 2.15.0+. It isn't on CRAN yet because I need to iron out the minor bugs on the bug tracker. – Matt Dowle Jun 25 '12 at 22:10
  • @MatthewDowle thanks, but that's the thing - i am already running 2.15.0 on that machine too.. so it got me wondering - could it be that this http://download.r-forge.r-project.org/bin/ has no linux/unix index .. ? – nikola Jun 25 '12 at 22:18
  • anyway, i will try to build it from source and see what happens – nikola Jun 25 '12 at 22:21
  • @nikola. Now R 2.15.1 is out maybe it already requires 2.15.1, but I thought it was 2.15*. Or did you check build and check logs displayed on R-Forge for anything strange? – Matt Dowle Jun 26 '12 at 06:19
  • @MatthewDowle fyi the latest svn checkout compiled (R CMD INSTALL) under the gentoo linux r 2.15.0, thank you ! – nikola Jun 26 '12 at 12:31
  • @Nikola That's great news. Thanks for confirming. – Matt Dowle Jun 26 '12 at 12:55

I actually asked a similar question some time ago. data.table is really nice for this, especially now that := by group is implemented and a self join is no longer necessary, as illustrated above. The best solution from base R is ave(); tapply() can also be used.

This is similar to the solution above, using ave(). However, I highly recommend you look at data.table.

# count of rows in each (country, category) pair
df$count <- ave(x = df$object, df$country, df$category, FUN = length)
# divide each count by the number of rows in its country
df$weight <- ave(x = df$count, df$country, FUN = function(x) x/length(x))
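To sanity-check the result, the weights over unique (category, country) pairs should sum to 1 per country. A quick sketch, with the example data inlined as df:

```r
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)
df$count  <- ave(x = df$object, df$country, df$category, FUN = length)
df$weight <- ave(x = df$count, df$country, FUN = function(x) x/length(x))

# weights over unique (category, country) pairs should sum to 1 per country
u <- unique(df[, c("category", "country", "weight")])
tapply(u$weight, u$country, sum)  # GER RUS USA -> 1 1 1
```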
ilprincipe

I don't see a readable way to do it in one line, but it can be quite compact.

# Use table to get counts per (category, country) pair.
counts <- table(df[,2:3])
# Normalize each country's column to sum to 1.
weights <- t(t(counts)/colSums(counts))
# Look up each row's weight via 'matrix' selection by names.
df$weight <- weights[as.matrix(df[,2:3])]
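As a side note, the transpose-divide step is equivalent to prop.table with margin = 2, which some may find more readable. A sketch with the example data inlined:

```r
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)

counts  <- table(df[, 2:3])
weights <- prop.table(counts, margin = 2)  # each country's column sums to 1
df$weight <- weights[as.matrix(df[, 2:3])]
```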
gui11aume