0

Let's say I have a data frame df that looks like this:

df = data.frame(c("A", "A", "B", "B", "C", "D", "D", "D", "E"), 
        c(0.1, 0.3, 0.1, 0.8, 0.4, 0.7, 0.5, 0.2, 0.1),
        c("v1", "v2", "v1", "v3", "v4", "v2", "v3", "v4", "v2"))

colnames(df) = c("entry", "value", "point")
df = df[order(df$entry, -df$value),]

df
   entry value point
2     A   0.3    v2
1     A   0.1    v1
4     B   0.8    v3
3     B   0.1    v1
5     C   0.4    v4
6     D   0.7    v2
7     D   0.5    v3
8     D   0.2    v4
9     E   0.1    v2

I would like to convert it eventually into a matrix of "ranked lists", that has as rows the unique values in the entry column and the number of columns should be equal to the maximum number of unique elements in the point column for a given entry. In this example it would be 3. Each row should be populated with the corresponding values from the point column, sorted descendingly based on the corresponding elements in value (e.g., row A should have v2 as value in the first column). In case an entry has less points than the number of columns in the matrix, the rest of the row should be filled with NAs.

So, the expected output should look something like this:

>df
   1   2   3  
A  v2  v1  NA 
B  v3  v1  NA 
C  v4  NA  NA 
D  v2  v3  v4
E  v2  NA  NA

So far I have tried to create some sort of contingency table using

with(df, table(df$point, df$entry))

but of course my actual data is in the order of millions of entries, and the above command raises to huge amounts of RAM even when subsetting to 100 entries with a couple hundreds of unique points. I have also tried

xtabs(~ entry + point, data=df)

with the same results on my real data. Next I have tried to split it into ordered lists using

df = split(df$point, df$entry)

which works fine and it is fast enough, buuuuut.. now I have problems converting it to the result matrix. Something along those lines probably

matrix(sapply(df, function(x) unlist(x)), nrow=length(df), ncol=max(sapply(df, length)))

or first initialize a matrix and do some rbind or something?

res = matrix(NA, nrow=length(df), ncol=max(sapply(df, length)))
rownames(res) = names(df)
....

Can you please assist?

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Marius
  • 990
  • 1
  • 14
  • 34
  • 2
    Seems like this may do it: [Transpose / reshape dataframe without “timevar” from long to wide format](https://stackoverflow.com/questions/11322801/transpose-reshape-dataframe-without-timevar-from-long-to-wide-format), e.g. `dcast(df, entry ~ rowid(entry), value.var = "point")` – Henrik Jan 03 '19 at 14:24
  • @Henrik It's along those lines, but the problem with `reshape` is that it keeps as columns all the unique `value`s and not the maximum number of elements for a given `entry`. Plus, they are not sorted. If you look at the expected output, maybe it is clearer – Marius Jan 03 '19 at 14:30
  • 1
    At least for the example provided in the post, my suggestion gives the desired output. And sorry, I forgot `library(data.table)`, `setDT(df)` before the `dcast`. – Henrik Jan 03 '19 at 14:32
  • @Henrik: it doesn't really. it produces (only first line as example): `entry point.0.3 point.0.1 point.0.8 point.0.4 point.0.7 point.0.5 point.0.2` `2 A v2 v1 ` It should be only `A v2 v1 NA` .. not 7 columns, but thank you for the suggestion – Marius Jan 03 '19 at 14:34
  • Forgot to mention (again..;): I started with your sorted data. (if you use `setDT` and `dcast`, you can just as well sort it with `setorder(df, entry, -value)`). – Henrik Jan 03 '19 at 14:51
  • @Henrik Yes, like that indeed it generates the desired output :] You can still set it as an answer if you will or modify the comment accordingly. Thank you for the help! – Marius Jan 03 '19 at 14:55
  • No, I prefer not to reiterate answers already provided elsewhere. – Henrik Jan 03 '19 at 14:57
  • @Henrik: well, you should. It proved to be the most stable and the fastest solution among all the suggested ones. I would accept it as **the** answer. Thank you once again! – Marius Jan 03 '19 at 15:42

2 Answers2

2

With dplyr:

df %>% 
   group_by(entry) %>% 
   mutate(unq=rank(rev(value))) %>% 
   select(-value) %>% 
   tidyr::spread(unq,point)
# A tibble: 5 x 4
# Groups:   entry [5]
  entry `1`   `2`   `3`  
  <fct> <fct> <fct> <fct>
1 A     v2    v1    NA   
2 B     v3    v1    NA   
3 C     v4    NA    NA   
4 D     v2    v3    v4   
5 E     v2    NA    NA   
jyjek
  • 2,627
  • 11
  • 23
  • Perfect :] Can you please add also that it requires the `tidyr` package for the `spread` function. Thanks a bunch for the help – Marius Jan 03 '19 at 14:43
  • yeah, forgot about it – jyjek Jan 03 '19 at 14:47
  • Just a heads up: in case of millions of entries in your data frame you might very well get a `Error: Duplicate identifiers for rows` from the `tidyr::spread` function which will eventually force you to assign new row IDs via `tibble::rowid_to_column()` as the second line in the above command.. issue is known but not fixed – Marius Jan 03 '19 at 15:36
1

Consider using by to split by entry and build needed vectors. For same length rows in final matrix, add NA as needed where the below 3 can be changed to however many columns required.

vec_list <- by(df, df$entry, function(sub) {
    vec <- as.character(sub[order(-sub$value),]$point)    
    c(vec, rep(NA, 3 - length(vec)))    
})

final_matrix <- do.call(rbind, vec_list)

final_matrix
#   [,1] [,2] [,3]
# A "v2" "v1" NA  
# B "v3" "v1" NA  
# C "v4" NA   NA  
# D "v2" "v3" "v4"
# E "v2" NA   NA 

Rextester Demo

Parfait
  • 104,375
  • 17
  • 94
  • 125
  • 1
    Yeah, this works too. Maybe to avoid the hardcoded `3` it can be passed as a parameter via `max(table(df$entry))` or something along those lines. Thanks for this `r base` solution! – Marius Jan 03 '19 at 15:02