0

I was playing with some data and trying to create a new data frame that contains key and value pairs that could be a dictionary. Here's some sample data and a quick manual solution.

df = data.frame(col1 = c("one", "one", "two", "two", "one"),
                col2 = c("AG", "AB", "AC", "AG", "AB"),
                col3 = c("F3", "F1", "F2", "F3", "F2") )
df
d1 = data.frame(vals = unique(df$col1))
d2 = data.frame(vals = unique(df$col2))
d3 = data.frame(vals = unique(df$col3))
d1 
d2 
d3
d1$name = "col1"
d2$name = "col2"
d3$name = "col3"
d1
d2
d3
rbind(d1,d2,d3)

Of course, this is a simple use case so real data is going to be a bit more mundane. For that reason, I was looking for a loop that could go through and set the key value pairs in a dictionary.

Most of my attempts have resulted in failure. Here's the format for my solution but I'm not sure how to dynamically create the new_df dictionary. Any suggestions?

new_df=data.frame()
prod.cols = c("col1", "col2", "col3")
for(col in prod.cols){
  if(col %in% colnames(df)){
    ## solution in here
  }
}

new_df
AGUY
  • 425
  • 1
  • 8
  • 13

3 Answers3

5

tidyr makes this easy:

library(tidyr)

df %>% gather(name, vals) %>% unique()

#    name vals
# 1  col1  one
# 3  col1  two
# 6  col2   AG
# 7  col2   AB
# 8  col2   AC
# 11 col3   F3
# 12 col3   F1
# 13 col3   F2
alistaire
  • 42,459
  • 4
  • 77
  • 117
4

alistaire's answer is quite elegant and readable. Just for fun, here's a base R approach. Not that efficiency is particularly important here, but this scales relatively well as more rows and columns are added:

My second and third approaches are nicer than my first, so I'm moving them to the top of the answer:

Approach # 2, implementing thelatemail's comment for a nice, efficient one-liner:

stack(lapply(df, function(ii) as.character(unique(ii))))

What's nice about this solution is that it first reduces the columns using unique, which makes less work for as.character and then for stack.


Approach # 3: more concise and more efficient version of approach 2 that avoids the need for unique and character conversion by using levels to deal with the factor columns:

stack(lapply(df, levels))

First approach:

Reduce(rbind,
    lapply(seq_along(df), 
        function(ii) data.frame(vals = unique(df[, ii]), name = names(df)[ii])
    )
)

# vals name
#1 one col1
#2 two col1
#3  AG col2
#4  AB col2
#5  AC col2
#6  F3 col3
#7  F1 col3
#8  F2 col3

Using do.call instead of Reduce is roughly equivalent here:

do.call(rbind,
    lapply(seq_along(df), 
        function(ii) data.frame(vals = unique(df[, ii]), name = names(df)[ii])
    )
) 
Jota
  • 17,281
  • 7
  • 63
  • 93
  • 2
    Looks like a `stack` operation to me - `unique(stack(lapply(df,as.character)))` – thelatemail Jun 26 '16 at 23:14
  • @thelatemail yea, you're right. I had tried using `stack` but was missing the character conversion step. I like this modification to avoid creating the larger structure before calling `unique`: `stack(lapply(df, function(ii) as.character(unique(ii))))`. Should be pretty efficient. Wanna add an answer? – Jota Jun 26 '16 at 23:19
  • 1
    Your modification is very clever - feel free to just edit it in if you like. I don't need more imaginary internet points :-) – thelatemail Jun 26 '16 at 23:26
  • yes, your stack answer is nice... you should move it to the top of your answer less it gets missed – user20650 Jul 05 '16 at 23:44
1

We can also do

library(reshape2)
unique(melt(as.matrix(df))[-1])
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    you can use the `id` arg to save the as.matrix call... but it is of course irrelevant ;) `unique(reshape2::melt(df, id=0))` – user20650 Jul 05 '16 at 21:13