1

I'm new here. I have just started learning R.

I have this question:

Suppose I have a dataframe:

name = c("John", "John","John","John","Mark","Mark","Mark","Mark","Dave", "Dave","Dave","Dave")
color = c("red", "blue", "green", "yellow","red", "blue", "green", "yellow","red", "blue", "green", "yellow") 
value = c( 1,2,1,3,5,5,3,2,4,6,7,8)
df = data.frame(name, color, value)
#View(df)
df
#     name  color value
#  1  John    red     1
#  2  John   blue     2
#  3  John  green     1
#  4  John yellow     3
#  5  Mark    red     5
#  6  Mark   blue     5
#  7  Mark  green     3
#  8  Mark yellow     2
#  9  Dave    red     4
#  10 Dave   blue     6
#  11 Dave  green     7
#  12 Dave yellow     8

and I want it to look like this:

#   names red blue green yellow
#1   John   1    2     1      3
#2   Mark   5    5     3      2
#3   Dave   4    6     7      8

That is, the entries in the first column (name) will become unique and the levels in the second column (color) will be new columns and the entries that will be in these new columns will come from the corresponding rows in the third column (value) in the original data frame.

I can accomplish this using the following:

library(dplyr)
  df = df %>%
  group_by(name) %>%
  mutate(red = ifelse(color == "red", value, 0.0),
         blue = ifelse(color == "blue", value, 0.0),
         green = ifelse(color == "green", value, 0.0),
         yellow = ifelse(color == "yellow", value, 0.0)) %>%
  group_by(name) %>%
  summarise_each(funs(sum), red, blue, green, yellow)
df
    name   red  blue green yellow
1   Dave     4     6     7      8
2   John     1     2     1      3
3   Mark     5     5     3      2

But this would not be ideal if there are lots of levels in the color column. How would I go on doing that?

Thank you!

nicola
  • 24,005
  • 3
  • 35
  • 56
chowching
  • 113
  • 3

3 Answers3

4

As the OP is using dplyr family of packages, a nice option is with tidyr

library(tidyr)
spread(df, color, value)
#    name blue green red yellow
#1 Dave    6     7   4      8
#2 John    2     1   1      3
#3 Mark    5     3   5      2

If we need to use %>%

library(dplyr)
df %>% 
    spread(color, value)

To keep the order, we can convert the 'color' to factor class with levels specified as the unique values of 'color' and then do the spread

df %>%
   mutate(color = factor(color, levels = unique(color))) %>%
   spread(color, value)
#  name red blue green yellow
#1 Dave   4    6     7      8
#2 John   1    2     1      3
#3 Mark   5    5     3      2

Or we can use data.table with a faster dcast. Converting to data.table and using the dcast from data.table have advantages. It is much much faster than the dcast from reshape2.

library(data.table)
dcast(setDT(df), name~color, value.var="value")
#   name blue green red yellow
#1: Dave    6     7   4      8
#2: John    2     1   1      3
#3: Mark    5     3   5      2

NOTE: In both the solutions, we get the column names as in the expected output and do not have any uglier suffix or prefix attached to it (which BTW can be changed, but it is another line of code)


If we need a base R, one option is tapply

with(df, tapply(value, list(name, color), FUN = I))
#     blue green red yellow
#Dave    6     7   4      8
#John    2     1   1      3
#Mark    5     3   5      2
akrun
  • 874,273
  • 37
  • 540
  • 662
3

So you want a cross tab then?

> xtabs(value~name+color, df)
      color
name   blue green red yellow
  Dave    6     7   4      8
  John    2     1   1      3
  Mark    5     3   5      2
Neal Fultz
  • 9,282
  • 1
  • 39
  • 60
3

You can use dcast from reshape2 package

library(reshape2)
dcast(df, name~color)


#  name blue green red yellow
#1 Dave    6     7   4      8
#2 John    2     1   1      3
#3 Mark    5     3   5      2

Or else you can reshape from base R

reshape(df, idvar="name", timevar="color", direction="wide")


#  name value.red value.blue value.green value.yellow
#1 John         1          2           1            3
#5 Mark         5          5           3            2
#9 Dave         4          6           7            8
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213