0

I am trying to create a new column in my dataset. So far, I have imported a JSON file into R with a column full of different words ("purple", "red", "blue", etc.) each observation has some combination of these words. My goal is to create a new column with the header of the words apparent ("purple", "red", "blue", etc.). I want that column to have True or Falses depending on whether the observations exhibit that color. I tried to use the subset function as well as manually doing so but there are over 300 different observations, making that very inconvenient. I greatly appreciate any help!

For example:

Observations     Color
1                Blue
2                Red, Blue
3                Blue, Green
4                Purple
5                Yellow, Orange

and now I want

Observations     Color       Red       Yellow        Orange    Blue
1                Blue        False     False         False     True
2                Red, Blue   True      False         False     True

etc

This is my first question on this site so i apologize if there are any problems.

Amber
  • 31
  • 2
  • 5
  • 3
    Please see ["How to make a great R reproducible example?"](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It will help people reply to your question in a more useful manner. Also I'm pretty sure this has been asked a few times already... – nico Jun 30 '15 at 12:43
  • For instance: http://stackoverflow.com/questions/7658316/create-new-column-based-on-4-values-in-another-column and http://stackoverflow.com/questions/26278127/create-new-column-based-on-values-in-2-other-columns (your solution may be easier but the idea is the same). – nico Jun 30 '15 at 12:44
  • thank you for your help! – Amber Jun 30 '15 at 12:46
  • 1
    possible duplicate of [Split string in rows to create new columns](http://stackoverflow.com/questions/27630588/split-string-in-rows-to-create-new-columns) – user20650 Jun 30 '15 at 12:57
  • 2
    From duplicate... try `splitstackshape::cSplit_e(dat, "Color", ",", type = "character", fill = 0)` – user20650 Jun 30 '15 at 12:59

3 Answers3

2

You can simply iterate over the column names you want to create and use grepl to find if they are present in the Color column:

dat <- read.table(text="Observations     Color
1                Blue
                  2                Red,Blue
                  3                Blue,Green
                  4                Purple
                  5                Yellow,Orange", header=T, stringsAsFactors=F)
# I removed the space after the commas to facilitate the data.frame creation.

cols <- c("Red", "Yellow", "Orange", "Blue")

for (i in cols) dat[[i]] <- grepl(i, dat$Color)

Result:

> dat
  Observations         Color   Red Yellow Orange  Blue
1            1          Blue FALSE  FALSE  FALSE  TRUE
2            2      Red,Blue  TRUE  FALSE  FALSE  TRUE
3            3    Blue,Green FALSE  FALSE  FALSE  TRUE
4            4        Purple FALSE  FALSE  FALSE FALSE
5            5 Yellow,Orange FALSE   TRUE   TRUE FALSE

Edit:

If you want columns for all colors, a better way to create the vector is as proposed by Robert in the comments:

cols <- unique(unlist(strsplit(dat$Color, ",")))
#You might have to change from "," to ", " if you have white spaces after the commas
#or even ",\\s?" if they aren't always there.

The new result will be:

  Observations         Color   Red Yellow Orange  Blue Green Purple
1            1          Blue FALSE  FALSE  FALSE  TRUE FALSE  FALSE
2            2      Red,Blue  TRUE  FALSE  FALSE  TRUE FALSE  FALSE
3            3    Blue,Green FALSE  FALSE  FALSE  TRUE  TRUE  FALSE
4            4        Purple FALSE  FALSE  FALSE FALSE FALSE   TRUE
5            5 Yellow,Orange FALSE   TRUE   TRUE FALSE FALSE  FALSE
Molx
  • 6,816
  • 2
  • 31
  • 47
0

Try something like this:

example <- data.frame(colors=c("A,B", "A", "B", "F", "C", "C,G", "C", "D", "E", "F"),stringsAsFactors = F)
cols <- sort(unique(unlist(strsplit(example$colors, ",", fixed = TRUE))))
dummies= sapply(cols,function(co)grepl(co, example$colors))

          A     B     C     D     E     F     G
 [1,]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
 [2,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [3,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 [4,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [5,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [6,] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
 [7,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [8,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [9,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
Robert
  • 5,038
  • 1
  • 25
  • 43
0

The base R explanation is much simpler than I can achieve with dplyr, but for interest here's a dplyr solution:

cols <- unique(unlist(strsplit(dat$Color, ",", fixed = TRUE)))
dat %>% mutate_(.dots = sapply(cols, function(col) interp(~grepl(col, Color), col = col)))

And here's a way using plyr and magrittr:

cols %>% 
  laply(grepl, dat$Color) %>%
  t %>%
  data.frame %>%
  setNames(cols) %>%
  cbind(dat, .)

And another:

dat %>% adply(1, . %$%
                Color %>%
                strsplit(",") %>%
                extract2(1) %>%
                factor(levels = cols) %>%
                table %>%
                is_greater_than(0))

This makes use of the fact that magrittr lets you create an anonymous function chain.

Nick Kennedy
  • 12,510
  • 2
  • 30
  • 52