1

I have a data frame that looks like this:

   ID Time           Item
1 S001   P1           1/2/
2 S002   P1       2/10/7/9
3 S003   P1 1/2/4/5/6/10/9
4 S004   P1 1/2/5/6/10/7/9
5 S005   P1     1/2/10/7/9
6 S006   P1      2/5/6/7/9

I want to search the 'Item' column, and create a new column where if the Item column contains a 1, then the new column =1 and if the item column doesn't contain a 1, then the new column =0. This is similar to the grepl function, but I want it to input 1s and 0s instead of TRUE and FALSE.

That is, my data set will look like this:

ID Time           Item         Item1
    1 S001   P1           1/2/    1
    2 S002   P1       2/10/7/9    0
    3 S003   P1 1/2/4/5/6/10/9    1
    4 S004   P1 1/2/5/6/10/7/9    1
    5 S005   P1     1/2/10/7/9    1
    6 S006   P1      2/5/6/7/9    0

I want to do this all the way up to ten columns (the idea is to turn the 'Item' column into a matrix of ones and zeros).

       ID Time          Item Item1 Item2 Item3 Item4 Item5 Item6 Item7 
   1 S001   P1           1/2/    1   1    0     0     0     0     0    

   2 S002   P1       2/10/7/9    0   1    0     0     0     0     1     

   3 S003   P1 1/2/4/5/6/10/9    1   1    0     1     1     1     0   
Jaap
  • 81,064
  • 34
  • 182
  • 193
Bailey
  • 131
  • 2
  • 11
  • 1
    `library(tidyverse); df %>% mutate(Item_split = Item) %>% separate_rows(Item_split, convert = TRUE) %>% drop_na(Item_split) %>% mutate(Item_split = paste0('Item', Item_split), i = 1) %>% spread(Item_split, i, fill = 0L)`, though this is mostly a dupe of [Generate a dummy variable](https://stackoverflow.com/q/11952706/4497050) – alistaire Feb 13 '18 at 21:46

2 Answers2

1

A direct solution is simply to use str_detect or the equivalent grepl (not as familiar) and then use as.numeric to turn TRUE into 1 and FALSE into 0. EDIT: added some lookarounds to make the regular expression more robust. (?<!\\d)1(?!\\d) now checks to make sure there is not a digit before or after the 1. However, I think the second approach below is safer.

library(tidyverse)
tbl <- read_table2(
"  ID Time           Item
 S001   P1           1/2/
 S002   P1       2/10/7/9
 S003   P1 1/2/4/5/6/10/9
 S004   P1 1/2/5/6/10/7/9
 S005   P1     1/2/10/7/9
 S006   P1      2/5/6/7/9"
)

tbl %>% mutate(
  Item1 = as.integer(str_detect(Item, "(?<!\\d)1(?!\\d)"))
)
# A tibble: 6 x 4
  ID    Time  Item           Item1
  <chr> <chr> <chr>          <int>
1 S001  P1    1/2/               1
2 S002  P1    2/10/7/9           1
3 S003  P1    1/2/4/5/6/10/9     1
4 S004  P1    1/2/5/6/10/7/9     1
5 S005  P1    1/2/10/7/9         1
6 S006  P1    2/5/6/7/9          0

However, you can also use other tidyverse tools to do the end transformation you want. Here we:

  1. Use separate_rows to place each Item onto its own row,
  2. Remove the empty rows created by a line ending in / with filter,
  3. Add a presence column with mutate,
  4. spread the rows back out to make the grid of values that you want,
  5. Replace the NA values with 0.

spread will basically turn the values of Item into column headings, and then place the values of presence as values in those new columns, leaving NA in the blanks.

tbl %>%
  separate_rows(Item, sep = "/") %>%
  filter(Item != "") %>%
  mutate(present = 1) %>%
  spread(Item, present, sep = "") %>%
  mutate_all(function(x) replace(x, is.na(x), 0))
# A tibble: 6 x 10
  ID    Time  Item1 Item10 Item2 Item4 Item5 Item6 Item7 Item9
  <chr> <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 S001  P1     1.00   0     1.00  0     0     0     0     0   
2 S002  P1     0      1.00  1.00  0     0     0     1.00  1.00
3 S003  P1     1.00   1.00  1.00  1.00  1.00  1.00  0     1.00
4 S004  P1     1.00   1.00  1.00  0     1.00  1.00  1.00  1.00
5 S005  P1     1.00   1.00  1.00  0     0     0     1.00  1.00
6 S006  P1     0      0     1.00  0     1.00  1.00  1.00  1.00
Calum You
  • 14,687
  • 4
  • 23
  • 42
  • 3
    You'll need a more complete regex, as the `10` in the second row is getting recognized as a match for `1` – alistaire Feb 13 '18 at 21:47
1

A solution using and .

library(dplyr)
library(tidyr)

dat2 <- dat %>%
  separate_rows(Item, convert = TRUE) %>%
  mutate(Value = 1L) %>%
  complete(ID, Time, Item = 1:10, fill = list(Value = 0L)) %>%
  mutate(Item = paste0("Item", Item)) %>%
  spread(Item, Value) %>%
  select(ID, Time, paste0("Item", 1:10))
dat2
# # A tibble: 6 x 12
#   ID    Time  Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9 Item10
#   <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>  <int>
# 1 S001  P1        1     1     0     0     0     0     0     0     0      0
# 2 S002  P1        0     1     0     0     0     0     1     0     1      1
# 3 S003  P1        1     1     0     1     1     1     0     0     1      1
# 4 S004  P1        1     1     0     0     1     1     1     0     1      1
# 5 S005  P1        1     1     0     0     0     0     1     0     1      1
# 6 S006  P1        0     1     0     0     1     1     1     0     1      0

DATA

dat <- read.table(text = "   ID Time           Item
1 S001   P1           '1/2'
2 S002   P1       '2/10/7/9'
3 S003   P1 '1/2/4/5/6/10/9'
4 S004   P1 '1/2/5/6/10/7/9'
5 S005   P1     '1/2/10/7/9'
6 S006   P1      '2/5/6/7/9'",
                  header = TRUE, stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84