2

A survey was conducted and one of the questions had an option to select multiple answers. When selecting multiple answers, they were all recorded in the same cell.

Additionally, the way that this information was recorded in the cell was different for each surveyor. Sometimes the delimiter was a hyphen (-) and othertimes it was a front slash (/). Also some surveyors listed the items with numbers.

An example is a list of items in a house (see below/image). I would like to create columns when each item is available (the new columns could either have 1/0 or the item name/NA) (see below outcome example).

I could do this in excel using text-to-column and lookup arrays, but there are so many excel sheets with this same column that I'd have to do it in R. Sorry I don't know how to make the example table using R code, but hopefully someone would be able to help anyway.

This is what the data looks like:

House = c("h1","h2","h3","h4","h5","h6","h7","h8","h9","h10","h11")
Items = c("Chair", "Chair- Window/Glass- "," Door- Sofa-", "Chair- 
Window/Glass Frame- ", "1. Window/Glass Frame", "Chair- Door- Window-", "Chair- Sofa - Door- Table-", " 4. Table", "Couch (2)", "Window- Table- Chair- Sofa- Door- Couach", "2. Door / Chair")
table1 = as.data.table(House)
table2 = as.data.table(Items)
table = as.data.frame(append(table1, table2))

table

+-------+------------------------------------------+
| House |                  Items                   |
+-------+------------------------------------------+
|   001 | Chair                                    |
|   002 | Chair- Window/Glass-                     |
|   003 | Door- Sofa-                              |
|   004 | Chair- Window/Glass Frame-               |
|   005 | 1. Window/Glass Frame                    |
|   006 | Chair- Door- Window-                     |
|   007 | Chair- Sofa - Door- Table-               |
|   008 | 4. Table                                 |
|   009 | Couch (2)                                |
|   010 | Window- Table- Chair- Sofa- Door- Couach |
|   011 | 2. Door / Chair                          |
+-------+------------------------------------------+

My thought was to split using all delimiters (strsplit), remove whitespaces (trimws), get a unique list (unique), then replace all variations with the standard I want (grepl), and finally put them into columns according to categories.

items <- strsplit(df$Items, "[/.-]")
items <- trimws(items)
items <- df$Items %>%
    strsplit("[/.-]") %>%
    str_trim(side = "both")
items_list <- unique(items)

This is what I am trying to get: (Window and glass are the same, and chair/sofa/couch are the same, etc -- so I just need to create larger categories instead of having several columns of essentially the same thing)

Outcome

+-------+-------+--------+-------+------+
| House | Chair | Window | Table | Door |
+-------+-------+--------+-------+------+
|   001 | Chair |        |       |      |
|   002 | Chair | Window |       |      |
|   003 | Chair |        |       | Door |
|   004 | Chair | Window |       |      |
|   005 |       | Window |       |      |
|   006 | Chair | Window |       | Door |
|   007 | Chair |        | Table | Door |
|   008 |       |        | Table |      |
|   009 | Chair |        |       |      |
|   010 | Chair | Window | Table | Door |
|   011 | Chair |        |       | Door |
+-------+-------+--------+-------+------+
kt2630
  • 23
  • 4

1 Answers1

2

You can use str_detect (or grepl) in map_df (or sapply) to generate a dataframe of logicals, coerce those to integers 0/1, and then bind that to your original dataframe. This method bypasses the hassle of splitting/cleaning/etc. the data. It simply requires that you first create pattern groups for your regex, i.e. chair|sofa|couach|couch, window|glass:

library(stringr)
library(dplyr)
library(purrr)

# Create regex pattern groups.
patts <- c(chair = "chair|sofa|couach|couch", window = "window|glass", 
           table = "table", door = "door")

# Detect pattern groups, coerce to 0/1, bind to origional dataframe.
map_df(patts, ~ str_detect(df$Items, regex(., ignore_case = T))) %>%
    mutate_all(as.integer) %>% 
    bind_cols(df, .)

This returns the following dataframe:

# A tibble: 11 x 6
   House Items                                    chair window table  door
   <dbl> <chr>                                    <int>  <int> <int> <int>
 1     1 Chair                                        1      0     0     0
 2     2 "Chair- Window/Glass- "                      1      1     0     0
 3     3 " Door- Sofa-"                               1      0     0     1
 4     4 "Chair- Window/Glass Frame- "                1      1     0     0
 5     5 1. Window/Glass Frame                        0      1     0     0
 6     6 Chair- Door- Window-                         1      1     0     1
 7     7 Chair- Sofa - Door- Table-                   1      0     1     1
 8     8 " 4. Table"                                  0      0     1     0
 9     9 Couch (2)                                    1      0     0     0
10    10 Window- Table- Chair- Sofa- Door- Couach     1      1     1     1
11    11 2. Door / Chair                              1      0     0     1 

Data:

df <- tibble(House = c(1,2,3,4,5,6,7,8,9,10,11), Items = c("Chair", "Chair- Window/Glass- "," Door- Sofa-", "Chair- Window/Glass Frame- ", "1. Window/Glass Frame", "Chair- Door- Window-", "Chair- Sofa - Door- Table-", " 4. Table", "Couch (2)", "Window- Table- Chair- Sofa- Door- Couach", "2. Door / Chair"))
  • 1
    This is a great solution and really nice way of looking at the problem! Worked perfectly - thanks @gersht! Just out of curiosity (and to learn a bit more), instead of 0/1, is there an easy way to just have the column/category name in place of 1 and empty in the 0 (like my output example)? If this would be too complicated than that's ok, since this is perfect! – kt2630 Jun 27 '19 at 22:42
  • 1
    @kt2630 glad to hear it helps! To use the names just replace `mutate_all(as.integer)` with `map2_df(names(.), ~ ifelse(.x, .y, ""))`. Just be aware that this isn't the most useful format for the data. It's easier to work with TRUE/FALSE, or maybe with 1/0 for certain applications. If you only need the names for display purposes then that's fine, of course. –  Jun 28 '19 at 08:13