1

I have a data frame with one column (x1) containing string values. I am using these string values to modify the corresponding logical values of other columns in the data frame (the other columns are named after possible sub-string values in column x1's strings - i.e., 'Dog', 'Cat', 'Bird').

I've already figured out how to use strsplit() to parse each string in column x1. I also know how to mutate/modify the other three columns based on those strsplit() results.

What I'm currently stuck on is how to apply the for loop below to each row in my data frame.

x0 <- c(1,2,3,4,5)
x1 <- c("Dog, Cat", "Cat", "Dog, Bird", "Cat, Bird, Dog", "Cat, Bird")
Dog <- c(rep(FALSE, 5))
Cat <- c(rep(FALSE, 5))
Bird <- c(rep(FALSE, 5))

example_df <- data.frame(x0, x1, Dog, Cat, Bird)

for(i in 1:length(strsplit(example_df$x1) )){
  example_df[[strsplit(example_df$x1)[i]]] <- TRUE
}

So for the example above, I want my code to change the first row of my data frame to have example_df$Dog and example_df$Cat to both be TRUE, but example_df$Bird would still be FALSE for that row. The second row would only have example_df$Cat as TRUE, etc.

Another note: for the example above, I've only provided three animal string values. But I'm looking for a method that will scale adequately for large number of string values. I know it's possible to use copy and paste with this method:

example_df %>%
    mutate(Dog = str_contains(x1, "Dog"))

But unfortunately, this method is not going to scale well if I have 10 or 20 possible animal substring values in column x1.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214

2 Answers2

0

We can use mtabulate after strsplit

library(qdapTools)
m1 <- mtabulate(strsplit(example_df$x1, ",\\s+")) > 0
example_df[names(example_df)[3:5]] <- m1[, names(example_df)[3:5]]

-output

example_df
#  x0             x1   Dog   Cat  Bird
#1  1       Dog, Cat  TRUE  TRUE FALSE
#2  2            Cat FALSE  TRUE FALSE
#3  3      Dog, Bird  TRUE FALSE  TRUE
#4  4 Cat, Bird, Dog  TRUE  TRUE  TRUE
#5  5      Cat, Bird FALSE  TRUE  TRUE

Or another option is cSplit_e

library(splitstackshape)
cSplit_e(example_df[1:2], "x1", sep = ",", type = "character", fill = 0, drop = TRUE)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • When I check example_df after ```example_df[names(example_df)[3:5] <- m1[, names(example_df)[3:5]]``` the changes do not appear in the data frame? – rlang_dog_cat Sep 22 '20 at 19:02
  • @rlang_dog_cat there was a typo in my code, i.e. a missing `]` Can you check now. thanks – akrun Sep 22 '20 at 19:03
  • Thanks, it worked! I'm also curious - are there alternative methods if the column indices of the string value names aren't known in advance? For example if I do not know Dog, Cat, and Bird are columns 3 through 5 in example_df? I know that you can always manually find the column indices, but I was just wondering if there's the case that the columns needed are spread out through the data frame or if there's too many columns to manually check etc. – rlang_dog_cat Sep 22 '20 at 19:24
  • @rlang_dog_cat i added another method with `cSplit_e` – akrun Sep 22 '20 at 19:33
0

Adapting this answer to your example. You can use separate_rows then pivot_wider.

library(tidyverse)

example_df %>%
  select(x0, x1) %>%
  separate_rows(x1) %>%
  pivot_wider(names_from = x1, 
              values_from = x1, 
              values_fn = list(x1 = is.character),
              values_fill = list(x1 = FALSE))


#----------
# A tibble: 5 x 4
     x0 Dog   Cat   Bird 
  <dbl> <lgl> <lgl> <lgl>
1     1 TRUE  TRUE  FALSE
2     2 FALSE TRUE  FALSE
3     3 TRUE  FALSE TRUE 
4     4 TRUE  TRUE  TRUE 
5     5 FALSE TRUE  TRUE 

nniloc
  • 4,128
  • 2
  • 11
  • 22