Edit: there was a typo in my df creation, with a missing _ on the last value of MediaName; this is now corrected.

I want to create a new variable TrialId in a data frame as part of the value of another variable MediaName depending on the value of a third variable Phase, and thought I could do that using strsplit and ifelse within a dplyr::mutate as follows:

library(dplyr)

# Creating a simple data frame for the example
df <- data.frame(Phase = c(rep("Familiarisation",8),rep("Test",3)),
                 MediaName = c("Flip_A1_G1","Reg_B2_S1","Reg_A2_G1","Flip_B1_S1",
                               "Reg_A1_G2","Flip_B2_S2","Reg_A2_G2","Flip_B1_S2",
                               "HC_A1L","TC_B1R","RC_BL_2R"))

# Creating a new column
df <- df %>%
  mutate(TrialId = ifelse(Phase == "Familiarisation",
                          sapply(strsplit(MediaName, "_"), "[", 2),
                          sapply(strsplit(MediaName, "_"), "[", 1)))

The expected result being

> df$TrialId
[1] "A1" "B2" "A2" "B1" "A1" "B2" "A2" "B1" "HC" "TC" "RC"

However this gives me the following error because, I believe, of the strsplit:

Error in mutate_impl(.data, dots) : 
  Evaluation error: non-character argument.

I know from this SO question that I can easily solve my issue by defining, in this small example, my data frame as a tibble::data_frame, without knowing why this solves the issue. I can't do exactly that though as in my actual code df comes from reading a csv file (with read.csv()). I have been thinking that using df <- df %>% as_tibble() %>% mutate(...) would solve the issue in a similar way, but it doesn't (why?).

Is there a way to actually use tibble even when reading files? Or is there another way of achieving what I need to do, without using strsplit maybe?

I'm also reading on this other SO question that you can use tidyr::separate but it isn't doing exactly what I want as I need to keep either the first or second value depending on the value of Phase.

Arthur Spoon
  • Perhaps your columns have `factor` class. Try converting them to `character` before the `ifelse`, i.e. add `df %>% mutate_all(as.character) %>%` before the `mutate` call. – akrun Dec 06 '17 at 13:34
  • Damn it, this felt so obvious that I didn't even try it, but it works... I'm still interested in understanding why using `tibble::data_frame` makes it work but not using `as_tibble` though. – Arthur Spoon Dec 06 '17 at 13:40
  • The main reason is that `data_frame` by default gives `character` class to all non-numeric columns, while `data.frame` uses `stringsAsFactors = TRUE` by default. Since `as_tibble` won't change the column classes created by `data.frame`, the columns stay factors. – akrun Dec 06 '17 at 13:41
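
To illustrate the point made in the comments above, here is a minimal sketch of the class difference. The names `df_fct` and `df_chr` are made up for this example, and `stringsAsFactors = TRUE` is passed explicitly so the behaviour discussed here is reproduced whatever the default of the installed R version is.

```r
library(tibble)

# Hypothetical two-row example; stringsAsFactors = TRUE is set explicitly
# (an assumption of this sketch) to reproduce the factor conversion.
df_fct <- data.frame(MediaName = c("Flip_A1_G1", "HC_A1L"),
                     stringsAsFactors = TRUE)

class(df_fct$MediaName)             # "factor"
class(as_tibble(df_fct)$MediaName)  # still "factor": as_tibble keeps column classes

# Building the tibble directly never turns strings into factors
df_chr <- tibble(MediaName = c("Flip_A1_G1", "HC_A1L"))
class(df_chr$MediaName)             # "character"
```

So converting an existing data frame with `as_tibble` comes too late: the strings have already become factors by then.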

2 Answers

You can try:

library(tidyverse)
# your first data 
df_old <- data.frame(Phase = c(rep("Familiarisation",8),rep("Test",3)),
                 MediaName = c("Flip_A1_G1","Reg_B2_S1","Reg_A2_G1","Flip_B1_S1",
                               "Reg_A1_G2","Flip_B2_S2","Reg_A2_G2","Flip_B1_S2",
                               "HC_A1L","TC_B1R","RC_BL2R"))
df_old %>% 
  separate(MediaName, into=letters[1:3], sep="_", fill = "left", remove = FALSE) %>% 
  select(Phase, MediaName, TrialId=b)
             Phase  MediaName TrialId
1  Familiarisation Flip_A1_G1      A1
2  Familiarisation  Reg_B2_S1      B2
3  Familiarisation  Reg_A2_G1      A2
4  Familiarisation Flip_B1_S1      B1
5  Familiarisation  Reg_A1_G2      A1
6  Familiarisation Flip_B2_S2      B2
7  Familiarisation  Reg_A2_G2      A2
8  Familiarisation Flip_B1_S2      B1
9             Test     HC_A1L      HC
10            Test     TC_B1R      TC
11            Test    RC_BL2R      RC

This is a hardcoded solution based on the provided sample data. Separate by "_"; if a value splits into only two pieces instead of three, fill with NA from the left side. Finally, select the columns you need.

Edit

With your new data it is somewhat more complicated, but you can try:

df %>% 
  add_column(MediaName_keep=df$MediaName) %>% 
  group_by(MediaName_keep) %>% 
  separate_rows(MediaName, sep="_") %>% 
  mutate(n=1:n()) %>% 
  filter((Phase == "Familiarisation" & n == 2) | (Phase == "Test" & n == 1)) %>% 
  select(Phase, MediaName=MediaName_keep, TrialId=MediaName)
# A tibble: 11 x 3
# Groups:   MediaName [11]
             Phase  MediaName TrialId
            <fctr>     <fctr>   <chr>
 1 Familiarisation Flip_A1_G1      A1
 2 Familiarisation  Reg_B2_S1      B2
 3 Familiarisation  Reg_A2_G1      A2
 4 Familiarisation Flip_B1_S1      B1
 5 Familiarisation  Reg_A1_G2      A1
 6 Familiarisation Flip_B2_S2      B2
 7 Familiarisation  Reg_A2_G2      A2
 8 Familiarisation Flip_B1_S2      B1
 9            Test     HC_A1L      HC
10            Test     TC_B1R      TC
11            Test   RC_BL_2R      RC

The idea is the same. Separate, but this time into rows: keep a copy of the original value in MediaName_keep, number the new rows within each MediaName_keep group, then filter according to your needs.

Roman
  • I really like this neat answer, however it doesn't work because of a typo in my `df` definition (corrected now): I sometimes do have two `_` in `MediaName` when `Phase == "Test"`, but still need the first, not second, value in those cases... :/ – Arthur Spoon Dec 06 '17 at 13:56

The problem you encountered arises because the strings were automatically converted to a factor, and strsplit() cannot be applied to a non-character object. My solution simply converts MediaName to character first.

library(dplyr)
df <- df %>%
  # Convert the factor back to character via its levels
  dplyr::mutate(MediaName = as.character(levels(df$MediaName))[df$MediaName]) %>%
  # Keep the 2nd piece for Familiarisation rows, the 1st piece otherwise
  dplyr::mutate(TrialId = ifelse(Phase == "Familiarisation",
                                 sapply(strsplit(MediaName, "_"), "[", 2),
                                 sapply(strsplit(MediaName, "_"), "[", 1)))

solution<- c("A1", "B2", "A2", "B1", "A1", "B2", "A2", "B1", "HC", "TC", "RC")
identical(solution, df$TrialId)
[1] TRUE
Seymour
  • Can you expand your answer? I don't see how to do that without `mutate`... – Arthur Spoon Dec 06 '17 at 13:41
  • @ArthurSpoon I am not presenting you a `mutate` solution; however, my idea solves your problem. Furthermore, consider using `df$MediaName <- as.character(levels(df$MediaName))[df$MediaName]` because your `MediaName` is stored as a `factor`, and `strsplit` needs a string! – Seymour Dec 06 '17 at 13:44
  • Yeah I know that, @akrun has made that point earlier. I just assumed that `strsplit` would behave in the same way as `grepl`, accepting a factor, when it actually doesn't (and the documentation says it, whoops). – Arthur Spoon Dec 06 '17 at 13:48
  • What I did was simply to add `mutate(MediaName = as.character(levels(df$MediaName))[df$MediaName]) %>%` before your code :) – Seymour Dec 06 '17 at 13:52
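
For reference, the difference between `grepl` and `strsplit` mentioned in the comments can be checked with a small base R sketch. The vectors `media` and `phase` are hypothetical stand-ins for the two columns of `df`, and the plain `as.character()` conversion is a simpler alternative to the `levels()` form used in the answer.

```r
# Hypothetical stand-ins for df$MediaName (as a factor) and df$Phase
media <- factor(c("Flip_A1_G1", "HC_A1L", "RC_BL_2R"))
phase <- c("Familiarisation", "Test", "Test")

grepl("_", media)                  # works: grepl coerces its input to character

# strsplit refuses a factor; try() captures the error instead of stopping
res <- try(strsplit(media, "_"), silent = TRUE)
inherits(res, "try-error")         # TRUE

# Converting to character first makes the original ifelse approach work
parts <- strsplit(as.character(media), "_")
trial_id <- ifelse(phase == "Familiarisation",
                   sapply(parts, "[", 2),   # second piece
                   sapply(parts, "[", 1))   # first piece
trial_id                           # "A1" "HC" "RC"
```

For data read from a csv file, passing `stringsAsFactors = FALSE` to `read.csv()` keeps text columns as character, so no conversion is needed afterwards.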