Custom factor levels in a concatenated string

Question

I have a factor variable which is composed of two substrings separated by a _, like string1_string2. I want to set factor levels of the prefix ("string1") and suffix ("string2") separately, and then define an overall set of factor levels for the concatenated string. In addition, the precedence of levels in the first vs the second substring may vary.

A small example of what I want to achieve:

# reproducible data

x <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B", "DBO_C", "PH_C", "COND_C"))

[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
Levels: COND_A COND_B COND_C DBO_A DBO_B DBO_C PH_A PH_B PH_C

If I don't define the factor levels, they will be ordered alphabetically. Now I want to set the levels of the strings on the left and right side of the _ separator, e.g.

PH < COND < DBO on the left side (LHS).
B < A < C on the right side (RHS).

In addition, I want to specify which side, LHS or RHS, has precedence over the other. Depending on which side has precedence, the overall order of levels will differ:

(1) If the LHS levels take precedence:

[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C

(2) If the RHS levels take precedence:

[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C

The only approach I can think of is factor(x, levels = c(xx, xx, ...)), but my real data has many more levels than shown above, so writing them all out by hand is impractical.

Note: I don't want to change the order of my data, only the order of the levels.

akrun · Accepted Answer · 2019-01-06T18:55:46.697

We can do this in base R. Use sub to strip one of the substrings from the levels of the vector, then use match to turn what remains into a numeric index giving its position in the custom order. Finally, reassign the levels of the factor by ordering the existing levels on those indices.

i1 <- match(sub("_.*", "", levels(x)), c("PH", "COND", "DBO"))
i2 <- match(sub(".*_", "", levels(x)), c("B", "A", "C"))
factor(x, levels = levels(x)[seq_along(levels(x))[order(i1, i2)]])

For the second case, just swap the two indices in order():

factor(x, levels = levels(x)[seq_along(levels(x))[order(i2, i1)]])
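To see why this works, here is a minimal sketch of the intermediate values, using the data from the question. The names lv, lhs_first and rhs_first are only illustrative. The sketch indexes the levels directly with order(), which should be equivalent to the seq_along form used above.

```r
x <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B",
              "DBO_C", "PH_C", "COND_C"))

lv <- levels(x)  # default (alphabetical) levels

# Position of each level's prefix and suffix in the custom orders
i1 <- match(sub("_.*", "", lv), c("PH", "COND", "DBO"))
i2 <- match(sub(".*_", "", lv), c("B", "A", "C"))

i1  # 2 2 2 3 3 3 1 1 1
i2  # 2 1 3 2 1 3 2 1 3

# order() sorts on its first argument and breaks ties with the second
lhs_first <- factor(x, levels = lv[order(i1, i2)])
rhs_first <- factor(x, levels = lv[order(i2, i1)])

levels(lhs_first)
levels(rhs_first)
```

Only the levels change; as.character(lhs_first) is still the data in its original order.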

For repeated use, this can be wrapped in a function:

f1 <- function(vec, lvls1, lvls2, flag = "former") {
  i1 <- match(sub("_.*", "", levels(vec)), lvls1)
  i2 <- match(sub(".*_", "", levels(vec)), lvls2)

  if (flag == "former") {
    factor(vec, levels = levels(vec)[seq_along(levels(vec))[order(i1, i2)]])
  } else {
    factor(vec, levels = levels(vec)[seq_along(levels(vec))[order(i2, i1)]])
  }
}

f1(x, c("PH", "COND", "DBO"), c("B", "A", "C"))
#[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
#Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C


f1(x, c("PH", "COND", "DBO"), c("B", "A", "C"), flag = "latter")
#[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
#Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C

hey, your method is great but you change my original data. You can compare my expected data and your output. — Darren Tsai, Jan 06 '19 at 18:55
@DarrenTsai Sorry, I forgot to check the output. Thanks for pointing that out. Fixed it — akrun, Jan 06 '19 at 19:00

Rui Barradas · Answer 2 · 2019-01-06T18:58:29.483

Using CRAN package forcats you can combine a list of factors. The function below expects as input 2 vectors, prefix and suffix, in the order you want them.
Argument sep = "_" has its default set to the separator in the question. You can pass another separator if you want to.

library(forcats)

custom_fct <- function(prefix, suffix, sep = "_"){
  lst <- lapply(prefix, function(p){
    f <- paste(p, suffix, sep = sep)
    factor(f, levels = f)
  })
  fct_c(!!!lst)
}

x <- c("PH", "COND", "DBO")
y <- c("B", "A", "C")

custom_fct(x, y)

Edit.

Another way of seeing the problem, which I only understood after the OP's comment, is to have an input data vector x to be coerced to factor, plus 2 vectors, one of prefixes and one of suffixes. The following function creates such a factor and does not need an external package.

custom_fct2 <- function(x, prefix, suffix, sep = "_"){
  lst <- lapply(prefix, function(p){
    paste(p, suffix, sep = sep)
  })
  factor(x, levels = unlist(lst))
}

x <- c("DBO_A", "PH_A", "COND_A", "DBO_B",
       "PH_B", "COND_B", "DBO_C", "PH_C", "COND_C")
a <- c("PH", "COND", "DBO")
b <- c("B", "A", "C")

custom_fct2(x, a, b)
#[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C  
#[9] COND_C
#9 Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B ... DBO_C
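As written, custom_fct2 always gives the prefix precedence. A minimal sketch of one way to support both orderings follows. The function name custom_fct3 and its first argument are hypothetical, not part of the original answer.

```r
# Hypothetical variant of custom_fct2; `first` is an assumed argument name
custom_fct3 <- function(x, prefix, suffix, sep = "_",
                        first = c("prefix", "suffix")) {
  first <- match.arg(first)
  lv <- if (first == "prefix") {
    # For each prefix in turn, run through all suffixes
    unlist(lapply(prefix, function(p) paste(p, suffix, sep = sep)))
  } else {
    # For each suffix in turn, run through all prefixes
    unlist(lapply(suffix, function(s) paste(prefix, s, sep = sep)))
  }
  factor(x, levels = lv)
}

x <- c("DBO_A", "PH_A", "COND_A", "DBO_B",
       "PH_B", "COND_B", "DBO_C", "PH_C", "COND_C")
a <- c("PH", "COND", "DBO")
b <- c("B", "A", "C")

res_prefix <- custom_fct3(x, a, b)
res_suffix <- custom_fct3(x, a, b, first = "suffix")

levels(res_suffix)
```

Note that the levels are built from every prefix/suffix combination, so any value of x that is not one of those combinations becomes NA.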

`x` in my question is a given data. What I want is how to use `x` to get the two expected data. — Darren Tsai, Jan 06 '19 at 18:44
@DarrenTsai I don't understand, `x` is given, yes, and `y` is not? You want to get `y` from `x`? The function in my answer does what you describe in the question. — Rui Barradas, Jan 06 '19 at 18:49
But how can I get my second expected output? `custom_fct2(x, b, a)` ? — Darren Tsai, Jan 06 '19 at 19:12

Henrik · Answer 3 · 2019-01-06T23:21:55.983

Using data.table convenience functions tstrsplit and setorderv.

Create a vector of (arbitrary) column names for the substrings (cols <- c("V1", "V2")). Convert the vector to a data.table (d <- data.table(x)). Split the vector into two columns ((cols) := tstrsplit(x, split = "_")). Set factor levels of substrings (factor(V1, levels = l1)). Order data either by the first substring then the second substring, or by the second and then the first (setorderv(d, if(prec == 1) cols else rev(cols))). Use the ordered column 'x' from the data.table as factor levels of the vector 'x' (factor(x, levels = d$x)).

library(data.table)

f <- function(x, l1, l2, prec){
  cols <- c("V1", "V2")
  d <- data.table(x)
  d[ , (cols) := tstrsplit(x, split = "_")]
  d[ , `:=`(
    V1 = factor(V1, levels = l1),
    V2 = factor(V2, levels = l2))]
  setorderv(d, if(prec == 1) cols else rev(cols))
  factor(x, levels = d$x)
}

# First substring has precedence
f(x, l1 = c("PH", "COND", "DBO"), l2 = c("B", "A", "C"), prec = 1)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C

# Second substring has precedence
f(x, l1 = c("PH", "COND", "DBO"), l2 = c("B", "A", "C"), prec = 2)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C

A base alternative in a similar vein, but placing the substrings in a matrix instead. Use a standard regex to grab the substrings. Convert to factor and set levels. Create a column index (i <- c(1, 2, 1)[prec:(prec + 1)]). Order the levels of 'x' (as.character(x)[order(m[ , i[1]], m[ , i[2]])]).

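f2 <- function(x, l1, l2, prec){
  # cbind() on factors keeps their integer codes, i.e. positions in l1 / l2
  m <- cbind(factor(sub("_.*", "", x), l1), factor(sub(".*_", "", x), l2))
  i <- c(1, 2, 1)[prec:(prec + 1)]
  factor(x, levels = as.character(x)[order(m[ , i[1]], m[ , i[2]])])
}

l1 <- c("PH", "COND", "DBO")
l2 <- c("B", "A", "C")

f2(x, l1, l2, prec = 1)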
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C

f2(x, l1, l2, prec = 2)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C
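Both functions above take the levels from the values of x itself, which works for the example because each value occurs once. With repeated values the levels would be duplicated, which factor() rejects in current R versions. A minimal sketch of a variant that works on the unique values follows. The name f3 is hypothetical, and the sketch uses match on the substrings rather than the matrix of factor codes.

```r
# Hypothetical variant for data with repeated values
f3 <- function(x, l1, l2, prec) {
  u <- unique(as.character(x))          # candidate levels, no duplicates
  k1 <- match(sub("_.*", "", u), l1)    # position of prefix in l1
  k2 <- match(sub(".*_", "", u), l2)    # position of suffix in l2
  o <- if (prec == 1) order(k1, k2) else order(k2, k1)
  factor(x, levels = u[o])
}

x_dup <- c("DBO_A", "PH_B", "DBO_A", "COND_C", "PH_B")

r1 <- f3(x_dup, c("PH", "COND", "DBO"), c("B", "A", "C"), prec = 1)
r2 <- f3(x_dup, c("PH", "COND", "DBO"), c("B", "A", "C"), prec = 2)

levels(r1)  # PH_B COND_C DBO_A
levels(r2)  # PH_B DBO_A COND_C
```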

It's what I need. Thank you so much for this answer and the revise to my question. — Darren Tsai, Jan 06 '19 at 20:49
I see it. It works as I want. I'm grateful to your kindness so much. — Darren Tsai, Jan 07 '19 at 05:45

score -1 · Answer 4 · answered Jan 06 '19 at 18:19

How about something like

x <- with(expand.grid(x = c("DBO", "PH", "COND"), y = c("A", "B", "C")),
          factor(paste(x, y, sep = "_"), levels = paste(x, y, sep = "_")))

You don't need to write out every possible level, just the levels of one side and the other.
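The same expand.grid idea can also be applied to an existing vector instead of generating new data. The following is a minimal sketch under that assumption; the names g, lv_lhs and lv_rhs are only illustrative. Because expand.grid varies its first argument fastest, the side that should take precedence is passed last.

```r
x <- c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B",
       "DBO_C", "PH_C", "COND_C")
prefix <- c("PH", "COND", "DBO")
suffix <- c("B", "A", "C")

# LHS precedence: suffix varies fastest, prefix slowest
g <- expand.grid(suffix = suffix, prefix = prefix, stringsAsFactors = FALSE)
lv_lhs <- paste(g$prefix, g$suffix, sep = "_")

# RHS precedence: prefix varies fastest, suffix slowest
g <- expand.grid(prefix = prefix, suffix = suffix, stringsAsFactors = FALSE)
lv_rhs <- paste(g$prefix, g$suffix, sep = "_")

# The data keep their order; only the levels differ
f_lhs <- factor(x, levels = lv_lhs)
f_rhs <- factor(x, levels = lv_rhs)
```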

Joseph Clark McIntyre


Can you say how not? Given how you used expand.grid, paste will order first by x and second by y, which is how I understood what you needed. — Joseph Clark McIntyre, Jan 06 '19 at 18:28
The main point is not how I produce my `x`. It's just a reproducible data. Please read my expected output. — Darren Tsai, Jan 06 '19 at 18:31
