4

I have a factor variable which is composed of two substrings separated by a _, like string1_string2. I want to set factor levels of the prefix ("string1") and suffix ("string2") separately, and then define an overall set of factor levels for the concatenated string. In addition, the precedence of levels in the first vs the second substring may vary.


A small example of what I want to achieve:

# reproducible data

x <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B", "DBO_C", "PH_C", "COND_C"))

[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
Levels: COND_A COND_B COND_C DBO_A DBO_B DBO_C PH_A PH_B PH_C

If I don't define the factor levels, they will be ordered alphabetically. Now I want to set the levels of the strings on the left and right side of the _ separator, e.g.

  1. PH < COND < DBO on the left side (LHS).
  2. B < A < C on the right side (RHS).

In addition, I want to specify which side, LHS or RHS, has precedence over the other. Depending on which side has precedence, the overall order of levels will differ:

(1) If the LHS levels take precedence:

[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C

(2) If the RHS levels take precedence:

[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C

So far my only idea is to write them all out, like factor(x, levels = c(xx, xx, ...)), but I have more levels than shown above, so this would look ridiculous.

Note: I don't want to change the order of my data, only the order of the levels.

Darren Tsai

4 Answers

3

We can do this in base R. Use sub to strip one substring from the levels of the vector, and match to create a numeric index giving the position of each remaining substring in the custom order. Then reassign the levels of the factor by ordering the existing levels with the two matching indices:

i1 <- match(sub("_.*", "", levels(x)), c("PH", "COND", "DBO"))
i2 <- match(sub(".*_", "", levels(x)), c("B", "A", "C"))
factor(x, levels = levels(x)[seq_along(levels(x))[order(i1, i2)]])

For the second case, just swap the two indices inside order:

factor(x, levels = levels(x)[seq_along(levels(x))[order(i2, i1)]])
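
To make the indexing concrete, here is a small sketch that prints the intermediate values for the example data from the question. The name lv is introduced only for illustration, and the alphabetical default order of the levels assumes a typical locale.

```r
x <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B",
              "DBO_C", "PH_C", "COND_C"))

lv <- levels(x)  # default (alphabetical) levels: COND_A COND_B COND_C DBO_A ...

# position of each level's prefix and suffix in the custom orders
i1 <- match(sub("_.*", "", lv), c("PH", "COND", "DBO"))
i2 <- match(sub(".*_", "", lv), c("B", "A", "C"))

i1
# [1] 2 2 2 3 3 3 1 1 1
i2
# [1] 2 1 3 2 1 3 2 1 3

# sort by prefix rank first, break ties by suffix rank
lv[order(i1, i2)]
# [1] "PH_B"   "PH_A"   "PH_C"   "COND_B" "COND_A" "COND_C" "DBO_B"  "DBO_A"  "DBO_C"
```

Since seq_along(lv)[order(i1, i2)] is the same permutation as order(i1, i2), indexing lv directly gives the same levels as the longer expression above.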

For repeated use, this can be wrapped in a function:

f1 <- function(vec, lvls1, lvls2, flag = "former") {
  # rank each level's prefix and suffix in the custom orders
  i1 <- match(sub("_.*", "", levels(vec)), lvls1)
  i2 <- match(sub(".*_", "", levels(vec)), lvls2)

  # "former": the prefix has precedence; anything else: the suffix has precedence
  if (flag == "former") {
    factor(vec, levels = levels(vec)[order(i1, i2)])
  } else {
    factor(vec, levels = levels(vec)[order(i2, i1)])
  }
}

f1(x, c("PH", "COND", "DBO"), c("B", "A", "C"))
#[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
#Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C


f1(x, c("PH", "COND", "DBO"), c("B", "A", "C"), flag = "latter")
#[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
#Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C
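
One thing that may be worth checking with larger data: match returns NA for any substring that is not listed in the custom order, and order puts NA last by default (na.last = TRUE). The sketch below uses a made-up prefix, "TEMP", which is not part of the question, purely to illustrate that behaviour.

```r
# "TEMP" is a hypothetical prefix that is missing from the custom order
x_na <- factor(c("DBO_A", "PH_A", "TEMP_A"))

i1_na <- match(sub("_.*", "", levels(x_na)), c("PH", "COND", "DBO"))
i1_na
# [1]  3  1 NA

# the unmatched level is kept, but ends up after all matched levels
levels(x_na)[order(i1_na)]
# [1] "PH_A"   "DBO_A"  "TEMP_A"

# a simple guard, if unmatched substrings should be treated as an error instead
anyNA(i1_na)
# [1] TRUE
```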
akrun
2

Using the CRAN package forcats, you can combine a list of factors. The function below expects two input vectors, prefix and suffix, each in the order you want.
The sep argument defaults to "_", the separator in the question; you can pass another separator if needed.

library(forcats)

custom_fct <- function(prefix, suffix, sep = "_"){
  lst <- lapply(prefix, function(p){
    f <- paste(p, suffix, sep = sep)
    factor(f, levels = f)
  })
  fct_c(!!!lst)
}

x <- c("PH", "COND", "DBO")
y <- c("B", "A", "C")

custom_fct(x, y)

Edit.

Another way of seeing the problem, which I only understood after the OP's comment, is to take an input data vector x to be coerced to factor, plus two vectors, one of prefixes and one of suffixes. The following function builds such a factor and does not need an external package.

custom_fct2 <- function(x, prefix, suffix, sep = "_"){
  lst <- lapply(prefix, function(p){
    paste(p, suffix, sep = sep)
  })
  factor(x, levels = unlist(lst))
}

x <- c("DBO_A", "PH_A", "COND_A", "DBO_B",
       "PH_B", "COND_B", "DBO_C", "PH_C", "COND_C")
a <- c("PH", "COND", "DBO")
b <- c("B", "A", "C")

custom_fct2(x, a, b)
#[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C  
#[9] COND_C
#9 Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B ... DBO_C
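
custom_fct2 always gives the prefix precedence. A possible extension for the suffix-first case in the question is sketched below; the function name custom_fct3 and its precedence argument are hypothetical additions, not part of the answer above.

```r
# Sketch: like custom_fct2, but the caller chooses which side varies slowest.
# 'custom_fct3' and 'precedence' are illustrative names only.
custom_fct3 <- function(x, prefix, suffix, sep = "_",
                        precedence = c("prefix", "suffix")) {
  precedence <- match.arg(precedence)
  lv <- if (precedence == "prefix") {
    # outer loop over prefixes: PH_B PH_A PH_C COND_B ...
    unlist(lapply(prefix, function(p) paste(p, suffix, sep = sep)))
  } else {
    # outer loop over suffixes: PH_B COND_B DBO_B PH_A ...
    unlist(lapply(suffix, function(s) paste(prefix, s, sep = sep)))
  }
  factor(x, levels = lv)
}

x <- c("DBO_A", "PH_A", "COND_A", "DBO_B",
       "PH_B", "COND_B", "DBO_C", "PH_C", "COND_C")
a <- c("PH", "COND", "DBO")
b <- c("B", "A", "C")

levels(custom_fct3(x, a, b, precedence = "suffix"))
# [1] "PH_B"   "COND_B" "DBO_B"  "PH_A"   "COND_A" "DBO_A"  "PH_C"   "COND_C" "DBO_C"
```

Note that with this approach any value of x that is not one of the prefix/suffix combinations becomes NA, and combinations that do not occur in x remain as unused levels.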
Rui Barradas
2

Using data.table convenience functions tstrsplit and setorderv.

  1. Create a vector of (arbitrary) column names for the substrings (cols <- c("V1", "V2")).
  2. Convert the vector to a data.table (d <- data.table(x)).
  3. Split the vector into two columns ((cols) := tstrsplit(x, split = "_")).
  4. Set the factor levels of the substrings (factor(V1, levels = l1)).
  5. Order the data either by the first substring and then the second, or by the second and then the first (setorderv(d, if(prec == 1) cols else rev(cols))).
  6. Use the ordered column 'x' from the data.table as the factor levels of the vector 'x' (factor(x, levels = d$x)).

library(data.table)

f <- function(x, l1, l2, prec){
  cols <- c("V1", "V2")
  d <- data.table(x)
  d[ , (cols) := tstrsplit(x, split = "_")]
  d[ , `:=`(
    V1 = factor(V1, levels = l1),
    V2 = factor(V2, levels = l2))]
  setorderv(d, if(prec == 1) cols else rev(cols))
  factor(x, levels = d$x)
}

# First substring has precedence
f(x, l1 = c("PH", "COND", "DBO"), l2 = c("B", "A", "C"), prec = 1)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C

# Second substring has precedence
f(x, l1 = c("PH", "COND", "DBO"), l2 = c("B", "A", "C"), prec = 2)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C

A base alternative, in a similar vein, but placing the substrings in a matrix instead. Use standard regex to grab the substrings. Convert them to factors and set the levels. Create a column index (i <- c(1, 2, 1)[prec:(prec + 1)]). Order the levels of 'x' (as.character(x)[order(m[ , i[1]], m[ , i[2]])]).

f2 <- function(x, l1, l2, prec){
  m <- cbind(factor(sub("_.*", "", x), l1), factor(sub(".*_", "", x), l2))
  i <- c(1, 2, 1)[prec:(prec + 1)]
  factor(x, levels = as.character(x)[order(m[ , i[1]], m[ , i[2]])])
}

f2(x, l1 = c("PH", "COND", "DBO"), l2 = c("B", "A", "C"), prec = 1)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C

f2(x, l1 = c("PH", "COND", "DBO"), l2 = c("B", "A", "C"), prec = 2)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C
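
A possible caveat, as far as I can tell: both functions above take the levels from the data vector itself, so they rely on every value occurring exactly once, as in the example. With repeated values, factor() would reject the duplicated levels. The sketch below is a hypothetical variant of the base function (the name f2u is made up) that wraps the ordered values in unique().

```r
# Sketch: same idea as the matrix approach, but tolerant of repeated values.
# 'f2u' is an illustrative name only.
f2u <- function(x, l1, l2, prec) {
  # cbind() on factors yields their integer codes, i.e. the custom ranks
  m <- cbind(factor(sub("_.*", "", x), l1), factor(sub(".*_", "", x), l2))
  # prec = 1 selects columns 1, 2; prec = 2 selects columns 2, 1
  i <- c(1, 2, 1)[prec:(prec + 1)]
  factor(x, levels = unique(as.character(x)[order(m[, i[1]], m[, i[2]])]))
}

x_rep <- factor(c("DBO_A", "PH_B", "DBO_A", "COND_A", "PH_B"))

levels(f2u(x_rep, c("PH", "COND", "DBO"), c("B", "A", "C"), prec = 1))
# [1] "PH_B"   "COND_A" "DBO_A"
```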
Henrik
-1

How about something like

x <- with(expand.grid(x = c("DBO", "PH", "COND"), y = c("A", "B", "C")),
          factor(paste(x, y, sep = "_"), levels = paste(x, y, sep = "_")))

You don't need to write out every possible level, just the levels of one side and the other.
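
For the orders asked for in the question, the same expand.grid idea could be applied to the levels only, leaving the data untouched. This is a sketch under the assumption that every value in the data is one of the prefix/suffix combinations; the names g, lv, dat and dat2 are illustrative. expand.grid varies its first argument fastest, so the side with precedence goes second.

```r
dat <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B",
                "DBO_C", "PH_C", "COND_C"))

# LHS precedence: the suffix (first argument) varies fastest within each prefix
g <- expand.grid(rhs = c("B", "A", "C"), lhs = c("PH", "COND", "DBO"),
                 stringsAsFactors = FALSE)
lv <- paste(g$lhs, g$rhs, sep = "_")

# only the levels change; the order of the data stays as it was
dat2 <- factor(dat, levels = lv)
levels(dat2)
# [1] "PH_B"   "PH_A"   "PH_C"   "COND_B" "COND_A" "COND_C" "DBO_B"  "DBO_A"  "DBO_C"
```

Swapping the two arguments of expand.grid (prefix first, suffix second) should give the RHS-precedence order instead.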

Joseph Clark McIntyre