I'm working with a string vector with a structure corresponding to the one below:

messy_vec <- c("0 - 9","100 - 150","21 - abc","50 - 56","70abc - 80")

I'd like to convert this vector to a factor whose levels are ordered by the leading number of each element. The code:

messy_vec_fac <- as.factor(messy_vec)

would produce

> messy_vec_fac
[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
Levels: 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80

whereas I'd like to obtain a factor with the following values and levels:

[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150

The levels follow the order:

0 21 50 70 100

that is, the leading numbers extracted from the elements of the messy vector.

Side points

This is not crucial, but ideally the solution should not assume a maximum number of digits in the first part of each element. The following values may occur:

  • 8787abc - 89898 deff - in this case the value 8787 should be used to determine the order
  • 001 def - 1111 OHMG - in this case the value 1 should be used to determine the order
  • It can be safely assumed that every vector element contains a hyphen surrounded by spaces, matching [[:space:]]-[[:space:]]
  • Duplicate values occur

Edits

Following a very useful suggestion by CathG, I'm trying to fit this solution into a longer dplyr pipeline:

# ... %>%
  mutate(very_needed_factor= factor(messy_vec,
                                      levels = messy_vec[
                                        order(
                                          as.numeric(
                                            sub("(\\d+)[^\\d]* - .*", "\\1",
                                                messy_vec)))]))
# %>% ...

But I keep getting the following warnings:

Warning messages:
1: In order(as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", c("12-14",  :
  NAs introduced by coercion
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
Konrad
  • Regarding your edit, maybe try `unique` to define the levels (as in the NB part of my answer)? That seems to be the cause of at least the second warning. For the first one, we may need your actual vector to reproduce it, because the vector you gave doesn't trigger it... (I do get a warning for duplicated levels if I duplicate a value and don't use unique, though) – Cath Nov 04 '15 at 13:05
  • Judging by your first warning message, you have a value with no spaces around the hyphen. If so, my regex cannot capture the digits correctly, but just modify the regex to drop the spaces and it should work – Cath Nov 04 '15 at 13:17
  • @CathG you are right, for brevity and reproducibility the *messy_vector* I created does not reflect the exact nature of the actual data. But as you said, the workaround would be really simple. – Konrad Nov 04 '15 at 13:19
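
The two causes discussed in these comments can be reproduced with a small sketch. The vector `x` below is hypothetical (the asker's real data is not shown); it only assumes one element without spaces around the hyphen and one duplicated element.

```r
# Hypothetical inputs: one element without spaces around the hyphen,
# plus a duplicated element (assumed, not the asker's real data).
x <- c("12-14", "0 - 9", "0 - 9")

# The pattern requires " - ", so "12-14" does not match
# and sub() returns it unchanged.
extracted <- sub("(\\d+)[^\\d]* - .*", "\\1", x)
print(extracted)

# Converting the unmatched string yields NA (the source of warning 1).
nums <- suppressWarnings(as.numeric(extracted))
print(is.na(nums))

# Reordering does not remove repeats, so the candidate levels
# still contain duplicates (the source of warning 2).
print(anyDuplicated(x[order(nums)]) > 0)
```

Checking `is.na()` on the extracted numbers before building the factor is a quick way to spot elements the pattern failed to match.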

2 Answers

If I understood correctly what you want to do, you can capture the leading digits of each string with sub, convert them to numeric, and use the result to order the levels in the factor call.

num_vec <- as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", messy_vec))
messy_vec_fac <- factor(messy_vec, levels=messy_vec[order(num_vec)])

messy_vec_fac
#[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
#Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150

NB: if the vector contains duplicated values, use levels=unique(messy_vec[order(num_vec)]) in the factor call.
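
As a minimal sketch of how the NB and the comment about spaces could be combined, the variant below keeps only the leading run of digits and de-duplicates the levels. The pattern `^\\s*(\\d+).*$` and the extra test elements are assumptions for illustration, not part of the original answer; the sketch also assumes every element starts with digits.

```r
# Hypothetical example data: adds a duplicate, an element without
# spaces around the hyphen, and one with leading zeros.
messy_vec <- c("0 - 9", "100 - 150", "21 - abc", "50 - 56",
               "70abc - 80", "12-14", "50 - 56", "001 def - 1111 OHMG")

# Keep only the leading run of digits; everything after it is dropped,
# so the spacing around the hyphen no longer matters.
num_vec <- as.numeric(sub("^\\s*(\\d+).*$", "\\1", messy_vec))

# Order the elements by that number, then de-duplicate to get the levels.
lvls <- unique(messy_vec[order(num_vec)])
messy_vec_fac <- factor(messy_vec, levels = lvls)

print(levels(messy_vec_fac))
```

Because the expression only refers to the vector itself, it can be wrapped in a small function and called on a column inside `mutate()`.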

Cath

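Here is another solution: split each element on the hyphen, strip spaces and letters from both parts, and order the levels by the first number, then by the second.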

library(magrittr)    
messy_vec <- c("0 - 9","100 - 150","21 - abc","50 - 56","70abc - 80")
ints <- strsplit(messy_vec, "-") %>% 
  unlist() %>% 
  gsub(pattern = "([[:space:]]|[[:alpha:]])*", replacement = "") %>% 
  as.integer() %>% 
  matrix(nrow = 2)
factor(messy_vec, levels = messy_vec[order(ints[1, ], ints[2, ])])
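
To make the intermediate matrix easier to inspect, here is a base R sketch of the same idea without magrittr. The helper `to_int` is a hypothetical name, and the sketch assumes each element contains exactly one hyphen.

```r
messy_vec <- c("0 - 9", "100 - 150", "21 - abc", "50 - 56", "70abc - 80")

# Split on the hyphen; assumes exactly one hyphen per element.
parts <- strsplit(messy_vec, "-", fixed = TRUE)

# Strip spaces and letters, then convert to integer;
# a part with no digits (e.g. " abc") becomes "" and then NA.
to_int <- function(p) as.integer(gsub("[[:space:][:alpha:]]", "", p))
ints <- vapply(parts, to_int, integer(2))

# Row 1 holds the first numbers, row 2 the second numbers.
print(ints)

res <- factor(messy_vec, levels = messy_vec[order(ints[1, ], ints[2, ])])
print(levels(res))
```

Using `vapply` with `integer(2)` makes the one-hyphen assumption explicit: an element that splits into more or fewer than two parts raises an error instead of silently misaligning the matrix.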
Thierry