
I have a variant of a problem that "Reorder levels of a factor without changing order of values" does not answer: a variable in a dataset mixes numbers and strings (I know that this is undesirable, but it's there), like 4 8 16 64 128 default. When building the initial factor, the levels are kept in order (as found, which is sorted).

However, when I build subsets (which requires cleaning up stale levels), the levels are sorted as strings, like 128 16 4 64 8, even if the subset only contains numeric levels. This is bad when doing a boxplot(var ~ factor).

When I tried the solution from the question cited above, factor(var, levels=sort(var)), the levels ended up with duplicates.

Most similar answers assume the levels are known, which is not true in my case. How can I rebuild the factor so that the levels are sorted numerically?

Example:

> a<-c(1,3,5,7,2)
> b<-c(4,8,16,32,"default")
> df<-data.frame(a, b)
> df$b<-factor(df$b)
> str(df)
'data.frame':   5 obs. of  2 variables:
 $ a: num  1 3 5 7 2
 $ b: Factor w/ 5 levels "16","32","4",..: 3 4 1 2 5
> ss<-subset(df, b != "default")
> factor(ss$b)
[1] 4  8  16 32
Levels: 16 32 4 8
> factor(ss$b,levels=sort(ss$b))
[1] 4  8  16 32
Levels: 16 32 4 8
> ss$b<-factor(ss$b,levels=sort(ss$b))
> boxplot(ss$a ~ ss$b)
U. Windl
  • I understand that you always want a `factor` but you don't know which levels you are going to have. The question is: whatever the new levels are, do you want the numeric ones ordered first and `default` last, if it appears? – R18 Apr 12 '17 at 13:38
  • @R18: I'm undecided on what to do with the non-numeric values, but I want the numeric levels to be in order, i.e. not `16 32 4 8`, but `4 8 16 32`. – U. Windl Apr 12 '17 at 13:49
  • What about replacing `default` with 0? This may avoid your problem. – R18 Apr 12 '17 at 14:22
  • @R18: Kind of, but it's more complicated: Depending on the value of another variable (not present in the example), the value `default` may represent different numbers. – U. Windl Apr 12 '17 at 14:28

2 Answers


Clunky but:

factor(ss$b,levels=sort(unique(as.numeric(as.character(ss$b)))))

Or perhaps more directly

ss <- droplevels(subset(df, b != "default"))
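As a quick check of what these two expressions do to the level order, here is a minimal sketch on the data from the question. The column `b` is created as a factor explicitly (an assumption made so that the sketch does not depend on the `stringsAsFactors` default of the R version in use), and the variable names `lev_dropped`, `codes`, `values` and `lev_sorted` are only illustrative.

```r
# Data from the question; b is made a factor explicitly.
df <- data.frame(a = c(1, 3, 5, 7, 2),
                 b = factor(c(4, 8, 16, 32, "default")))
ss <- subset(df, b != "default")

# droplevels() removes the stale "default" level but keeps the
# existing (string-sorted) order of the remaining levels.
lev_dropped <- levels(droplevels(ss)$b)

# as.numeric() applied directly to a factor returns the internal codes ...
codes <- as.numeric(ss$b)
# ... while going through as.character() recovers the values themselves.
values <- as.numeric(as.character(ss$b))

# Numerically sorted unique values used as the new levels.
lev_sorted <- levels(factor(ss$b, levels = sort(unique(values))))

print(lev_dropped)
print(codes)
print(lev_sorted)
```

So `droplevels()` alone is enough to get rid of unused levels, but an explicit `levels=` argument is still needed if the numeric order matters.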

However, I question your assertion that

"When building the initial factor, the levels are kept in order (as found, which is sorted)."

Seems to me they get sorted alphabetically?
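A small sketch of why that happens: when no `levels` argument is given, `factor()` uses the sorted unique values, and for character data the sort is a string sort. Passing `levels` explicitly keeps whatever order is supplied. The vector `x` and the two result names are made up for illustration.

```r
x <- c("4", "8", "16", "32", "default")

# Default: levels are the sorted unique values; for strings, "16" < "4".
default_levels <- levels(factor(x))

# Explicit levels keep the supplied order, here the order of appearance.
appearance_levels <- levels(factor(x, levels = unique(x)))

print(default_levels)
print(appearance_levels)
```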

Ben Bolker
  • You may be right to question my statement about sorting: probably (I have been hacking on the script for two days) the initial version had no factors, but plain strings. And I found a solution similar to your proposal. – U. Windl Apr 12 '17 at 14:06
  • I'd like to point out the importance of `as.character()` in this solution: Without it, the result is `Levels: 1 2 3 4`, but with it, it's `Levels: 4 8 16 32`. – U. Windl Apr 12 '17 at 14:19
  • For the proposal `ss <- droplevels(subset(df, b != "default"))`: The levels are not sorted after that: `ss$b` shows `Levels: 16 32 4 8` – U. Windl Apr 12 '17 at 14:23

One real subset (the original data was too much to paste here) had a factor like this initially (including stale levels):

Levels: 0 128 16 256 32 4 512 64 8 deadline noop

Recomputing the factor (factor(ss$tune.val)), the levels were:

Levels: 128 16 256 32 4 512 64 8

This expression brought the desired result, but it looks a bit complicated to me:

factor(ss$tune.val, levels=sort(as.numeric(levels(factor(ss$tune.val)))))

(...)

Levels: 4 8 16 32 64 128 256 512

Probably unique(...) is better than using levels(factor(...)).
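A possible way to wrap this up, using `unique()` as suggested, is sketched below. The function name `relevel_numeric` is hypothetical, and placing non-numeric levels (such as `deadline` or `noop`) after the numeric ones is an assumption; the question leaves open what should happen to them.

```r
# Hypothetical helper: rebuild a factor so that numeric-looking levels come
# first in numeric order, followed by any non-numeric levels (string-sorted).
# Levels that no longer occur in the data are dropped as a side effect.
relevel_numeric <- function(f) {
  vals <- unique(as.character(f))
  vals <- vals[!is.na(vals)]
  # Non-numeric strings become NA here; the coercion warning is expected.
  num <- suppressWarnings(as.numeric(vals))
  is_num <- !is.na(num)
  lev <- c(vals[is_num][order(num[is_num])], sort(vals[!is_num]))
  factor(as.character(f), levels = lev)
}

x <- factor(c("4", "8", "16", "32", "default", "128"))
all_levels <- levels(relevel_numeric(x))
numeric_only <- levels(relevel_numeric(x[x != "default"]))

print(all_levels)
print(numeric_only)
```

The original strings are kept as level labels instead of the converted numbers, so the values in the factor always match one of the levels.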

U. Windl