Combining factor levels in data frame column

Question

I have a data frame `data` with a column named "Project License", which represents a categorical variable and is thus, in R terminology, a factor. I'm trying to create a new column in which open source software licenses are combined into larger categories per my classification. However, when I try to combine (merge) levels of that factor, I end up with a column where all levels are lost, a column that is unchanged, or an error message such as the following:

Error in factor(data[["Project License"]], levels = classification, labels = c("Highly Restrictive", : invalid 'labels'; length 4 should be 1 or 6

Here's my code for this functionality (extracted from a function):

myLevels <- c('gpl', 'lgpl', 'bsd',
              'other', 'artistic', 'public')
myLabels <- c('GPL', 'LGPL', 'BSD',
              'Other', 'Artistic', 'Public')

licenses <- factor(data[["Project License"]],
                   levels = myLevels, labels = myLabels)

data[["Project License"]] <- licenses

classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))

restrictiveness <- 
  factor(data[["Project License"]],
         levels = classification,
         labels = c('Highly Restrictive', 'Restrictive',
                    'Permissive', 'Unknown'))

data[["License Restrictiveness"]] <- restrictiveness

I have also tried some other approaches (including ones described in section 8.2.5 of "R Inferno"), but without success so far.

What am I doing wrong, and how can I solve this problem? Thank you!

UPDATE (Data):

> head(data, n=20)
   Project ID Project License
1       45556            lgpl
2       41636             bsd
3       95627             gpl
4       66930             gpl
5       51103             gpl
6       65637             gpl
7       41834             gpl
8       70998             gpl
9       95064             gpl
10      48810            lgpl
11      95934             gpl
12      90909             gpl
13       6538         website
14      16439             gpl
15      41924             gpl
16      78987             gpl
17      58662            zlib
18       1904             bsd
19      93838          public
20      90047            lgpl

> str(data)
'data.frame':   45033 obs. of  2 variables:
 $ Project ID     : chr  "45556" "41636" "95627" "66930" ...
 $ Project License: chr  "lgpl" "bsd" "gpl" "gpl" ...
 - attr(*, "SQL")=Class 'base64'  chr "ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDMxNC5ncm91cHMKV0hFUkUgZ3JvdXBfaWQgPCAxMDAwMDA="
 - attr(*, "indicatorName")=Class 'base64'  chr "cHJqTGljZW5zZQ=="
 - attr(*, "resultNames")=Class 'base64'  chr "UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl"

UPDATE 2 (Data):

> unique(data[["Project License"]])
 [1] "lgpl"       "bsd"        "gpl"        "website"    "zlib"
 [6] "public"     "other"      "ibmcpl"     "rpl"        "mpl11"
[11] "mit"        "afl"        "python"     "mpl"        "apache"
[16] "osl"        "w3c"        "iosl"       "artistic"   "apsl"
[21] "ibm"        "plan9"      "php"        "qpl"        "psfl"
[26] "ncsa"       "rscpl"      "sunpublic"  "zope"       "eiffel"
[31] "nethack"    "sissl"      "none"       "opengroup"  "sleepycat"
[36] "nokia"      "attribut"   "xnet"       "eiffel2"    "wxwindows"
[41] "motosoto"   "vovida"     "jabber"     "cvw"        "historical"
[46] "nausite"    "real"

Please give an example for `data`. – Matthew Lundberg Jun 07 '14 at 14:43
What does `unique(data[["Project License"]])` look like? – MrFlick Jun 07 '14 at 14:46
@MatthewLundberg: Please see UPDATE (Data). – Aleksandr Blekh Jun 07 '14 at 14:50
@MrFlick: Please see UPDATE 2 (Data). – Aleksandr Blekh Jun 07 '14 at 14:52

Matthew Lundberg · Accepted Answer · 2014-06-07T15:32:16.977

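The problem is that `labels` must either have the same length as `levels` or have length 1. Here `classification` has six elements (`c()` flattens the nested vectors into one), but only four labels are supplied, hence "length 4 should be 1 or 6".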

From ?factor:

labels  
  either an optional character vector of labels for the levels (in the same order as
  levels after removing those in exclude), or a character string of length 1.

You need to make these agree. The names in `classification` are not a hint to `factor` to combine the labels.

For example:

factor(..., levels=classification, labels=c('Highly Restrictive',
                                            'Restrictive.1',
                                            'Restrictive.2',
                                            'Permissive.1',
                                            'Permissive.2',
                                            'Unknown'))
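
To see where the six levels come from, it may help to inspect what `c()` does with the nested vectors. A minimal sketch using the `classification` vector from the question:

```r
# c() flattens its arguments into a single character vector; the names of
# multi-element arguments get a numeric suffix rather than grouping values.
classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))

length(classification)   # 6
names(classification)
# "highly" "restrictive1" "restrictive2" "permissive1" "permissive2" "unknown"
```

So `factor` only ever sees six separate level values; the grouping implied by the names is not used.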

To map the factor to another with fewer levels, you can index a vector by name. Turning the classification vector around as a lookup:

 classification <- c(gpl='Highly Restrictive',
                     lgpl='Restrictive', 
                     public='Restrictive',
                     bsd='Permissive',
                     artistic='Permissive',
                     other='Unknown')

To use this as a lookup table:

data[["License Restrictiveness"]] <- 
    as.factor(classification[as.character(data[['Project License']])])

head(data)
##   Project ID Project License License Restrictiveness
## 1      45556            lgpl             Restrictive
## 2      41636             bsd              Permissive
## 3      95627             gpl      Highly Restrictive
## 4      66930             gpl      Highly Restrictive
## 5      51103             gpl      Highly Restrictive
## 6      65637             gpl      Highly Restrictive
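
If repeating the category label for every license is undesirable, one alternative is the list form of `levels<-`, documented in `?levels`: each list name is a new level, and its elements are the old levels to merge into it. This is a sketch rather than part of the lookup approach above. The small `licenses` vector is made up for illustration, and a character column would need `factor()` applied first.

```r
# Hypothetical sample of license codes, for illustration only
licenses <- factor(c('lgpl', 'bsd', 'gpl', 'gpl', 'public', 'zlib', 'other'))

restrictiveness <- licenses
# Named list: new level name = old levels to merge into it
levels(restrictiveness) <- list(
  'Highly Restrictive' = 'gpl',
  'Restrictive'        = c('lgpl', 'public'),
  'Permissive'         = c('bsd', 'artistic'),
  'Unknown'            = 'other'
)

levels(restrictiveness)
table(restrictiveness, useNA = 'ifany')
```

Old levels that are not listed (here `zlib`) become `NA`, so licenses outside the classification need either their own list entry or separate handling.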

Thank you for the answer! I understood the error message. But your solution is not what I want. I'd like to reduce number of factor levels (new factor is OK) to reflect my classification. From what I read, it looks possible, but... — Aleksandr Blekh, Jun 07 '14 at 14:56
Just saw your updated answer. Obviously, my comment was related to the first part only. The second part looks like a nice solution to the problem - will give it a try and report (accept). But, I'm still curious about why other documented solutions (like one from "Inferno") haven't worked for me. Regardless, thank you for your help, again! — Aleksandr Blekh, Jun 07 '14 at 16:43
Sorry, forgot to ask: Could your suggested use of `classification` as a lookup table be transformed in a way that would allow me not to repeat duplicate license values (something like `classification <- c('Highly Restrictive'=c(...), 'Restrictive'=c(...), ...)`? — Aleksandr Blekh, Jun 07 '14 at 16:50
Do you think that `projectLicense()` transformation function, which is based on your suggestion, is the optimal code for a table with almost 2M data records. Also, if you don't mind... The code (in this module) somehow loses my user-defined attributes despite explicit attempt to save them via defining a subclass (`avector`). I would appreciate, if you could take a look and share what you think is wrong with the code: https://github.com/abnova/diss-floss/blob/master/prepare/transform.R. I have a feeling that it's because I manipulate data not as data frame in transformation functions. — Aleksandr Blekh, Jun 19 '14 at 06:29

Karsten W. · Answer 2 · 2014-06-07T23:52:55.150

Maybe your task becomes easier if you convert to character first, for instance (untested):

license.map <- c(lgpl="Permissive", bsd="Permissive", 
                 gpl="Restrictive", website="Unknown") # etc.
dat <- transform(dat, LicenseType=license.map[Project.License])

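Since `stringsAsFactors` is TRUE by default in R versions before 4.0.0, the new column is a factor. In R 4.0.0 and later the default is FALSE, so the column stays character unless it is wrapped in `factor()`.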
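
A runnable sketch of the same idea, under some assumptions: `dat` here is a made-up four-row data frame whose column is named `Project.License` as in the snippet above (not the "Project License" column with a space from the question), and the mapping is the illustrative one from this answer, not the asker's classification. Wrapping the result in `factor()` makes the outcome independent of the `stringsAsFactors` default.

```r
# Hypothetical miniature data frame, for illustration only
dat <- data.frame(Project.License = c('lgpl', 'bsd', 'gpl', 'website'),
                  stringsAsFactors = FALSE)

license.map <- c(lgpl = 'Permissive', bsd = 'Permissive',
                 gpl = 'Restrictive', website = 'Unknown')

# Index the named vector by license code, drop the names, and convert
# explicitly so the new column is a factor on any R version.
dat <- transform(dat,
                 LicenseType = factor(unname(license.map[Project.License])))

str(dat)
```

If the column is already a factor, index with `as.character(Project.License)` instead, since indexing by a factor uses its integer codes rather than its labels.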

edited Jun 07 '14 at 23:52

answered Jun 07 '14 at 15:32


Appreciate your answer! It looks promising - I will definitely think about this approach. However, the column is of `character` class already. You mean to convert again just for the sake of auto-coercion to `factor` class? – Aleksandr Blekh Jun 07 '14 at 15:58
You are right: since the column is already character, there is no need to convert it. Edited the answer. – Karsten W. Jun 07 '14 at 23:54
Just realized that your original coercion to `character` class is needed, if the column has been previously converted to `factor` (and it has). No need to edit the answer again - I just wanted to clarify this aspect. I've implemented the other answer (by @Matthew Lundberg) and will accept it as the answer (as it appeared earlier, plus, subsetting is recommended as a better approach than interactive-focused `transform`). However, your solution is very similar in essence and I appreciate you taking time to help me. – Aleksandr Blekh Jun 08 '14 at 13:54
