I'm trying to convert all appropriate data frame columns to factors. My criterion is defined by what is NOT a factor, identified by a list of ngrams (see the code snippet below):

data.clean[,names(data.clean)[grep("^[^time]*[^tot]*[^count]*[^score]*[^include]*[^has]*[^__fe]*$", 
            names(data.clean))]] 
<- as.factor(as.character(data.clean[,names(data.clean)[grep("^[^time]*[^tot]*[^count]*[^score]*[^include]*[^has]*[^__fe]*$", 
                                      names(data.clean))]]))

But it doesn't seem to do the trick. Any suggestions why? Thanks.

oguz ismail
user3628777
    Could you please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – sgibb May 12 '14 at 14:28

1 Answer

This:

grep("^[^time]*[^tot]*[^count]*[^score]*[^include]*[^has]*[^__fe]*$", names(data.clean))

is not doing what you think it is doing. [^time] is a negated character class, not a negated word: [^time]* matches any run of characters that contains none of 't', 'i', 'm' or 'e'. The complete expression therefore matches any name that can be split into consecutive runs, each avoiding the characters of the corresponding class. For example, abbbccdde matches that expression.
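A quick way to check this is to run the pattern from the question against a few names. This is only a sketch: the candidate names below are made up for illustration and are not taken from the asker's data.

```r
# Pattern copied verbatim from the question.
pattern <- "^[^time]*[^tot]*[^count]*[^score]*[^include]*[^has]*[^__fe]*$"

# Hypothetical column names, chosen for illustration only.
candidates <- c("abbbccdde", "time", "count")

# Each name can be split into consecutive runs that satisfy successive
# negated classes, so all three match, including "time" and "count",
# which the pattern was presumably meant to reject.
matches <- grepl(pattern, candidates)
print(setNames(matches, candidates))
```

So the pattern does not exclude the very names it lists, which would explain why the original selection does not behave as intended.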

I think what you actually want is:

grep("^(time|tot|count|score|include|has|__fe)$", names(data.clean), invert=TRUE)

This pattern matches only names that are exactly one of the specified ngrams, and invert=TRUE makes grep return the complement of the matches, i.e. the indices of all names that did not match the specified ngrams.
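To tie this back to the original goal, here is a minimal sketch of using the inverted match to convert the remaining columns to factors. The toy data frame and its column names are assumptions made up for this example. It also converts column by column with lapply, since calling as.factor on a multi-column selection, as in the question, would not produce one factor per column.

```r
# Toy data frame; all column names and values are hypothetical.
data.clean <- data.frame(
  group = c("a", "b", "a"),
  site  = c("x", "y", "x"),
  time  = c(1.5, 2.0, 3.5),
  score = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Indices of columns whose names are NOT exactly one of the ngrams.
factor.cols <- grep("^(time|tot|count|score|include|has|__fe)$",
                    names(data.clean), invert = TRUE)

# Convert each selected column separately.
data.clean[factor.cols] <- lapply(
  data.clean[factor.cols],
  function(col) as.factor(as.character(col))
)

str(data.clean)
```

If the ngrams are meant to appear anywhere inside a column name rather than to be the whole name, dropping the ^ and $ anchors would exclude any name containing one of them. That is an assumption about the intent, so check it against the real column names.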

sgauria