
I should know this, but I don't. And that's because factors in R can be an absolute nightmare. This is a follow-up to my previous question. I'm hoping a few of you might be able to explain, in a bit more detail than the R manuals, how to preserve the column attributes when passing a data frame to a custom function. So far, the most useful information I've dug up was from Hadley's Advanced R Programming site. But that section is quite short. Here's what I have:


Edits: I've added the source code to my GitHub (EDIT: link goes to gsub.dataframe.R now). Also, I think I may have a good way to determine whether to set stringsAsFactors = FALSE in the new data frame. Or, as a much easier alternative, I could add a stringsAsFactors argument. Is it possible to use ... for more than one set of further arguments? For example, having ... supply the further arguments to both grep and data.frame?
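One way the ... question could be approached is to split the captured arguments by name and hand each subset to its own function. The sketch below is only an illustration: `grep_then_frame` is a hypothetical wrapper, not part of grep.dataframe, and matching on `names(formals(...))` assumes every extra argument is passed by its full name.

```r
# Hypothetical wrapper: route named arguments in ... to grep() or data.frame()
grep_then_frame <- function(pattern, x, ...) {
  dots <- list(...)

  # Argument names each target accepts, minus the ones set explicitly below
  grep_names <- setdiff(names(formals(grep)), c("pattern", "x", "value"))
  df_names   <- setdiff(names(formals(data.frame)), "...")

  grep_args <- dots[names(dots) %in% grep_names]
  df_args   <- dots[names(dots) %in% df_names]

  # do.call() builds each call from the base arguments plus its share of ...
  hits <- do.call(grep, c(list(pattern = pattern, x = x, value = TRUE), grep_args))
  do.call(data.frame, c(list(hits = hits), df_args))
}

res <- grep_then_frame("a", c("Apple", "banana", "cherry"),
                       ignore.case = TRUE,        # picked up by grep()
                       stringsAsFactors = FALSE)  # picked up by data.frame()
res
```

An often simpler design is to keep ... for one function and accept a separate list argument for the other, which avoids guessing where an argument belongs.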


Set up some data

set.seed(24)
num <- rep(1, 10); int <- 1:10; fac <- sample(LETTERS[1:3], 10, TRUE)
D <- data.frame(num, int, fac); D$char <- as.character(letters[1:10])

Here's a call to the custom function, and the result.

(newD <- grep.dataframe("6|(a|f)", D, sub = "XXX", ignore.case = TRUE))
#    num int fac char
# 1    1   1 XXX  XXX
# 2    1   2   B    b
# 3    1   3   C    c
# 4    1   4 XXX    d
# 5    1   5 XXX    e
# 6    1 XXX   C  XXX
# 7    1   7 XXX    g
# 8    1   8   B    h
# 9    1   9   B    i
# 10   1  10 XXX    j

The function currently does nothing to preserve the columns' attributes, though I have tried everything I can think of to keep as much information about the columns as I can (e.g. class(x) <-, attr(x, "name") <-, attributes(x) <-, I(x), etc.). The values in the result above are absolutely correct as printed. However, the structure of the result is troubling. I could use a little help with getting the final data structure to match the original data structure. I'm thinking a switch statement might do the trick?

Note that

> args(grep.dataframe)
function (pattern, X, sub = NULL, ...) 
NULL

with the sub argument calling gsub when it is not NULL.

As always, I appreciate the help.


Note: I took the advice of Hadley (why wouldn't you?) and split this into two functions. My answer below is a new function that only calls gsub for regular expression matching.

Rich Scriven
  • Do you expect the class of column `int` to remain integer despite inserting `XXX`? (Maybe you expect that `XXX` to be coerced to `NA`?) – jbaums May 26 '14 at 06:17
  • Good question. I haven't thought that through fully, but I'd probably want that one coerced to `character` in case the data frame had a column of mixed alphanumerics (i.e. 24EX6, or something similar). Perhaps `NA` would be better? – Rich Scriven May 26 '14 at 06:18
  • Another main thing is...what the heck happened to the `character` column? – Rich Scriven May 26 '14 at 06:23
  • The behaviour is expected since `gsub` returns character vectors, and `data.frame` by default coerces strings to factors. – jbaums May 26 '14 at 07:27
  • Take a look at [this related solution](http://stackoverflow.com/a/9215233/489704). You could replace your `dc <- data.frame(ap)` with `dc <- colClasses(data.frame(ap), sapply(X, class))`, using `colClasses` as defined at that link. – jbaums May 26 '14 at 07:35
  • After a bit of thought, I think it would be best if a column of `typeof` "integer" or "double" only be changed if the entire column is changed by `gsub`. More `if` statements!! – Rich Scriven May 26 '14 at 07:44
  • What is the goal of your `grep.dataframe` function? Currently it seems to be doing many things and it is not clear what the purpose is. Why does ... need to go to both `grep` and `data.frame()`? Why would you want to do a regular expression replacement on a numeric column? (Also, never use `mode()`, it only exists for S+ compatibility) – hadley May 26 '14 at 21:38
  • @hadley, it's kind of an experiment actually, and so I can improve my skills. With `sub = NULL`, the function returns a named list of matches, by column, showing both the value of the match and the row number in which it was found. I added `sub` later for the option to sub the pattern. When `sub` is a pattern, the original data frame is returned with the pattern sub included. [The bottom of this question](http://stackoverflow.com/questions/23850861/r-undefined-truth-values-in-if-statements) shows a few results. – Rich Scriven May 26 '14 at 22:10
  • @hadley, I also noticed last night that it is possible to change the `class` and `storage.mode` of integers and doubles under certain conditions. Should I be messing with that? – Rich Scriven May 26 '14 at 22:20
  • @RichardScriven Sounds like you need two different functions. And no, you shouldn't need to mess with class, and definitely not `storage.mode` – hadley May 27 '14 at 07:07
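The behaviour jbaums describes in the comments can be reproduced in a few lines. This is a minimal sketch with made-up vectors; `stringsAsFactors` is set explicitly because its default was TRUE when this was asked and changed to FALSE in R 4.0.0.

```r
# gsub() coerces its input to character, so an integer vector comes back as text
int <- 1:3
out <- gsub("2", "XXX", int)
class(out)

# data.frame() turns strings into factors only when stringsAsFactors = TRUE
chars <- gsub("a", "XXX", c("a", "b"))
d_factor <- data.frame(chars, stringsAsFactors = TRUE)   # the pre-4.0.0 default
d_char   <- data.frame(chars, stringsAsFactors = FALSE)

class(d_factor$chars)
class(d_char$chars)
```

So any column passed through gsub loses its original class, and a character column additionally becomes a factor when the data frame is rebuilt with the old default.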

1 Answer


Column class problem was solved with this little dandy of a function that re-assigns the classes based on the originals.

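.reClass <- function(x, type)
{
    # coerce x back to the class named by `type`
    switch(type,
           character = as.character(x),
           integer = as.integer(x),
           factor = as.factor(x),
           numeric = as.numeric(x),
           x)  # any other class: return x unchanged instead of NULL
}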
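As a rough illustration of how a helper like this might be applied column by column, here is a sketch. `restore_classes` is a hypothetical name, not the code in gsub.dataframe.R, and a compact copy of .reClass is included only so the example runs on its own.

```r
# Compact copy of .reClass so this sketch is self-contained
.reClass <- function(x, type) {
  switch(type,
         character = as.character(x),
         integer   = as.integer(x),
         factor    = as.factor(x),
         numeric   = as.numeric(x),
         x)  # any other class: leave unchanged (an assumption of this sketch)
}

# Hypothetical helper: coerce each column of `new` to the class of the
# matching column in `original`
restore_classes <- function(new, original) {
  types <- vapply(original, function(col) class(col)[1], character(1))
  new[] <- Map(.reClass, new, types)  # new[] keeps the data frame structure
  new
}

original <- data.frame(num = c(1, 1), int = 1:2,
                       fac = factor(c("A", "B")), char = c("a", "b"),
                       stringsAsFactors = FALSE)

# Simulate what gsub() leaves behind: every column as character
as_text <- as.data.frame(lapply(original, as.character),
                         stringsAsFactors = FALSE)

restored <- restore_classes(as_text, original)
sapply(restored, class)
```

Two caveats with this approach: as.factor rebuilds the levels from the values present, so the original level set is not necessarily kept, and as.integer or as.numeric will turn a replaced value such as "XXX" into NA with a warning.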

> args(gsub.dataframe)
function (pattern, replacement, data, use.nums = FALSE, ...) 
NULL

use.nums is for "use numerics?", i.e. whether to replace a pattern on numeric columns. D is the original data being fed in to have its columns pattern-replaced (under certain conditions).

> sapply(D, class)
#        num         int         fac        char 
#  "numeric"   "integer"    "factor" "character" 
> x <- gsub.dataframe("2|A", "XXX", data = D, ignore.case = TRUE)
> x
#    num int fac char
# 1    1   1   C  XXX
# 2    1   2   B    b
# 3    1   3 XXX    c
# 4    1   4 XXX    d
# 5    1   5   C    e
# 6    1   6 XXX    f
# 7    1   7   C    g
# 8    1   8 XXX    h
# 9    1   9   B    i
# 10   1  10 XXX    j
> sapply(x, class)
#       num         int         fac        char 
# "numeric"   "integer"    "factor" "character" 
Rich Scriven